chime-experiment / ch_pipeline

CHIME Pipeline

Don't use tags to identify processed data #21

Open tristpinsm opened 5 years ago

tristpinsm commented 5 years ago

At the moment, processing tasks are divided up into individual jobs, each assigned a tag that is meant to uniquely identify the subset of the data the job processes. The ProcessedType subclass determines all tags available for processing, and once a tag is present in the output directory it will not be run again.
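As a rough illustration of the scheme described above (not the actual ch_pipeline code), the check amounts to comparing the full list of tags against the directory names in the output directory. The function name `pending_tags` and the tag names are hypothetical, made up for this sketch:

```python
import tempfile
from pathlib import Path


def pending_tags(all_tags, output_dir):
    """Return the tags that have no matching subdirectory in output_dir.

    Hypothetical helper: only the directory *names* are inspected, so a
    tag whose run produced an incomplete product still counts as done.
    """
    done = {p.name for p in Path(output_dir).iterdir() if p.is_dir()}
    return [tag for tag in all_tags if tag not in done]


# Toy demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "batch_000").mkdir()  # pretend this tag has already run
    remaining = pending_tags(["batch_000", "batch_001"], tmp)
```

Because the presence of the directory is the only signal, nothing in this check can tell a complete run from a partial one.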

A potential problem with this scheme is how to handle data that becomes available for processing within the time frame of a revision. This can happen, for example, with the holography transit processing. Processing a single transit is pretty quick, so it makes sense to run a batch of them through a single pipeline script. The way I've implemented this is to divide all transits into smaller groups and run each group as one pipeline job, identified by a tag. If some of the files in a group are not available on cedar at the time its tag is run, the job will create an incomplete product. There is no way to track this, and that tag will not be run again even if the data becomes available later. Similarly, running the pipeline in near real time will be difficult, because the set of tags would need to change constantly as new data arrives.

All this is made more complicated if we include software version tracking. Software versions are tracked per tag, with each tag corresponding to one pipeline run. If we were to allow tags to be labelled incomplete and resumed later, we would need to track versions for multiple runs within one tag.

I think the solution might be to decouple the configuration/version tracking from the task of identifying data that hasn't been processed yet. We could keep the current system of running each task in its own output directory and storing there a record of the config and versions that produced it, but make it possible for the subclasses to identify data to be processed by scraping the contents of the output directories, not just the directory names. That would produce a config to be run (a tag, or maybe call it a run or batch). It should be possible to reproduce what exists for the daily processing with such a scheme, but for the holography transits or other types it would be much more flexible.
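A minimal sketch of what the content-based approach could look like, assuming each run directory holds a small JSON record of the inputs it processed and the versions used. The file name `manifest.json`, its fields, and all function and transit names here are assumptions for illustration, not anything that exists in ch_pipeline:

```python
import json
import tempfile
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical per-run record file


def write_run(output_dir, run_name, inputs, versions):
    """Create a run directory and record what it processed (assumed layout)."""
    run_dir = Path(output_dir) / run_name
    run_dir.mkdir(parents=True)
    record = {"inputs": sorted(inputs), "versions": versions}
    (run_dir / MANIFEST).write_text(json.dumps(record))


def processed_items(output_dir):
    """Collect every input listed in any run's record under output_dir."""
    done = set()
    for manifest in Path(output_dir).glob(f"*/{MANIFEST}"):
        done.update(json.loads(manifest.read_text())["inputs"])
    return done


def next_batch(available, output_dir, batch_size=10):
    """Return up to batch_size available inputs that no run has recorded."""
    todo = sorted(set(available) - processed_items(output_dir))
    return todo[:batch_size]


# Toy demonstration: transit_c only becomes available after the first run.
versions = {"ch_pipeline": "placeholder-version"}  # placeholder value
with tempfile.TemporaryDirectory() as tmp:
    write_run(tmp, "run_000", ["transit_a", "transit_b"], versions)
    available = ["transit_a", "transit_b", "transit_c"]
    batch = next_batch(available, tmp)        # picks up the late arrival
    write_run(tmp, "run_001", batch, versions)
    batch_after = next_batch(available, tmp)  # nothing left to process
```

Each run still has its own directory with one config/version record, so version tracking stays per run, while the decision about what to process next depends on what the records say was actually processed.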

tristpinsm commented 5 years ago

@jrs65 If this sounds sensible, I could try to implement it when I finish up the holography processing type.