cfe-lab / Kive

Archival and automation of bioinformatic pipelines and data
https://cfe-lab.github.io/Kive
BSD 3-Clause "New" or "Revised" License

Expiry date / label to keep output files of certain jobs #1870

Open CBeelen opened 1 year ago

CBeelen commented 1 year ago

In MiCall, we use Kive to run different pipelines, and feed the output files of some of them to others as input files. Specifically, we first run the main and de novo pipelines, and when they are finished, subsequent runs of the resistance and proviral pipelines can be triggered. These pipelines use some of the outputs of the main and de novo pipelines as input files.

Usually, one run ends shortly before the next one starts, so the previous run's results have not been cleaned up yet when the subsequent run starts and looks for its input files. However, the resistance pipeline may need input from two separate samples. Sometimes, one of the samples takes much longer to finish running, and by the time it finishes, the results of the other sample have already been cleaned up. This causes an error; see cfe-lab/MiCall#921.

A possible solution would be to introduce an optional expiry date / label on jobs, so that their results are kept for as long as we need them. While the related job is still running, it can periodically push this expiry date further into the future (or refresh the label) to keep the output data. That way, the results stay available while they are needed, but they won't stick around forever if the related job dies, because the expiry date is then no longer renewed and normal cleanup takes over.
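
To make the proposal concrete, here is a minimal sketch of the renewable expiry date as an in-memory model. It is not Kive's actual code or API: `RunOutputs`, `OutputStore`, `keep_until`, `renew`, `purge`, and the default retention period are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RunOutputs:
    """Hypothetical stand-in for the output files of one finished run."""
    run_id: int
    finished_at: datetime
    keep_until: Optional[datetime] = None  # the proposed expiry date


@dataclass
class OutputStore:
    """Hypothetical model of the cleanup logic, not Kive's real purge code."""
    default_retention: timedelta = timedelta(days=1)  # assumed value
    runs: dict = field(default_factory=dict)

    def add(self, run_id: int, finished_at: datetime) -> None:
        self.runs[run_id] = RunOutputs(run_id, finished_at)

    def renew(self, run_id: int, now: datetime, lease: timedelta) -> None:
        """Called periodically by the job that still needs these outputs."""
        run = self.runs[run_id]
        new_expiry = now + lease
        # Only ever push the expiry date forward, never pull it back.
        if run.keep_until is None or new_expiry > run.keep_until:
            run.keep_until = new_expiry

    def purge(self, now: datetime) -> list:
        """Delete outputs past both the default retention and their expiry."""
        expired = []
        for run_id, run in list(self.runs.items()):
            cutoff = run.finished_at + self.default_retention
            if run.keep_until is not None:
                cutoff = max(cutoff, run.keep_until)
            if now >= cutoff:
                expired.append(run_id)
                del self.runs[run_id]
        return expired
```

In this sketch, the waiting resistance job would call `renew` on the outputs of the sample that finished first, at an interval shorter than the lease. If that job dies, the calls stop, `keep_until` eventually passes, and the next `purge` removes the outputs as usual.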