cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0

Ensure that collated subdirectories are archived deterministically #1150

Open Donaim opened 2 months ago

Donaim commented 2 months ago

Currently, even if no output file of the main pipeline changes between reruns, the archives we create for the coverage maps still differ in their binary contents.

This is likely because our archival program, tar, captures file timestamps (ctime, mtime, atime) and stores them in the archive.

As a result, Kive always reports a change between reruns, even when they produced identical results.

To fix this, we should make sure that the archives tar produces do not depend on timestamps.
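As a rough illustration of what time-independent archiving could look like, here is a minimal sketch using Python's standard `tarfile` module. The function name `make_deterministic_tar` and the helper `_normalize` are hypothetical and not part of MiCall; the sketch assumes that fixing the mtime, clearing owner information, and adding entries in sorted order is enough to make the output reproducible. It only archives regular files (empty directories are skipped) and leaves file modes untouched.

```python
import os
import tarfile


def _normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    # Hypothetical helper: drop metadata that varies between reruns.
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def make_deterministic_tar(source_dir: str, archive_path: str) -> None:
    # Hypothetical function: write an uncompressed tar whose bytes depend
    # only on file names, modes, and contents, not on timestamps.
    with tarfile.open(archive_path, "w", format=tarfile.GNU_FORMAT) as tar:
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()  # sort in place so os.walk descends in a stable order
            for name in sorted(files):
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, source_dir)
                tar.add(full_path,
                        arcname=arcname,
                        recursive=False,
                        filter=_normalize)
```

If the archive is also gzip-compressed, the gzip header carries its own timestamp, so that would need to be fixed as well (for example with `gzip.GzipFile(..., mtime=0)`).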

donkirkby commented 2 months ago

Another option is to deploy the next release of Kive, so we can have output directories, instead of only output files. That way, we won't need any tar or zip files.

I can't remember if we also had a problem with the PDF files' time stamps, or if we used a constant timestamp to avoid the problem.

Donaim commented 2 months ago

@donkirkby yes, PDF files have the same issue. See container_runs/1408362/