tcezard opened 6 years ago
If we do this, we would need a way of making file checksums deterministic regardless of run location. This might mean doing something like `cat`-ing the file, stripping any lines that match a blacklist of patterns, and piping the result into `md5`, e.g.:
```
vcf:
  '^##GATKCommandLine.+$'
  '^##reference.+$'
bam:
  '^@PG +ID:GATK IndelRealigner.+$'
samtools_stats:
  '^# The command line was:.+$'
```
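A minimal sketch of that filtering step in Python (the `BLACKLIST` structure and `filtered_md5` name are assumptions for illustration, not existing code):

```python
import hashlib
import re

# Patterns per file type (from the list above); lines matching any of
# them are dropped before hashing, so run-specific metadata such as
# command lines and reference paths cannot change the checksum.
BLACKLIST = {
    'vcf': [r'^##GATKCommandLine.+$', r'^##reference.+$'],
    'bam': [r'^@PG +ID:GATK IndelRealigner.+$'],
    'samtools_stats': [r'^# The command line was:.+$'],
}


def filtered_md5(lines, file_type):
    """Return the md5 of the given text lines, skipping blacklisted ones."""
    patterns = [re.compile(p) for p in BLACKLIST.get(file_type, [])]
    digest = hashlib.md5()
    for line in lines:
        if any(p.match(line) for p in patterns):
            continue
        digest.update(line.encode())
    return digest.hexdigest()
```

With this, two VCFs that differ only in their `##GATKCommandLine` header produce the same checksum. For BAM the filtering would have to run on the decompressed text stream (e.g. `samtools view -h`), since the binary file itself cannot be line-filtered.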
The md5s and demultiplexing metrics for the full-size dataset can be found in logs/integration_tests/2017-11-21_11\:48\:23.log
Looking into GitHub's policies, we should be able to upload the stripped-down dataset, and maybe the full-size one as well, since it would not breach GitHub policy; see https://help.github.com/articles/working-with-large-files/ and https://help.github.com/articles/conditions-for-large-files/. This would potentially allow us to version the md5/QC associated with the results alongside.