EdinburghGenomics / Analysis-Driver

Pipelines for Illumina HiSeqX demultiplexing, sequence QC and variant calling.

Integration test: Create new repo for integration test datasets. #294

Open tcezard opened 6 years ago

tcezard commented 6 years ago

Looking into GitHub's policies, we should be able to upload the stripped-down dataset, and maybe the full-size one, as it would not breach GitHub policy. See https://help.github.com/articles/working-with-large-files/ and https://help.github.com/articles/conditions-for-large-files/ This would potentially allow us to version the md5/QC values associated with the results alongside the datasets.

mwhamgenomics commented 6 years ago

If we do this, we would need to find a way of making file checksums deterministic regardless of run location. This might mean doing something like cat-ing the file and piping it into md5 through a filter that drops any line matching a blacklist consisting of, e.g.:

vcf:
    '^##GATKCommandLine.+$'
    '^##reference.+$'
bam:
    '^@PG +ID:GATK IndelRealigner.+$'
samtools_stats:
    '^# The command line was:.+$'
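
A minimal sketch of how that filtering could work, assuming the files are read as text lines. The names `BLACKLIST` and `filtered_md5` are made up for illustration and are not part of Analysis-Driver; the patterns are the ones listed above.

```python
import hashlib
import re

# Patterns copied from the blacklist above, keyed by file type.
# The dict name and layout are assumptions for this sketch.
BLACKLIST = {
    'vcf': [r'^##GATKCommandLine.+$', r'^##reference.+$'],
    'bam': [r'^@PG +ID:GATK IndelRealigner.+$'],
    'samtools_stats': [r'^# The command line was:.+$'],
}


def filtered_md5(lines, file_type):
    """Return the md5 hex digest of `lines`, skipping blacklisted lines.

    `lines` is any iterable of text lines (e.g. an open file handle).
    Lines matching a pattern for `file_type` are left out of the hash, so
    run-specific headers (paths, command lines) do not change the checksum.
    """
    patterns = [re.compile(p) for p in BLACKLIST.get(file_type, [])]
    md5 = hashlib.md5()
    for line in lines:
        stripped = line.rstrip('\n')
        if any(p.match(stripped) for p in patterns):
            continue  # run-location-dependent line: ignore it
        md5.update(line.encode('utf-8'))
    return md5.hexdigest()


if __name__ == '__main__':
    # Two VCF headers that differ only in a blacklisted line.
    run_a = ['##fileformat=VCFv4.2\n', '##reference=file:///site_a/ref.fa\n']
    run_b = ['##fileformat=VCFv4.2\n', '##reference=file:///site_b/ref.fa\n']
    print(filtered_md5(run_a, 'vcf') == filtered_md5(run_b, 'vcf'))
```

For BAM files this assumes the content has first been decoded to SAM text (for example with `samtools view -h`), since the `@PG` pattern only makes sense against the text header.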

tcezard commented 6 years ago

The md5 checksums and demultiplexing metrics for the full-size dataset can be found in logs/integration_tests/2017-11-21_11\:48\:23.log