broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Maintain reference md5s from bam to gvcf #5746

Open EvanTheB opened 5 years ago

EvanTheB commented 5 years ago

GATK at some point started printing the reference contig md5s in the bam header. This is great.

@SQ     SN:chr1 LN:248956422    AS:38   M5:6aef897c3d6ff0c78aff06ac189178dd     UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta     SP:Homo sapiens

However, when I use HaplotypeCaller to create my gvcf, only the reference name and length are shown; the md5 is dropped. I don't know if this is for a technical reason, but it would be great to add the md5s to the gvcf header.

##contig=<ID=chr1,length=248956422,assembly=38>

If they cannot be added to the contig lines, maybe they could be added as comment lines in the gvcf header.
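
To illustrate the request, here is a minimal Python sketch of carrying the checksum across: it converts one `@SQ` line into a `##contig` line that keeps the md5. The function name `sq_to_contig_line` and the tag-to-key mapping are assumptions made for this example, not GATK or htsjdk code. The `md5` key is taken from the example header in the VCF specification, and the input is assumed to be tab-separated as in a real BAM header.

```python
def sq_to_contig_line(sq_line: str) -> str:
    """Convert one SAM @SQ header line into a VCF ##contig header line."""
    fields = sq_line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ line")
    # SAM header fields are TAG:VALUE; split on the first colon only,
    # because values such as UR can contain further colons.
    tags = dict(field.split(":", 1) for field in fields[1:])
    # Hypothetical mapping from SAM tags to VCF contig keys (an assumption).
    mapping = [("SN", "ID"), ("LN", "length"), ("AS", "assembly"), ("M5", "md5")]
    parts = [f"{vcf_key}={tags[sam_tag]}" for sam_tag, vcf_key in mapping if sam_tag in tags]
    return "##contig=<" + ",".join(parts) + ">"


example = "\t".join([
    "@SQ",
    "SN:chr1",
    "LN:248956422",
    "AS:38",
    "M5:6aef897c3d6ff0c78aff06ac189178dd",
])
print(sq_to_contig_line(example))
# prints ##contig=<ID=chr1,length=248956422,assembly=38,md5=6aef897c3d6ff0c78aff06ac189178dd>
```

Tags missing from the `@SQ` line are simply skipped, so a header without `M5` would still produce the contig line that HaplotypeCaller writes today.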

cmnbroad commented 5 years ago

Needs verification, but this is very likely due to https://github.com/samtools/htsjdk/issues/730. A fix was proposed in https://github.com/samtools/htsjdk/pull/835, which was never merged, but the problem could be fixed independently.