broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

DownsampleSam discards NM tag #8558

Open bw2 opened 1 year ago

bw2 commented 1 year ago

Affected tool(s) or class(es)

gatk DownsampleSam

Affected version(s)

GATK v4.3.0.0

Description

The input CRAM file (gs://broad-public-datasets/CHM1_CHM13_WGS2/CHM1_CHM13_WGS2.cram) has NM tags, but the downsampled output file no longer has them. My command line is:

gatk DownsampleSam REFERENCE_SEQUENCE=/hg38.fa I=CHM1_CHM13_WGS2.cram P=0.5 CREATE_INDEX=true O=CHM1_CHM13_WGS2.downsampled.bam 

Some downstream tools require NM tags, so I have to run

samtools calmd CHM1_CHM13_WGS2.downsampled.bam /hg38.fa

to re-add them.

cmnbroad commented 1 year ago

@bw2 For better or worse, htsjdk tries to maintain round-trip fidelity for CRAMs. I took a look at the first few slices of the CRAM referenced above, and it does not appear to contain NM or MD tags. Can you let me know how you concluded that it does?
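
One way to check for the tags is to count how many SAM records carry NM in the decoded text. The helper below is a minimal sketch, not part of GATK or samtools: `count_tag` is a hypothetical name, and it assumes tab-separated SAM text on stdin with optional fields starting at column 12.

```shell
# Hypothetical helper: count SAM records on stdin that carry a given
# optional tag (default: NM). Assumes tab-separated SAM text.
count_tag() {
    tag="${1:-NM}"
    awk -F '\t' -v tag="$tag" '
        /^@/ { next }                        # skip header lines
        {
            total++
            # optional TAG:TYPE:VALUE fields start at column 12
            for (i = 12; i <= NF; i++) {
                if (substr($i, 1, 3) == tag ":") { tagged++; break }
            }
        }
        END { printf "%d of %d records have %s\n", tagged, total, tag }
    '
}
```

It could be fed from the files in this issue with something like `samtools view -T /hg38.fa CHM1_CHM13_WGS2.cram | head -n 10000 | count_tag NM`. Note that, as far as I understand the samtools documentation, samtools generates missing MD and NM tags on the fly when decoding CRAM unless `--input-fmt-option decode_md=0` is passed, so a count taken without that option may not reflect what is actually stored in the file.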