broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 591 forks source link

PrintReads introduces N bases when encoding some CRAMs and changes sequence #8768

Closed ilyasoifer closed 4 months ago

ilyasoifer commented 7 months ago

Bug Report

Affected tool(s) or class(es)

gatk PrintReads, picard MarkDuplicates

Affected version(s)

Description

I apologize in advance if this issue belongs to htsjdk. When we work with some of the CRAMs and pass them through PrintReads or picard MarkDuplicates, "N" bases get introduced.

We think that the problem happens when PrintReads write the CRAM rather than reading it, because if the output of PrintReads is a BAM, it does not happen.

We also noticed that this issue does not happen with earlier GATK (4.2.6.1), HTSJDK 2.24.1.

Happy to share the input files

Steps to reproduce

gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.cram -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta
gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.bam -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta
samtools view ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226   0   MT  3470    60  54M *   0   0   GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT  DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD  AS:i:54 X1:i:0  XD:Z:CATGAGG_GGTGATC    a3:i:92 bi:Z:5346565226 f1:Z:CATGAGG    f2:Z:GGTGATC    i1:Z:Q44    i2:Z:Q27    pr:i:22 pt:i:15 px:i:3813   py:i:1262   rq:f:0.03   si:i:3750   tm:Z:AQtq:i:195 MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0  NM:i:54 RG:Z:Z0011

samtools view ultMerge.mt.gatk_printreads.bam | grep "038958_1-Z0011-5346565226"

038958_1-Z0011-5346565226   0   MT  3470    60  54M *   0   0   TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA  DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD  X1:i:0  f1:Z:CATGAGG    i1:Z:Q44    f2:Z:GGTGATC    i2:Z:Q27    a3:i:92 XD:Z:CATGAGG_GGTGATC    RG:Z:Z0011  AS:i:54 bi:Z:5346565226 si:i:3750   tm:Z:AQ rq:f:0.03   tq:i:195    pr:i:22pt:i:15  px:i:3813   py:i:1262

Expected behavior

BAM and CRAM outputs should behave the same

Actual behavior

BAM and CRAM outputs behave differently

gokalpcelik commented 7 months ago

These reads don't even look the same ?

GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT
TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA
ilyasoifer commented 7 months ago

That is true! Should I update the title of the issue to reflect this?

gokalpcelik commented 7 months ago

Can you check the REF_CACHE? Or alternatively you may provide samtools the reference fasta during cram view.

ilyasoifer commented 7 months ago

Thanks @gokalpcelik! Seems that -T does not affect much:

samtools view -T /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226   0   MT  3470    60  54M *   0   0   GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT  DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD  AS:i:54 X1:i:0  XD:Z:CATGAGG_GGTGATC    a3:i:92 bi:Z:5346565226f1:Z:CATGAGG f2:Z:GGTGATC    i1:Z:Q44    i2:Z:Q27    pr:i:22 pt:i:15 px:i:3813   py:i:1262   rq:f:0.03   si:i:3750   tm:Z:AQ tq:i:195    MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0  NM:i:54 RG:Z:Z0011
cmnbroad commented 7 months ago

@ilyasoifer Is there any way I can access the original cram (or better yet, a small subset thereof consisting of just MT) that illustrates this issue) and the reference ? It might be hard to debug without that. If thats not possible, a few suggestions: can you try using PrintReads to write the original cram (I would try just MT) first to a cram, then to a sam, and also the original cram to a sam, and see how those compare? It would also be useful to see what that read looks like if you use samtools view on the ORIGINAL cram. Do you know what software/version was used to write the original cram ?

ilyasoifer commented 7 months ago

Hi @cmnbroad - thanks! I can upload to one of the buckets that are shared between Ultima and the Broad. Could you provide your gcp account so I can give you permissions? I am guessing that Megan Shand can share it through our joint slack channel if you prefer.

The original file was created using samtools v1.17. I provided above the result of writing BAM and Cram with PrintReads. Is it helpful or you prefer SAM? And if I just do samtools view ultMerge.mt.cram I get

038958_1-Z0011-5346565226   0   MT  3470    60  54M *   0   0   TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA  DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD  bi:Z:5346565226 rq:f:0.03   pr:i:22 pt:i:15 px:i:3813   py:i:1262   si:i:3750   tq:i:195    tm:Z:AQ i1:Z:Q44    f1:Z:CATGAGG    i2:Z:Q27    f2:Z:GGTGATC    a3:i:92 XD:Z:CATGAGG_GGTGATC    X1:i:0  AS:i:54 MD:Z:54 NM:i:0  RG:Z:Z0011
cmnbroad commented 7 months ago

@ilyasoifer cnorman@broadinstitute.org. And don't worry about doing the PrintReads conversions I requested - if I have access to the original file and the reference I can debug this directly.

ilyasoifer commented 7 months ago

@cmnbroad, OK, great. Shared the files with Megan, she will transfer them over!

ilyasoifer commented 7 months ago

@cmnbroad - the reference is from here: gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/

cmnbroad commented 7 months ago

Thanks for reporting this @ilyasoifer - it looks like a serious bug. I see whats going on and am working on a fix.

ilyasoifer commented 7 months ago

Thank you, good to hear that it was useful

ilyasoifer commented 6 months ago

@cmnbroad - hope all is well. I was wondering if there is any ETA when the fix that you made will be released?

Thanks! Ilya

droazen commented 6 months ago

@ilyasoifer It will be part of the GATK 4.6 release, which should come out this month. We've also implemented a cram scanning tool that users can use to scan their crams to see if any are affected.

ilyasoifer commented 5 months ago

Ah, great, thanks for updating me!

droazen commented 4 months ago

Summary information about this bug:

This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0. The PR with the fix is: https://github.com/broadinstitute/gatk/pull/8900

This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.

This bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions, and both of the following conditions are met:

When both of these conditions are met, the resulting CRAM file may have corrupt containers associated with that contig containing reads with an incorrect sequence.

Since many common references such as hg38 have N's at the very beginning of the autosomes and X/Y, many pipelines will not be affected by this bug. However, users of a telomere-to-telomere reference, users doing mitochondrial calling, and users with reads aligned to the alt sequences will want to scan their CRAM files for possible corruption.

The other mitigating circumstance is that when a CRAM is affected, the signal will be overwhelmingly obvious, with the mismatch rate typically jumping from sub-1% to 80-90% for the affected regions, making it likely to be caught by standard QC processes.

A CRAM scanning tool called CRAMIssue8768Detector that can detect whether a particular CRAM file is affected by this bug was added in https://github.com/broadinstitute/gatk/pull/8819, and was released as part of GATK 4.6.0.0