broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

PrintReads silently drops some reads when .bam sequence dictionary has different order than the reads themselves #6065

Open bw2 opened 5 years ago

bw2 commented 5 years ago

Bug Report

Affected tool(s) or class(es)

PrintReads, and probably most other ROD-based tools

Affected version(s)

GATK v4.1.0

Description

I downloaded a .bam from SRA (https://www.ncbi.nlm.nih.gov/sra/SRX4114173[accn]) and ran gatk PrintReads to extract subregions based on a picard-style interval list.

The bug is that PrintReads ran without any warnings or errors and silently dropped some (though not all) reads that it should have included based on the interval list. It does include these reads if I run it with an interval list that just contains that one interval I'm interested in, but not if I include it among many other intervals.

The interval list is sorted based on the .bam's sequence dictionary (by running picard BedToIntervalList --SEQUENCE_DICTIONARY ../SRR7205167.1.bam --SORT -I GRCh38_intervals.bed -O GRCh38_intervals.sorted.list).

The underlying issue, as far as I can tell, is that the reads in the .bam are sorted, but not in the same contig order as the .bam's sequence dictionary.
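To make the suspected mismatch concrete, here is a minimal Python sketch (not GATK code) that checks whether a stream of read contig names follows the order of the sequence dictionary. The function name `first_out_of_order_contig` and the example contig names are hypothetical. In practice the dictionary order would come from the `@SQ` header lines and the read contigs from the reads' reference names.

```python
from typing import Iterable, Optional, Sequence, Tuple


def first_out_of_order_contig(
    dict_order: Sequence[str], read_contigs: Iterable[str]
) -> Optional[Tuple[str, str]]:
    """Return (previous_contig, offending_contig) for the first read whose
    contig appears earlier in the dictionary than the contig of the read
    before it, or None if the reads follow dictionary order."""
    # Map each contig name to its position in the sequence dictionary.
    rank = {name: i for i, name in enumerate(dict_order)}
    prev = None
    for contig in read_contigs:
        if contig not in rank:
            raise KeyError(f"contig {contig!r} is not in the sequence dictionary")
        if prev is not None and rank[contig] < rank[prev]:
            return (prev, contig)
        prev = contig
    return None


# Hypothetical example: the dictionary lists chr1, chr2, chr10,
# but the reads are sorted lexicographically (chr10 before chr2).
dict_order = ["chr1", "chr2", "chr10"]
reads = ["chr1", "chr1", "chr10", "chr2"]
print(first_out_of_order_contig(dict_order, reads))  # ('chr10', 'chr2')
```

A non-`None` result means the reads are coordinate-sorted by some contig order other than the one in the dictionary, which is the situation described above.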

This might be related to https://github.com/broadinstitute/gatk/issues/101

Expected behavior

I think GATK should fail with an error when this occurs. Otherwise it's easy for users to miss the data loss and end up with incorrect analyses.

Actual behavior

Silently drops data.

lbergelson commented 5 years ago

@bw2 Are you saying that the bam is sorted in a different order than the sequence dictionary in that bam?

lbergelson commented 5 years ago

Does the bam pass ValidateSamFile?

bw2 commented 4 years ago

I'm not sure at this point. Is it possible the issue would also occur if the bam passes ValidateSamFile and the intervals file is sorted, but the two have different contig orderings?
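One way to test that hypothesis is to compare the contig order in the bam's sequence dictionary against the contig order in the interval list header. The following is a rough Python sketch under stated assumptions: `same_relative_order` is a hypothetical helper, each input is a list of unique contig names in header order, and the example names are made up.

```python
from typing import Sequence


def same_relative_order(bam_contigs: Sequence[str], interval_contigs: Sequence[str]) -> bool:
    """Return True if the contigs present in both lists appear in the same
    relative order in each list. Contigs found in only one list are ignored."""
    shared = set(bam_contigs) & set(interval_contigs)
    bam_shared = [c for c in bam_contigs if c in shared]
    interval_shared = [c for c in interval_contigs if c in shared]
    return bam_shared == interval_shared


# Hypothetical example: both files are internally sorted, but the bam
# dictionary uses karyotypic order and the interval list uses lexicographic order.
bam_contigs = ["chr1", "chr2", "chr10", "chrM"]
interval_contigs = ["chr1", "chr10", "chr2"]
print(same_relative_order(bam_contigs, interval_contigs))  # False
```

If this returns `False` for a bam that otherwise validates, the two files disagree on contig ordering even though each is sorted by its own order.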