Open bw2 opened 5 years ago
@bw2 Are you saying that the bam is sorted in a different order than the sequence dictionary in that bam?
Does the bam pass ValidateSamFile?
I'm not sure at this point. Is it possible the issue would also occur if the bam passes ValidateSamFile and the intervals file is sorted, but the two have different contig orderings?
Bug Report
Affected tool(s) or class(es)
PrintReads, and probably most other ROD-based tools
Affected version(s)
GATK v4.1.0
Description
I downloaded a .bam from SRA (https://www.ncbi.nlm.nih.gov/sra/SRX4114173[accn]) and ran gatk PrintReads to extract subregions based on a picard-style interval list.
The bug is that PrintReads ran without any warnings or errors and silently dropped some (though not all) reads that it should have included based on the interval list. It does include these reads if I run it with an interval list that just contains that one interval I'm interested in, but not if I include it among many other intervals.
The interval list is sorted based on the .bam's sequence dictionary, by running:
picard BedToIntervalList --SEQUENCE_DICTIONARY ../SRR7205167.1.bam --SORT -I GRCh38_intervals.bed -O GRCh38_intervals.sorted.list
The underlying issue, as far as I can tell, is that the .bam reads are sorted, but not in the same order as its sequence dictionary.
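For anyone who wants to check their own file for this condition, here is a minimal sketch of the comparison in plain Python. It is not how GATK or Picard do it. The function name `first_out_of_order_read` and the toy data are hypothetical, and it assumes you have already extracted the contig names from the header's @SQ lines (in order) and the (contig, start) pairs of the mapped reads (in file order) with whatever tool you prefer.

```python
from typing import Iterable, Optional, Sequence, Tuple

Read = Tuple[str, int]  # (contig name, 1-based alignment start)


def first_out_of_order_read(
    dictionary_contigs: Sequence[str],
    reads: Iterable[Read],
) -> Optional[Tuple[int, Read, Read]]:
    """Return (index, previous_read, offending_read) for the first read that
    sorts before its predecessor under the dictionary's contig order.
    Return None if every read is in dictionary order."""
    # Rank each contig by its position in the sequence dictionary.
    rank = {name: i for i, name in enumerate(dictionary_contigs)}
    prev_key = None
    prev_read = None
    for index, read in enumerate(reads):
        contig, start = read
        if contig not in rank:
            raise ValueError(f"contig {contig!r} is not in the sequence dictionary")
        key = (rank[contig], start)
        # A coordinate-sorted file should never step backwards in (rank, start).
        if prev_key is not None and key < prev_key:
            return index, prev_read, read
        prev_key, prev_read = key, read
    return None


# Toy data (made up): the dictionary lists chr1, chr2, chr10,
# but the reads are stored in the order chr1, chr10, chr2.
dictionary = ["chr1", "chr2", "chr10"]
reads = [("chr1", 100), ("chr1", 250), ("chr10", 50), ("chr2", 75)]

problem = first_out_of_order_read(dictionary, reads)
if problem is not None:
    index, previous, offending = problem
    print(f"read #{index} {offending} follows {previous} but sorts before it")
```

A check along these lines only needs one pass over the reads and stops at the first violation, so it could in principle run as a fail-fast guard before any interval-based extraction.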
This might be related to https://github.com/broadinstitute/gatk/issues/101
Expected behavior
I think GATK should fail with an error when this occurs. Otherwise it's easy for users to miss the data loss and end up with incorrect analyses.
Actual behavior
Silently drops data.