broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.7k stars 591 forks source link

HaplotypeCaller inconsistent read filtering #7873

Closed rhysnewell closed 2 years ago

rhysnewell commented 2 years ago

Bug Report

Hi GATK team,

I've just been running some variant calling benchmarks on a set of bacterial isolates and I was trying to understand how GATK HaplotypeCaller chooses which reads to use in its analyses. From what I could tell within the code, HaplotypeCaller sets these default read filters:

line ~336 HaplotypeCallerEngine.java

public static List<ReadFilter> makeStandardHCReadFilters() {
        List<ReadFilter> filters = new ArrayList<>();
        filters.add(new MappingQualityReadFilter(DEFAULT_READ_QUALITY_FILTER_THRESHOLD));
        filters.add(ReadFilterLibrary.MAPPING_QUALITY_AVAILABLE);
        filters.add(ReadFilterLibrary.MAPPED);
        filters.add(ReadFilterLibrary.NOT_SECONDARY_ALIGNMENT);
        filters.add(ReadFilterLibrary.NOT_DUPLICATE);
        filters.add(ReadFilterLibrary.PASSES_VENDOR_QUALITY_CHECK);
        filters.add(ReadFilterLibrary.NON_ZERO_REFERENCE_LENGTH_ALIGNMENT);
        filters.add(ReadFilterLibrary.GOOD_CIGAR);
        filters.add(new WellformedReadFilter());

        return filters;
    }

And the WellformedReadFilter includes these filters:

private void createFilter() {
        final AlignmentAgreesWithHeaderReadFilter alignmentAgreesWithHeader = new AlignmentAgreesWithHeaderReadFilter(samHeader);

        wellFormedFilter = ReadFilterLibrary.VALID_ALIGNMENT_START
                .and(ReadFilterLibrary.VALID_ALIGNMENT_END)
                .and(alignmentAgreesWithHeader)
                .and(ReadFilterLibrary.HAS_READ_GROUP)
                .and(ReadFilterLibrary.HAS_MATCHING_BASES_AND_QUALS)
                .and(ReadFilterLibrary.READLENGTH_EQUALS_CIGARLENGTH)
                .and(ReadFilterLibrary.SEQ_IS_STORED)
                .and(ReadFilterLibrary.CIGAR_CONTAINS_NO_N_OPERATOR);
    }

These all seem sensible and consistent, but I was finding that when I would apply these filters to the output of other tools like samtools view, I'd get different number of reported reads compared to what seemed to be being discovered within HaplotypeCaller (via adding debug messages).

For example, using this SRA sample: https://www.ncbi.nlm.nih.gov/sra/SRR12324251 and mapping the reads against these references using minimap2: https://s3.amazonaws.com/zymo-files/BioPool/ZymoBIOMICS.STD.refseq.v2.zip, and then observing the alignments that appear at position 97201 on the main E. Coli Chromosome we can see that there are 873 alignments overlapping that position: samtools view gatk_bams/Escherichia_coli_complete_genome.fasta.SRR12324251_1.fastq.bam.bam Escherichia_coli_chromosome:97201-97201 | less -S

Applying the above filters to these alignments reports 871 alignments. However, GATK HaplotypeCaller reports only 743 valid alignments (before variant calling). Below is some code where I added debug messages to retrieve the read names that HaplotypeCaller determines to be valid:

~line 476 of HaplotypeCallerEngine.java

    public ActivityProfileState isActive( final AlignmentContext context, final ReferenceContext ref, final FeatureContext features ) {
        if ( forceCallingAllelesPresent && features.getValues(hcArgs.alleles, ref).stream().anyMatch(vc -> hcArgs.forceCallFiltered || vc.isNotFiltered())) {
            return new ActivityProfileState(ref.getInterval(), 1.0);
        }

        if( context == null || context.getBasePileup().isEmpty() ) {
            // if we don't have any data, just abort early
            return new ActivityProfileState(ref.getInterval(), 0.0);
        }

        final boolean debug = (context.getPosition() - 1) % 100000 == 0 || (context.getPosition()  >= 97200 && context.getPosition()  <= 97350) || (context.getPosition()  >= 641000 && context.getPosition()  <= 650000);
        if (debug) {
            System.out.println("Position " + context.getPosition() + " Reads at position " + context.size());
            if (context.getPosition() == 97201 && context.getContig().equals("Escherichia_coli_chromosome")) {
                System.out.println("##READS");
                context.getBasePileup().getReads().forEach(read -> System.out.println(read.getName()));
                System.out.println("##END READS");
            }
        }
...
...
}

So HaplotypeCaller is filtering more reads which is fine, but the problem appears when we look at which alignments are actually being filtered. Below is a table with four example alignments at the previously described position on the E. coli genome. The main output from samtools view is included as well as a column describing whether or not GATK HaplotypeCaller filtered out the read. Two of the alignments were filtered out, neither of which seem to meet the above criteria. What's even more startling is that the last two entries are almost identical alignments, the only difference being the read name however only one of them is filtered by GATK. This is quite perplexing to me, and I can't seem to find anything in the code for HaplotypeCaller that would cause this intentionally.

Read Name | flag | contig | alignment start | MAPQ | CIGAR | MREF | MPOS | Insert | GATK Filtered -- | -- | -- | -- | -- | -- | -- | -- | -- | -- SRR12324251.21519570 | 89 | Escherichia_coli_chromosome | 97055 | 60 | 64M1I13M4D1M4I67M | = | 97055 | 0 | FALSE SRR12324251.14760176 | 163 | Escherichia_coli_chromosome | 97056 | 60 | 63M1I13M4D1M4I68M | = | 97284 | 350 | TRUE SRR12324251.10517076 | 153 | Escherichia_coli_chromosome | 97056 | 60 | 3S63M1I13M4D1M4I65M | = | 97056 | 0 | TRUE SRR12324251.24859820 | 153 | Escherichia_coli_chromosome | 97056 | 60 | 3S63M1I13M4D1M4I65M | = | 97056 | 0 | FALSE

All of the filtered reads as determined by HaplotypeCaller seemt to actually be valid reads and there is no consistency that I can see to what reads are and are not filtered. Perhaps I am missing something quite obvious here, but perhaps a clarification on what read filtering steps are being undertaken by GATK HaplotypeCaller would help me out. If there is no easy explanation, then this might be an unintended bug.

Cheers, Rhys

Affected tool(s) or class(es)

GATK HaplotypeCaller 4.2.2 - HaplotypeCallerEngine

droazen commented 2 years ago

@rhysnewell My guess is that you are seeing the effects of random downsampling. HaplotypeCaller downsamples high-depth sites using a reservoir downsampling algorithm. Try running HaplotypeCaller with the argument --max-reads-per-alignment-start 0 to disable downsampling and let us know if that makes a difference.

rhysnewell commented 2 years ago

Oh nice, that seems to be the ticket. All reads are used when setting that parameter. May I ask what the logic behind performing the downsampling is? Isn't there a risk of removing valid alignments that contribute to low abundance variation events? This would maybe only really be a problem when you are analysing sequences from a population of cells/microbes, but maybe the reward is greater than the risk?

droazen commented 2 years ago

@rhysnewell The default settings for the GATK downsampler are optimized for germline calling on typical (ie., Illumina) short reads data. The goal of the downsampler is to control memory usage at dubious sites where the aligner happens to place huge numbers of low-mapping-quality reads, without harming the quality of the variant calls in any meaningful way. By default, if there are more than 50 reads starting at the exact same alignment start position, GATK will start to randomly eliminate reads -- this gives a maximum total coverage for each locus of 50 * read_length, which is more than sufficient for typical whole-exome or genome germline calling. For other applications, such as amplicon sequencing (where the reads inherently all tend to start at the same positions) or microbial calling, these defaults will not work well, and you are likely better off disabling the built-in downsampler completely, and using a different downsampling approach if memory use is an issue.

rhysnewell commented 2 years ago

Thanks for the explanation, I've generally not run into any memory issues with HaplotypeCaller so I think disabling it might be best in my case. I'll go ahead and close this issue now.

Just as a side note (and I'm not sure if you would be interested in this), but the reason I've been digging so deeply into HaplotypeCaller behaviour is that we've been working on a rust based reimplementation of the HaplotypeCaller but geared towards metagenomic analysis. So the interface is more geared for handling many genomes all in parallel as well as adding in a bunch of metrics that microbiologists would find useful: https://github.com/rhysnewell/Lorikeet/tree/dev & https://github.com/philipc/gkl-rs

I think the work that you and the GATK team is astounding, and it has been really helpful to have all of the code so well documented and well written. Just thought I'd make you aware of the rust effort in case you had any thoughts on it.

Cheers, Rhys