Inexplicably low DP values

jpalmer37 commented 7 months ago

Describe the bug

Hello. First off, thank you for taking the time to read this. As previously described in https://github.com/freebayes/freebayes/issues/619 and https://github.com/freebayes/freebayes/issues/509, I am encountering regions (typically at the start of a sequence) where Freebayes is aggressively filtering reads despite many attempts to turn filtering off via CLI options. As a result, the Freebayes depth values (DP) are significantly (1-2 orders of magnitude) lower than those reported by samtools depth, mpileup, or pysam. For the remaining discussion here, I will refer to this example of an influenza sequence:

samtools depth values are shown in gray above, while Freebayes DP values are shown in black. The starting depth values (which all subsequently increase) compared across tools are:

samtools depth: 170 (no filtering)
samtools mpileup: 87
pysam: 87
freebayes: 0

I have tried numerous combinations of Freebayes parameters to recover these reads to no avail. While I certainly may have missed the correct combination of parameters in my testing (very open to suggestions here), I have at least attempted the use of every parameter in the input filters section, along with the following parameters that were mentioned in the GitHub issues linked above:

--use-duplicate-reads
--report-all-haplotype-alleles
--pooled-continuous
--use-best-n-alleles
--min-alternate-count
--min-alternate-fraction

I further investigated the exact reads that Freebayes was discarding based on the detailed log file (-dd). In these reads, I looked at the following metrics using pysam:

number of aligned segments
read alignment flags
base quality
mapping quality
CIGAR strings
template length
reference start position
a long list of other binary outcomes (is_paired, is_proper_pair, is_unmapped, is_duplicate, is_supplementary, ... etc)

None of the above stood out as a candidate factor for filtering out 100% of reads. The only partial factor is that there are long template lengths (> 800 bp) and corresponding missing "proper read pair" flags in about 50% of read pairs (81 / 166) (which likely accounts for the slightly lower samtools mpileup and pysam read counts).

My main questions:

Does Freebayes contain intrinsic filters that cannot be turned off and may be the root cause of this?
Do you have any suggestions of parameter combinations to try that would give the best shot at remediating this overly-aggressive filtering?

To Reproduce

I am working on getting an example alignment file or simulated dataset that can be shared (privacy is a concern). Please let me know what would best help you investigate this.

Additional Context

We are working on influenza, which you likely know has 8 segments. For each isolate, we sequence raw reads with Illumina, align them to 8 chosen references and call SNPs on all segments using a single BAM. Let me know if you require any additional info.

Thank you!

jpalmer37 commented 6 months ago

Thanks to @jts who discovered a potential resolution. It appears that the allele parser is not handling the jump from one mapping sequence to another, which can be resolved by adding && !justSwitchedTargets to Line 2938 of AlleleParser.cpp.

Link: https://github.com/freebayes/freebayes/blob/master/src/AlleleParser.cpp#L2938

Before:

    while (f != registeredAlignments.end()
           && f->first < currentPosition - lastHaplotypeLength) {     ## ADD HERE ##
        for (deque<RegisteredAlignment>::iterator d = f->second.begin(); d != f->second.end(); ++d) {
            for (vector<Allele>::iterator a = d->alleles.begin(); a != d->alleles.end(); ++a) {
                allelesToErase.insert(&*a);
            }
        }
        positionsToErase.insert(f->first);
        ++f;
    }

After:

    while (f != registeredAlignments.end()
           && f->first < currentPosition - lastHaplotypeLength) && !justSwitchedTargets {
        for (deque<RegisteredAlignment>::iterator d = f->second.begin(); d != f->second.end(); ++d) {
            for (vector<Allele>::iterator a = d->alleles.begin(); a != d->alleles.end(); ++a) {
                allelesToErase.insert(&*a);
            }
        }
        positionsToErase.insert(f->first);
        ++f;
    }

Still uncertain if this is a valid and safe fix. It would be great to get a second opinion on this. I have a BAM file + mapping refs that can be shared to showcase this difference. Thanks!

jts commented 6 months ago

@pjotrp or @ekg could you give a quick look over the fix we proposed above? In my hands it fixed the depth issue but I'm not familiar enough with the code to say whether its the right thing to do. I'd be happy to send a PR if you can give an opinion on the change.

freebayes / freebayes

Inexplicably low DP values #781