Open jpalmer37 opened 7 months ago
Thanks to @jts who discovered a potential resolution. It appears that the allele parser is not handling the jump from one mapping sequence to another, which can be resolved by adding `&& !justSwitchedTargets` to line 2938 of `AlleleParser.cpp`.
Link: https://github.com/freebayes/freebayes/blob/master/src/AlleleParser.cpp#L2938
Before:
while (f != registeredAlignments.end()
       && f->first < currentPosition - lastHaplotypeLength) {  // add the new condition inside this while (...) test
    for (deque<RegisteredAlignment>::iterator d = f->second.begin(); d != f->second.end(); ++d) {
        for (vector<Allele>::iterator a = d->alleles.begin(); a != d->alleles.end(); ++a) {
            allelesToErase.insert(&*a);
        }
    }
    positionsToErase.insert(f->first);
    ++f;
}
After:
while (f != registeredAlignments.end()
       && f->first < currentPosition - lastHaplotypeLength
       && !justSwitchedTargets) {
    for (deque<RegisteredAlignment>::iterator d = f->second.begin(); d != f->second.end(); ++d) {
        for (vector<Allele>::iterator a = d->alleles.begin(); a != d->alleles.end(); ++a) {
            allelesToErase.insert(&*a);
        }
    }
    positionsToErase.insert(f->first);
    ++f;
}
Still uncertain if this is a valid and safe fix. It would be great to get a second opinion on this. I have a BAM file + mapping refs that can be shared to showcase this difference. Thanks!
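To make the suspected mechanism concrete, here is a toy model of the purge loop. It is only a sketch under assumptions that have not been confirmed against the freebayes source: it assumes that, right after a switch to a new reference sequence, `currentPosition` may still hold a coordinate from the previous sequence, so every alignment registered near the start of the new sequence satisfies the `f->first < currentPosition - lastHaplotypeLength` test. The `Registered` alias, the `positionsToPurge` function, and the `applyGuard` switch are hypothetical names used for illustration and do not exist in freebayes.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for registeredAlignments: start position -> read names.
// Hypothetical simplification; the real container maps positions to
// deques of RegisteredAlignment objects.
using Registered = std::map<long, std::vector<std::string>>;

// Returns the positions the purge loop would mark for erasure.
// applyGuard toggles the proposed "&& !justSwitchedTargets" condition.
std::set<long> positionsToPurge(const Registered& registered,
                                long currentPosition,
                                long lastHaplotypeLength,
                                bool justSwitchedTargets,
                                bool applyGuard) {
    std::set<long> positionsToErase;
    auto f = registered.begin();
    while (f != registered.end()
           && f->first < currentPosition - lastHaplotypeLength
           && !(applyGuard && justSwitchedTargets)) {
        positionsToErase.insert(f->first);
        ++f;
    }
    return positionsToErase;
}
```

With reads registered at positions 0, 5, and 12 of a new sequence and an assumed stale `currentPosition` of 2300, the unguarded loop marks all three positions for erasure, while the guarded loop marks none. If that assumption holds, it would be consistent with the DP of 0 at the start of a segment, but it does not show whether skipping the purge is safe in every other code path.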
@pjotrp or @ekg could you give a quick look over the fix we proposed above? In my hands it fixed the depth issue, but I'm not familiar enough with the code to say whether it's the right thing to do. I'd be happy to send a PR if you can give an opinion on the change.
Describe the bug
Hello. First off, thank you for taking the time to read this. As previously described in https://github.com/freebayes/freebayes/issues/619 and https://github.com/freebayes/freebayes/issues/509, I am encountering regions (typically at the start of a sequence) where Freebayes is aggressively filtering reads despite many attempts to turn filtering off via CLI options. As a result, the Freebayes depth values (DP) are significantly (1-2 orders of magnitude) lower than those reported by `samtools depth`, `mpileup`, or `pysam`.

For the remaining discussion here, I will refer to this example of an influenza sequence:

[Figure: depth comparison for the example influenza sequence]

`samtools depth` values are shown in gray above, while Freebayes DP values are shown in black. The starting depth values (which all subsequently increase) compared across tools are:

- `samtools depth`: 170 (no filtering)
- `samtools mpileup`: 87
- `pysam`: 87
- `freebayes`: 0

I have tried numerous combinations of Freebayes parameters to recover these reads to no avail. While I certainly may have missed the correct combination of parameters in my testing (very open to suggestions here), I have at least attempted the use of every parameter in the `input filters` section, along with the following parameters that were mentioned in the GitHub issues linked above:

- `--use-duplicate-reads`
- `--report-all-haplotype-alleles`
- `--pooled-continuous`
- `--use-best-n-alleles`
- `--min-alternate-count`
- `--min-alternate-fraction`
I further investigated the exact reads that Freebayes was discarding based on the detailed log file (`-dd`). In these reads, I looked at the following metrics using `pysam`:

None of the above stood out as a candidate factor for filtering out 100% of reads. The only partial factor is that there are long template lengths (> 800 bp) and corresponding missing "proper read pair" flags in about 50% of read pairs (81 / 166) (which likely accounts for the slightly lower `samtools mpileup` and `pysam` read counts).

My main questions:
To Reproduce
I am working on getting an example alignment file or simulated dataset that can be shared (privacy is a concern). Please let me know what would best help you investigate this.
Additional Context
We are working on influenza, which you likely know has 8 segments. For each isolate, we sequence raw reads with Illumina, align them to 8 chosen references and call SNPs on all segments using a single BAM. Let me know if you require any additional info.
Thank you!