Closed AngieHinrichs closed 2 years ago
Thanks Angie - we'll take a look and refer to the source lab!
I've looked into this and the issue is with the source labs bioinformatics pipeline.
All positions where there is <30 coverage rather than filling this gap in with Ns the consensus is coming out with no gap at all (e.g. GCTA------AG where the dashes have <30 coverage comes out as GCTAAG). I've checked some of the affected BAM files and managed to calculate the resulting consensus length for several test BAMs accurately based on counting the number of positions with <30 coverage (inc 0 coverage) and subtracting this from the reference length. This will lead to a number of erroneous alignments across the length of the genome hence the bad branch.
Thanks for tracking that down @BioWilko! Can the lab's pipeline be fixed? Will these sequences be updated? (If not then I can exclude them -- just wondering.)
I came across a messy little branch in the UShER tree's BA.1 branch with lots of longish internal branches and substitutions occurring in stretches of more or less adjacent bases in the ranges 27200-27216. Many sequences also have substitutions around 28478-28495. Almost all sequences in the branch are
England/PHEC-*
. Several common BA.1.* sublineage mutations are scattered around the branch. So in case there's some kind of batch effect... here's a file with their IDs:phecBranch.txt
I took a quick look at minimap2 alignments, and some of the sequences have deletions of varying lengths in those ranges (which would look like Ns to usher which might then impute substitutions based on the sequence's placement), while other sequences have substitutions in those ranges.
Theo Sanderson's taxonium viewer is great for taking a quick look at these:
The coloring options make it even more fun. :)