BA.1* PHEC sequences with runs of SNVs and/or indels in ranges 27200-27216 and/or 28478-28495

AngieHinrichs commented 2 years ago

I came across a messy little branch within BA.1 in the UShER tree, with lots of longish internal branches and substitutions occurring in runs of more or less adjacent bases in the range 27200-27216. Many sequences also have substitutions around 28478-28495. Almost all sequences in the branch are England/PHEC-*. Several common BA.1.* sublineage mutations are scattered around the branch. So in case there's some kind of batch effect... here's a file with their IDs:

phecBranch.txt

I took a quick look at minimap2 alignments. Some of the sequences have deletions of varying lengths in those ranges, which would look like Ns to UShER, and UShER might then impute substitutions based on the sequence's placement. Other sequences have substitutions in those ranges.
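
For anyone who wants to repeat that check on their own alignments, here is a minimal sketch of one way to flag deletions that overlap the two ranges, given a CIGAR string and the alignment's 1-based reference start (as found in SAM output). The function name `deletions_in_regions` and the `REGIONS` constant are hypothetical names for illustration only, not part of minimap2 or UShER.

```python
import re

# Ranges from this issue, as 1-based inclusive reference coordinates.
REGIONS = [(27200, 27216), (28478, 28495)]

# One CIGAR operation: a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def deletions_in_regions(cigar, ref_start, regions=REGIONS):
    """Return (start, end) for each deletion (D) overlapping any region.

    Coordinates are 1-based and inclusive on the reference.
    """
    hits = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "DN":
            # D and N both consume reference bases; only D is reported.
            del_start, del_end = ref_pos, ref_pos + length - 1
            overlaps = any(
                del_start <= r_end and del_end >= r_start
                for r_start, r_end in regions
            )
            if op == "D" and overlaps:
                hits.append((del_start, del_end))
            ref_pos += length
        elif op in "M=X":
            # Match/mismatch operations consume reference bases.
            ref_pos += length
        # I, S, H and P do not advance the reference position.
    return hits


if __name__ == "__main__":
    # Toy CIGAR: 27199 aligned bases, a 6-base deletion, then more aligned bases.
    print(deletions_in_regions("27199M6D1000M", ref_start=1))
```

This only reports deletions; substitutions inside the ranges would need the aligned bases as well as the CIGAR.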

Theo Sanderson's taxonium viewer is great for taking a quick look at these:

Go to https://cov2tree.org/
Click "load the public SARS-CoV-2 tree"
When it loads, in the "Search" panel on the right, select "Name" and paste in England/PHEC-5Q07CZ8A/2022
"1 Result" and a magnifying glass icon should appear just below those inputs. Click on the magnifying glass to zoom in on England/PHEC-5Q07CZ8A/2022
Scroll to zoom out (way out) to see the branch in which England/PHEC-5Q07CZ8A/2022 is placed. Notice that compared to surrounding branches, it's very spread out horizontally and very few nodes have multiple sequences.
Hover over branch lines to see substitutions associated with those nodes of the tree. Hover over circles to see sequence names, lineage assignments etc.

The coloring options make it even more fun. :)

In the "Colour By" panel on the right, select "Genotype", then select "nt" and paste in 28488
Click the up or down triangles in the Residue input box to scan through successive positions
With 28488 as Residue, scroll to zoom out and see that BA.1 and BA.2 have some other messy little branches with whatever's going on at 28478-28495 (and often at least some England/PHEC-* sequences).

nickloman commented 2 years ago

Thanks Angie - we'll take a look and refer to the source lab!

BioWilko commented 2 years ago

I've looked into this and the issue is with the source lab's bioinformatics pipeline.

At every position with <30 coverage, instead of filling the gap with Ns, the consensus comes out with no gap at all (e.g. GCTA------AG, where the dashes have <30 coverage, comes out as GCTAAG). I've checked some of the affected BAM files and was able to accurately predict the resulting consensus length for several test BAMs by counting the number of positions with <30 coverage (inc. 0 coverage) and subtracting this from the reference length. Dropping those bases will lead to a number of erroneous alignments across the length of the genome, hence the bad branch.
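
To make the failure mode concrete, here is a toy sketch of the difference between masking low-coverage positions with N and dropping them, plus the length check described above. This is not the source lab's pipeline; the function names and the per-position `bases`/`depths` inputs are assumptions for illustration, and only the depth threshold of 30 comes from this thread.

```python
MIN_DEPTH = 30  # coverage threshold mentioned in this thread


def masked_consensus(bases, depths, min_depth=MIN_DEPTH):
    """Expected behaviour: low-coverage positions become N, length is preserved."""
    return "".join(b if d >= min_depth else "N" for b, d in zip(bases, depths))


def collapsed_consensus(bases, depths, min_depth=MIN_DEPTH):
    """Behaviour described in this thread: low-coverage positions are dropped."""
    return "".join(b for b, d in zip(bases, depths) if d >= min_depth)


def expected_collapsed_length(ref_length, depths, min_depth=MIN_DEPTH):
    """Reference length minus the number of positions below the threshold."""
    low = sum(1 for d in depths if d < min_depth)
    return ref_length - low


if __name__ == "__main__":
    # Toy 12-base region: positions 5-10 have depth below 30 (including 0).
    bases = "GCTATTGACCAG"
    depths = [50, 50, 50, 50, 0, 5, 10, 29, 12, 3, 40, 40]

    print(masked_consensus(bases, depths))     # GCTANNNNNNAG
    print(collapsed_consensus(bases, depths))  # GCTAAG
    print(expected_collapsed_length(len(bases), depths))  # 6
```

Because the collapsed sequence is shorter than the reference, every base downstream of a dropped stretch is shifted, so an aligner has to explain the missing bases as deletions or nearby substitutions.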

AngieHinrichs commented 2 years ago

Thanks for tracking that down @BioWilko! Can the lab's pipeline be fixed? Will these sequences be updated? (If not then I can exclude them -- just wondering.)

COG-UK / dipi-group

BA.1* PHEC sequences with runs of SNVs and/or indels in ranges 27200-27216 and/or 28478-28495 #207