COG-UK / dipi-group

Data integrity and pipeline integration working group
4 stars 1 forks source link

BA.1* PHEC sequences with runs of SNVs and/or indels in ranges 27200-27216 and/or 28478-28495 #207

Closed AngieHinrichs closed 2 years ago

AngieHinrichs commented 2 years ago

I came across a messy little branch in the UShER tree's BA.1 branch with lots of longish internal branches and substitutions occurring in stretches of more or less adjacent bases in the ranges 27200-27216. Many sequences also have substitutions around 28478-28495. Almost all sequences in the branch are England/PHEC-*. Several common BA.1.* sublineage mutations are scattered around the branch. So in case there's some kind of batch effect... here's a file with their IDs:

phecBranch.txt

I took a quick look at minimap2 alignments, and some of the sequences have deletions of varying lengths in those ranges (which would look like Ns to usher which might then impute substitutions based on the sequence's placement), while other sequences have substitutions in those ranges.

Theo Sanderson's taxonium viewer is great for taking a quick look at these:

The coloring options make it even more fun. :)

nickloman commented 2 years ago

Thanks Angie - we'll take a look and refer to the source lab!

BioWilko commented 2 years ago

I've looked into this and the issue is with the source labs bioinformatics pipeline.

All positions where there is <30 coverage rather than filling this gap in with Ns the consensus is coming out with no gap at all (e.g. GCTA------AG where the dashes have <30 coverage comes out as GCTAAG). I've checked some of the affected BAM files and managed to calculate the resulting consensus length for several test BAMs accurately based on counting the number of positions with <30 coverage (inc 0 coverage) and subtracting this from the reference length. This will lead to a number of erroneous alignments across the length of the genome hence the bad branch.

AngieHinrichs commented 2 years ago

Thanks for tracking that down @BioWilko! Can the lab's pipeline be fixed? Will these sequences be updated? (If not then I can exclude them -- just wondering.)