cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme
Other
1.04k stars 97 forks source link

XBB* with complex S mutations around 145-155 cannot be analysed by Usher (243 seqs, 12 countries) #1882

Closed aviczhl2 closed 1 year ago

aviczhl2 commented 1 year ago

Current proposal:

GISAID_query: T21994C, T21995C, del21997_22000 (S:Y145Q, S:H146K, S:K147T, S:N148T, S:N149K, S:K150V, S:S151G, S:ins152K, S:M153V, S:E154S says GIASID)

No. of sequences: 218 in 12 countries , mostly XBB.1.5 but some are on XBB.1, XBB.1.9.1 or BA.2.10.1, centered in Ukraine and USA.

The 4-nuc deletion makes it very difficult to analyse and it seems currenly no platform can analyse this correctly, definitely deserves a look.

I select "high-coverage only" and result in 4 seqs: EPI_ISL_16507795, EPI_ISL_16548476, EPI_ISL_16919550, EPI_ISL_17211240, all submitters confirmed the frameshift in 21997.


Previous Proposal: sublineage of: XBB.1.9.1 mutations on top of XBB.1.9.1: T21994C,T21995C, del21997_22000 (S:Y145Q, S:H146K, S:K147T, S:N148T, S:N149K, S:K150V, S:S151G, S:ins152K, S:M153V, S:E154S says GIASID) GISAID_query: T21994C,T21995C,T23018C,T28297C No. of Sequences: 17( Ukraine 14, USA 1 UK 1 Moldova 1 )

Sequences: EPI_ISL_17015856, EPI_ISL_17154049, EPI_ISL_17343412, EPI_ISL_17390702, EPI_ISL_17390708-17390712, EPI_ISL_17390714, EPI_ISL_17390717, EPI_ISL_17390720-17390722, EPI_ISL_17390725, EPI_ISL_17390738, EPI_ISL_17390741

First seq: EPI_ISL_17015856(Moldova) 2023-2-6 Latest seq: EPI_ISL_17343412(USA) 2023-3-26

usher

Screen Shot 2023-04-13 at 13 53 43
Over-There-Is commented 1 year ago

The Chinese sequence was imported from Japan image

Over-There-Is commented 1 year ago

but it doesn't contain Y145H mutation

aviczhl2 commented 1 year ago

but it doesn't contain Y145H mutation

How to check mutations for GenBase sequences directly ? Am I missing something? Usher places it on the Y145H branch here and it doesn't seem to contain 145 reversions after the Y145H.

Over-There-Is commented 1 year ago

2023-04-13 (2) 2023-04-13 (3)

aviczhl2 commented 1 year ago
Screen Shot 2023-04-13 at 20 03 11 Screen Shot 2023-04-13 at 20 02 59

Okay I find it. On the official mutation list it has G21987A which is S:G142D, however in usher analysis it has T21994C and T21995C which is S:Y145H.

aviczhl2 commented 1 year ago

I don't know why usher place that in this position but it seems to have S:G142D instead of S:Y145H.

Screen Shot 2023-04-13 at 20 15 42 Screen Shot 2023-04-13 at 21 21 57

@AngieHinrichs

AngieHinrichs commented 1 year ago

21987 (S:142) is masked in all inputs to UShER because it is part of the Problematic Sites set (history: S:G142D is part of Delta, but was very inconsistently detected in Delta sequences due to amplicon dropout, causing a lot of trouble for tree-building).

C_AA009104.1 has a deletion of 3 bases from a 6-base TTATTA repeat that could be 21991-21993 or 21994-21996 depending on whether the alignment tool shifts the deletion left or right:

222222222222222222222
111111111111111222222
999999999999999000000
888889999999999000000
567890123456789012345

GGGTGTTTATTACCACAAAAA  reference
GGATGT---TTACCAAAAAAA  CA_009104 (del left-shifted)
GGATGTTTA---CCAAAAAAA  CA_009104 (del right-shifted)

If the alignment is right-shifted (as it is for the web interface) then bases 21994-21996 are masked to N for UShER because it doesn't know about deletions, only substitutions. But unfortunately that makes it appear to UShER as if the sequence might possibly have substitutions at 21994 and 21995, and it can place the sequence with other sequences that do have those mutations.

Another possibility -- which I have not yet looked into in detail -- is that mutations at 21994 and 21995 might be artifacts due to incorrect assembly of a deletion, i.e. the samples have the deletion but assembly errors make it appear that they have substitutions. I have seen that happen before so I do branch-specific masking of known deletions on big branches.

aviczhl2 commented 1 year ago

Another possibility -- which I have not yet looked into in detail -- is that mutations at 21994 and 21995 might be artifacts due to incorrect assembly of a deletion, i.e. the samples have the deletion but assembly errors make it appear that they have substitutions.

This may be the case. I checked again and found that GISAID tend to assign those sequences as

T21994C,T21995C, del21997_22000 around this site, compared with the usual del21992_21994 for S:Y144-.

Screen Shot 2023-04-14 at 08 01 12 Screen Shot 2023-04-14 at 08 00 59

Actually more than 200 XBB* sequences on different branches of XBB from all around the world(mostly around Ukraine, USA, Czech Republic) all having this set of mutations by querying T21994C, T21995C, del21997_22000. So it is likely real, but we have to figure out what the actual mutation is.

Screen Shot 2023-04-14 at 08 11 10

nextclade cannot analysis the S for any of those sequences though.

AngieHinrichs commented 1 year ago

T21994C, T21995C, del21997_22000

That's striking to me because 21997-22000 means four bases are deleted, not three bases like CA_009104.1. Three bases maintains the coding frame but 4 bases would cause a frameshift. And indeed when I align Scotland/QEUH-326E8AB6/2023 to reference, it has four fewer bases than the reference in that region. However, minimap2 and GISAID are placing those gaps quite differently: minimap2 finds a 3-base gap and later a 1-base gap, while whatever aligner GISAID uses finds two mismatches and one 4-base gap. If the penalty of a gap base is equal to the penalty for a mismatch then the GISAID alignment is objectively worse, but other scoring schemes are possible (e.g. a significant penalty to begin a gap but a small penalty to extend it).

22222222222222222222222
11111111111122222222222
99999999999900000000000
88999999999900000000000
89012345678901234567890

TGTTTATTACCACAAAAACAACA ref
   |||      |           diff (4 if gap == mismatch)
TGT---TTACCA-AAAAACAACA QEUH-326E8AB6, minimap2

TGTTTATTACCACAAAAACAACA ref
      || ||||           diff (6 if gap == mismatch)
TGTTTACCA----AAAAACAACA QEUH-326E8AB6, GISAID

Either way, since a deletion of 4 bases would cause a frameshift, I wonder if there is some kind of assembly problem affecting these. It would be great if someone who knows how to evaluate assemblies / raw reads could take a look. Oh wait... there's @theosanderson's https://deeperseq.genomium.org/ and raw reads for Scotland/QEUH-326E8AB6/2023 are in SRA run ERR10983997, so just paste in ERR10983997 there and zoom in to this region:

image

The G21987A and 3-base del are very clear, and 65 out of 90 reads do seem to be skipping reference base 22000, but 25 out of 90 reads have C22000A instead, weird. I still don't know what to make of it... a frameshift early in S? Zooming out to NC_045512.2:21,666-22,505, it looks like there is amplicon dropout later, so who knows what's going on in there, but that's hundreds of bases after the weird deletion, I would think there'd be a good chance of a stop codon by then but I'm out of time to look at it.

aviczhl2 commented 1 year ago

T21994C, T21995C, del21997_22000

That's striking to me because 21997-22000 means four bases are deleted, not three bases like CA_009104.1. Three bases maintains the coding frame but 4 bases would cause a frameshift. And indeed when I align Scotland/QEUH-326E8AB6/2023 to reference, it has four fewer bases than the reference in that region. However, minimap2 and GISAID are placing those gaps quite differently: minimap2 finds a 3-base gap and later a 1-base gap, while whatever aligner GISAID uses finds two mismatches and one 4-base gap. If the penalty of a gap base is equal to the penalty for a mismatch then the GISAID alignment is objectively worse, but other scoring schemes are possible (e.g. a significant penalty to begin a gap but a small penalty to extend it).

22222222222222222222222
11111111111122222222222
99999999999900000000000
88999999999900000000000
89012345678901234567890

TGTTTATTACCACAAAAACAACA ref
   |||      |           diff (4 if gap == mismatch)
TGT---TTACCA-AAAAACAACA QEUH-326E8AB6, minimap2

TGTTTATTACCACAAAAACAACA ref
      || ||||           diff (6 if gap == mismatch)
TGTTTACCA----AAAAACAACA QEUH-326E8AB6, GISAID

Either way, since a deletion of 4 bases would cause a frameshift, I wonder if there is some kind of assembly problem affecting these. It would be great if someone who knows how to evaluate assemblies / raw reads could take a look. Oh wait... there's @theosanderson's https://deeperseq.genomium.org/ and raw reads for Scotland/QEUH-326E8AB6/2023 are in SRA run ERR10983997, so just paste in ERR10983997 there and zoom in to this region:

image

The G21987A and 3-base del are very clear, and 65 out of 90 reads do seem to be skipping reference base 22000, but 25 out of 90 reads have C22000A instead, weird. I still don't know what to make of it... a frameshift early in S? Zooming out to NC_045512.2:21,666-22,505, it looks like there is amplicon dropout later, so who knows what's going on in there, but that's hundreds of bases after the weird deletion, I would think there'd be a good chance of a stop codon by then but I'm out of time to look at it.

There is a frameshift on 21997 confirmed by multiple sequence submitters. Something complex happens here.

Screen Shot 2023-04-14 at 12 23 51 Screen Shot 2023-04-14 at 12 23 32 Screen Shot 2023-04-14 at 12 23 07
silcn commented 1 year ago

The G21987A and 3-base del are very clear, and 65 out of 90 reads do seem to be skipping reference base 22000, but 25 out of 90 reads have C22000A instead, weird. I still don't know what to make of it... a frameshift early in S? Zooming out to NC_045512.2:21,666-22,505, it looks like there is amplicon dropout later, so who knows what's going on in there, but that's hundreds of bases after the weird deletion, I would think there'd be a good chance of a stop codon by then but I'm out of time to look at it.

A 4-nuc deletion produces a stop codon at nuc:22197-22199 (between S:212-213), which is before the dropout. The deletion of 22000 surely can't be real.

aviczhl2 commented 1 year ago

T21994C, T21995C, del21997_22000

That's striking to me because 21997-22000 means four bases are deleted, not three bases like CA_009104.1. Three bases maintains the coding frame but 4 bases would cause a frameshift. And indeed when I align Scotland/QEUH-326E8AB6/2023 to reference, it has four fewer bases than the reference in that region. However, minimap2 and GISAID are placing those gaps quite differently: minimap2 finds a 3-base gap and later a 1-base gap, while whatever aligner GISAID uses finds two mismatches and one 4-base gap. If the penalty of a gap base is equal to the penalty for a mismatch then the GISAID alignment is objectively worse, but other scoring schemes are possible (e.g. a significant penalty to begin a gap but a small penalty to extend it).

What if we use XBB as reference and see what it looks like on top of XBB, given these seqs are surely descendants of XBBs, XBB is different from original by G142D, Y144- H146Q, Q183E, V213E, which may greatly affect the result?

ryhisner commented 1 year ago

I've always assumed these sequences are all artifacts. They just seem totally implausible. It's an area where sequencing errors are common in XBB, and mutations at numerous consecutive residues combined with a one-AA insertion seems like a classic sequencing-artifact signature to me.

FedeGueli commented 1 year ago

Same here as @ryhisner i remember i tracked something similar in denmark monrhs ago.

aviczhl2 commented 1 year ago

I've always assumed these sequences are all artifacts. They just seem totally implausible. It's an area where sequencing errors are common in XBB, and mutations at numerous consecutive residues combined with a one-AA insertion seems like a classic sequencing-artifact signature to me.

Yes I used to think of that. However the global distribution (centered around Ukraine, with some in countries near Ukraine, also having some in USA, and spread globally to 12 countries) and the multiple confirmations of the frameshift is not a pure artefact way?

Another question, if it is an artefact it shall also appear in other lineages, not only XBBs?

I guess there maybe some artefact combined with some real mutations. There may be mutations happening to "trigger" an artefact and thus explains the geographical and lineage distribution.

aviczhl2 commented 1 year ago

A 4-nuc deletion produces a stop codon at nuc:22197-22199 (between S:212-213), which is before the dropout. The deletion of 22000 surely can't be real.

Yes I agree it cannot be totally non-artefact. However there seems to be other reasons(geographical distribution, lineage distribution etc) suggesting it may not be fully caused by artefact. More likely some mutations and artefact combined together.

silcn commented 1 year ago

I've always assumed these sequences are all artifacts. They just seem totally implausible. It's an area where sequencing errors are common in XBB, and mutations at numerous consecutive residues combined with a one-AA insertion seems like a classic sequencing-artifact signature to me.

Agreed. I notice that many of the sequences have more frameshifts than just this one, including the ones from the Idaho lab that has had issues for ages.

The "insertion" S:ins152K is GISAID getting confused. It recognises the apparent 4-nucleotide deletion but has trouble converting it into AA mutations; it seems to be thrown by the fact the frameshifted sequence has SEF at nuc:22032-22040, and it tries to match this to the SEF155-157 of the original spike (nuc:22025-22033) despite the fact that the nucleotide sequences are different. In order to do this, it has to ignore the deletion and just keep counting as if it wasn't there, match up the new "W152" with the old W152, and stick in an insertion immediately after it. Then for some reason it jumps back to the original reading frame, when one would expect it to continue spitting out frameshifted AAs up to and beyond the stop codon at ~212.

aviczhl2 commented 1 year ago

Then for some reason it jumps back to the original reading frame, when one would expect it to continue spitting out frameshifted AAs up to and beyond the stop codon at ~212.

Perfect analysis.

Now the problem is: Anyone having any idea on the "some reason" for GISAID to ignore a 7-nuc difference and jump back to the original reading frame? (which means 22034-22040 is read twice)

aviczhl2 commented 1 year ago

@silcn I manually copied the 21991-22199 position for QEUH-326E8AB6

TTACCAAAAAACAACAAAAGTTGGATGGAAAGTGAGTTCAGAGTTTATTCTAGTGCGAATA ATTGCACTTTTGAATATGTCTCTCAGCCTTTTCTTATGGACCTTGAAGGAAAAGAGGGTAATTTCAAAAATCTTAGGGAA TTTGTGTTTAAGAATATTGATGGTTATTTTAAAATATATTCTAAGCACACGCCTATTAATTTAG

21991-21996 TTACCA 22001-22010 AAAAACAACA 22011-22020 AAAGTTGGAT 22021-22030 GGAAAGTGAG 22031-22040 TTCAGAGTTT 22041-22050 ATTCTAGTGC 22051-22060 GAATAATTGC 22061-22070 ACTTTTGAAT 22071-22080 ATGTCTCTCA 22081-22090 GCCTTTTCTT 22091-22100 ATGGACCTTG 22101-22110 AAGGAAAAGA 22111-22120 GGGTAATTTC 22121-22130 AAAAATCTTA 22131-22140 GGGAATTTGT 22141-22150 GTTTAAGAAT 22151-22160 ATTGATGGTT 22161-22170 ATTTTAAAAT 22171-22180 ATATTCTAAG 22181-22190 CACACGCCTA 22191-22199 TTAATTTAG

yeah the sequence produces a stop on 22197-22199. Not Ns.

aviczhl2 commented 1 year ago

The amount of sequence grows to 243. Austria, Czech, India, Ireland, Moldova, Poland, Spain, UK, USA, Ukraine, Zambia, France.

ItokawaK commented 1 year ago

Probably, the true sequence is with 3-nt deletion plus C22000A substitution like below.

TGTTTATTACCACAAAAACAACA 
TGT---TTACCAAAAAAACAACA 
            ^
          22000

The C22000A substitution creates a long consecutive track of A. Especially for IonTorrent and Nanopore, such a homopolymer sequence is difficult to basecall correctly. It often wrongly be base called with fewer number of nucleotide.

I queried del21997_22000 for GISAID today and obtained 314 genomes. Among those, 242 genomes (77%) were sequenced by IonTorrent and 63 (20%) were by Nanopore.

corneliusroemer commented 1 year ago

@ItokawaK's explanation sounds reasonable It's just the normal S:144- deletion (common to all XBB) plus a C->A that cause an artefact.

The current title is misleading - these are not complex S mutations - but artefactual frameshifts so I'll close this issue. To define a lineage we needs a reliable marker - apart from the artefact-prone area. It's for example not clear that these sequences arose in one event or multiple.

Re this Nextclade comment:

nextclade cannot analysis the S for any of those sequences though.

Nextclade can and does analyse the S but it tells you: there's a frameshift, there's a problem here, so we won't show you the hundreds of mutations caused by the frameshift (that wouldn't be helpful).