cov-lineages / scorpio

serious constellations of reoccurring phylogenetically-independent origin
GNU General Public License v3.0
38 stars 4 forks source link

scorpio overriding AY.4.2 to B.1.617.2 on very few Ns #32

Closed mcroxen closed 2 years ago

mcroxen commented 2 years ago

Example: EPI_ISL_5076925

EPI_ISL_5076925,AY.4.2,0.0,0.9996771068776235,Delta (AY.4.2-like),0.960800,0.000000,PLEARN-v1.2.86,3.1.15,2021-10-13,v1.2.86,passed_qc,scorpio call: Alt alleles 49; Ref alleles 0; Amb alleles 2; Oth alleles 0

Note this genome has only 1 ambiguous base (Y:14120) and no Ns, so the Amb alleles 2 is unusual

Further, if we mutate positions as part of cAY.4.json (e.g., 13,322 and 28270-28,271), then pangolin will return B.1.617.2 as the lineage

11332N|28270N|28271N,B.1.617.2,0.0,0.9990306946688207,Delta (B.1.617.2-like),1.000000,0.000000,PLEARN-v1.2.86,3.1.15,2021-10-13,v1.2.86,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0; scorpio replaced lineage assignment AY.4.2

Just to show that the correct mutations were introduced (nextclade output)

seqName,deletions,missing,nonACGTN EPI_ISL_5076925,"22029-22034,28248-28253,28271",,Y:14120 11332N|28270N|28271N,"22029-22034,28248-28253","11332,28270-28271",Y:14120

rmcolq commented 2 years ago

This does seem a bit odd. As far as I can see, you are getting the right scorpio call, but have a score of 2 ambiguities which doesn't match the expected number of 1. And when you add in a couple more Ns, it reverts back to the parent.

The fact that you get a score of 2 ambiguous alleles for AY.4.2 is not perfect. I think it may be weirdly handling the deletion. But when I tried investigating locally, I pulled down your sequence from gisaid and ran the following

minimap2 -a -x asm20 MN908947.fa EPI_ISL_5076925.fasta | gofasta sam toMultiAlign > aligned.fasta
scorpio haplotype -i aligned.fasta -n AY.4 AY.4.2 B.1.617.2 --label mrca_lineage

which gave me the following set of haplotypes out for the mutation sites of these constellations

query,AY.4.2,B.1.617.2,AY.4
hCoV-19/Canada/ON-PHL-21-34578/2021|EPI_ISL_5076925|2021-09,CHVT,R-RKRNLTAIGMYD,TTSLSVTLIAGLSLVR2RKGRNLTAIT20GMCYT

and when I ran classify (which is what pangolin runs) (the number of refs is total across all 3 because it called AY.4.2 and that inherits parent mutations from AY.4 and that inherits parent mutations from B.1.617.2)

query,constellations,mrca_lineage,ref_count,alt_count,ambig_count,other_count,rule_count,support,conflict,constellation_name
hCoV-19/Canada/ON-PHL-21-34578/2021|EPI_ISL_5076925|2021-09,Delta (AY.4.2-like),AY.4.2,0,51,0,0,9,1.000000,0.000000,Delta (AY.4.2-like)

This hasn't called ANY ambiguous alleles.

Could you provide me with your mutated fasta to investigate futher? Also, can you confirm the version for your install of constellations, scorpio and minimap2?

mcroxen commented 2 years ago

I don't readily have it on hand, but the way that I made the mutations (if it speeds things up):

  1. align sequence and Wuhan reference with mafft (ultimately excluding the Wuhan reference from the alignment)
  2. used seqkit mutate -p 11332:N -p 28270:N -p 28271:N aligned_genome.fa, then stripped any gaps ('-')

constellations should have been 0.0.20 (ran pangolin --update 2 days ago), and I would only be guessing on the minimap2 version, but can check later today if that is the issue.

Also, I wouldn't think Y:14120 should contribute as an ambiguous base in this instance, and would have expected Amb alleles 0 (at least in the original sequence before the above mutations)

rmcolq commented 2 years ago

My theory I am hoping to investigate is that it is because I have been extracting the relevant 3 nucleotides from the alignment and translating to an amino acid using biopython for many of these. I think if there is a - in the 3 nucleotides, but it is not all --- it may translate to X in the same way it would if there were an ambiguous base there. This would explain slight variations in results given differently aligned inputs. If this is the problem, I should be able to strip - within a local neighbourhood, but I guess this needs to be done with care so as to not disrupt the coordinates of the allele too much.

AngieHinrichs commented 2 years ago

I noticed while looking at another case where Scorpio overrode an AY.4.2 call (USA/WA-UW-21092887625/2021 EPI_ISL_5258403) that while cAY.4.json defines 34 sites and has a max_ref of 3, it has a min_alt of 31 which doesn't allow for very many Ns. USA/WA-UW-21092887625/2021 has one ref allele (lack of a deletion) and more than 2 Ns, so it did not pass the min_alt rule for AY.4, and I think that caused it to fail AY.4.2 also. Lowering cAY.4.json's min_alt to 28 was sufficient for USA/WA-UW-21092887625/2021 to pass for AY.4 and AY.4.2.

I know this doesn't address the matter of more ambiguous sites being reported than expected, but perhaps lowering AY.4's min_alt might help some other missed-AY.4.2 cases?

KatSteinke commented 2 years ago

Potentially related to this, I've just seen Scorpio overriding an AY.4 call despite the characteristic mutation being present. This only happened after a recent update, though. Specifically: I ran Pangolin on a batch of samples, both with older versions and the newest ones. For one sample (hCoV-19/Denmark/DCGC-186457/2021):

These do have a fair amount of N's though: sample had 1431 N's; another sample which got correctly called as AY.4 in both setups had 838. Should this be a separate issue, or is something similar going on here?

rmcolq commented 2 years ago

I think there are 2 possibly issues happing concurrently with AY.4 calls:

  1. The constellation currently has too stringent a threshold set for "min_alt" - ie we are failing things with more ambiguous bases. We will probably change this later today.
  2. I think that maybe scorpio calls some sites as ambiguous if the alignment shifts to overlap the codon (need to check). I'm still trying to think if there is a good/non-dangerous way to handle this situation other than having a constellation definition which has enough flexibility to allow for it
KatSteinke commented 2 years ago

I've just checked the scorpio call for the sample, looks like we've got one ambiguous allele? scorpio call: Alt alleles 12 Ref alleles 0 Amb alleles 1 Oth alleles 0 scorpio replaced lineage assignment AY.4 (As far as I can see, there's no non-ATGCN bases in the genome, just the N's I reported.)

AngieHinrichs commented 2 years ago

scorpio call: Alt alleles 12 Ref alleles 0 Amb alleles 1 Oth alleles 0 scorpio replaced lineage assignment AY.4

Since those total to 13 (the number of sites defined for B.1.617.2), I think those are the counts for the constellation selected by Scorpio, not the counts for AY.4. To get the counts for every constellation, run pangolin with the --no-temp option (and you'll probably want to add --outdir or a lot of files will be added to your current directory); then look for files named VOCreport.scorpio.Delta*.csv.

rmcolq commented 2 years ago

I believe these examples now get the correct scorpio constellation result with the updated constellations definitions. However I will continue investigating the ambiguous calls at site(s) of interest.

KatSteinke commented 2 years ago

I've just checked, for me the updated constellation definitions do give the correct result now. (Have yet to find a few minutes to run pangolin with --no-temp, but once I do I can go back to the old env with version 0.0.20 if needed.) EDIT: to follow up on that, this gave me 41 alt alleles, 6 ambigous alleles, 0 ref alleles, 0 other alleles.

AngieHinrichs commented 2 years ago

Thanks for confirming @KatSteinke! No need to find time for --no-temp if it's working now. :)

rmcolq commented 2 years ago

Whilst it is still possible that we are getting more ambiguous calls than expected with scorpio, all examples I've looked at where this was causing a problem I think are now resolved. I'm going to close this issue but if an example is found again where there is still a problem, please do reopen