andersen-lab / Freyja

Depth-weighted De-Mixing
BSD 2-Clause "Simplified" License
100 stars 29 forks source link

Wrong classification sublineages? #151

Closed LauraVP1994 closed 11 months ago

LauraVP1994 commented 1 year ago

Dear,

We are using your tool on spiked samples and come across some discrepancies between what we expect and what we get. For example we used a bought RNA virus at 100% that should be B.1.1.7 (this result was also confirmed on GISAID by the company and we ourselves confirmed this also with the latest version of Pangolin. However with Freyja (v1.3.12) we get Q.1 as output. This is somewhat surprising to us, because based on the GISAID database with the Pangolin results we defined that the difference between B.1.1.7 and Q.1 is one mutation, namely C15400T, and when looking both in our consensus results and results at low-frequency we did not find this mutation and it was 100% C at this position.

Therefore, we were wondering whether you have an explanation for this difference?

Thank you!

joshuailevy commented 1 year ago

Hi @LauraVP1994!

Thanks for bringing this up! As you mention, the two lineages can be distinguished by a single mutation (C14220T- note, not C15400T). However the barcodes actually include not only this mutation, but also a few others that are largely shared by B.1.1.7 and Q.1 as far as I can tell ('C3267T', 'C5388A', and 'G28048T') and it's they possible should be further up the branch and may be misattributed in the UShER global phylogenetic tree- will need to look in greater detail/chat with some of the UShER folks. It's possible these may be causing the difference in assignment.

Have you run these samples using the default usher mode of pangolin or with the pangoLearn model? If you are using the usher mode Freyja should produce the same result (they both use the same tree), but it's possible something funny is going on.

Best, Josh

LauraVP1994 commented 1 year ago

Indeed, was due the numbering of a sample but compared to the reference it should be C14220T.

I was wondering where you find these barcodes, because we ran a script on all GISAID sequences and assigned mutations as characteristic for a lineage if more than 80% of the sequences attributed to this lineage (with Pangolin) had the mutation. When comparing B.1.1.7 and Q.1 the only difference I observe is C15400T. As for these three mutations you mentioned, we found them to be attributed to both B.1.1.7 and Q.1

We normally used the usher mode of pangolin.

joshuailevy commented 1 year ago

Yeah, I agree- they seem to be common to both lineages. It's possible that this is an issue with the UShER tree itself or some of the tree processing tools available in matUtils(will follow up on this). To pull out the barcodes you'd just need a script like the following (and the usher_barcodes.csv file):

check_barcodes.py.txt

LauraVP1994 commented 1 year ago

Thank you for the quick response. Would be great if you could look into this. We seem to observe this issue of the classification of sublineages already in multiple samples especially now in recent samples that are all omicron sublineages... These results were obtained without bootstrapping though, would our results be better after bootstrapping?

joshuailevy commented 1 year ago

A bit more detail would help better diagnose what's going first, it's entirely possible to get misleading results if genome coverage is low, for example. Bootstrapping will help quantify confidence in the estimated lineage frequencies, but won't lead to different predictions.

Any chance you can provide an example fastq/bam? If you'd rather not share them publicly, you can also send them to me at jolevy@scripps.edu.

LauraVP1994 commented 12 months ago

Dear,

Here you can find one of the bam files (https://drive.google.com/file/d/14jVGTCzs_kXf-BECLASkGGqXsi4pUMkE/view?usp=drive_link). We are testing this tool with different coverages and initial viral concentrations and this sample should have a relatively high coverage and viral concentration, so it would surprise me that this would be the problem, especially as I can also not find these mutations in the lofreq files...

Kind regards Laura

joshuailevy commented 12 months ago

Thanks a ton! I'll investigate this more today.

joshuailevy commented 12 months ago

Had to realign that bam to the Hu-1 reference, but I agree- there's something not quite right about the placement of the B.1.1.7 root. The three mutations I mentioned above ('C3267T', 'C5388A', and 'G28048T') seem to be ubiquitous among B.1.1.7 and Q.1. Just reached out to @AngieHinrichs, who should be able to help sort this out :)

https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?pangoLineage=B.1.1.7&pangoLineage1=Q.1&analysisMode=CompareEquals&

AngieHinrichs commented 11 months ago

Sorry about the delayed reply - you caught me in the middle of a vacation. 🙂 I originally put the B.1.1.7 label on a node in the UShER tree that is a few nodes upstream of the real (complete) B.1.1.7 node, to include some sequences that were missing one or more of the mutations that @joshuailevy mentioned. That helps to identify more VoC Alpha sequences, but it does omit some mutations that we would expect to find in Alpha sequences.

So to better accommodate both needs (calling the correct lineage for incomplete sequences and correctly characterizing mutations in the lineage), I'm going to add a B.1.1.7_dropout label in the UShER tree (when used by pangolin, the _dropout will be removed so those sequences are assigned B.1.1.7) a few nodes more upstream of the current B.1.1.7, and I will move the B.1.1.7 label downstream to include C3267T, C5388A, and G28048T. That will take effect in the 2023-08-14 tree when it is finished later today (probably within ~6 hours).

(@joshuailevy BA.1 and BA.2 already have _dropout labels, but they also are labeled a bit upstream of what I think is their real start - should I move those too?)

joshuailevy commented 11 months ago

No worries at all, thanks for following up on this Angie! Makes sense why those might end up upstream, I'm curious whether or not those are real intermediates or just reversions or sequencing errors. I think adding the dropout label is the right way to handle this though.

I just looked at BA.1, which I as you mention is currently rooted a bit further upstream, such that BA.1.1 has an extra ['C24503T', 'G22599A', 'G23048A', 'T5386G'] in relation to BA.1, whereas I believe it's supposed to just differ by G22599A (the R346K mutation). I think it probably makes sense to do something similar there, such that the other three mutations become part of BA.1. For BA.2 it looks like all of the BA.2 descendants have an extra C2790T, suggesting that this mutatino may be better attributed to the BA.2 root as well?

LauraVP1994 commented 11 months ago

Thank you for having a look! I was wondering if the Omicron variants also have been included in this new version by now? Moreover, when I use freya update (with conda) the version is 1.3.12 but when I check releases on github it is already at 1.4.5.

joshuailevy commented 11 months ago

No problem @LauraVP1994. I think this issue should be solved now.

With regard to the above message: You'll need to update Freyja itself using conda, actually. freyja update will only update the barcodes, it will not update the freyja package itself.

Josh

LauraVP1994 commented 10 months ago

Hello,

I have rerun it with the update and it seems like the problem persists? I still get Q.1 instead of B.1.1.7...

Kind regards Laura

joshuailevy commented 10 months ago

Hi @LauraVP1994! I just tested it out on the sample you sent- and it seems to be working fine. Output looks like this:

summarized  [('Alpha', 0.9983097010145707)]
lineages    B.1.1.7
abundances  0.99830970
resid   7.048867169874522
coverage    99.76925392101127

I had to convert back to fastq format and then realign to the Hu-1 reference, but it seems to be performing as expected. When I open the bam file you sent it looks like there's some alignment issues. How are you performing alignment? Regardless of aligner, the reads should be aligned to the attached sequence. NC_045512_Hu-1.fasta.txt

LauraVP1994 commented 10 months ago

Hello!

I redid the analysis again. I had a new installation of freyja through conda which I updated and updated the barcodes again and that seemed to do the trick, maybe the update last time was not successful!

Thanks

LauraVP1994 commented 7 months ago

Hello,

@joshuailevy I did have an additional question. I was wondering if freyja only uses substitutions to define the lineages? In usher_barcodes I think I only saw substitutions and no indels? I was thus wondering if these were also taken into account?

Kind regards Laura

joshuailevy commented 7 months ago

Hi @LauraVP1994!

Correct, Freyja pulls lineage defining mutation data directly from the UShER global phylogenetic tree, which only includes SNPs (no indels). We have considered building versions that do incorporate indels, but it is somewhat unclear how indels should be weighted relative to substitutions (which of course is a common issue in phylogenetic inference). Happy to discuss further if you'd like!

Best, Josh