Clade B.1.25 looks confusing

biocyberman commented 4 years ago

Hi @aineniamh

I made a quick and dirty adaptation of all_snps.py script. This is to produce defining_snps.csv and a clades.tsv table for use with nextstrain. During this process, we noticed the defintion of clade B.1.25 looks confusing. See the attached screenshot below

We would expect that B.1.25 should have some unique SNPs of its own comparing to the ancestral clade (i.e. B.1) and other clades of the same level (i.e. B.1.{1,21,22,23} ) should not contain all of its SNPs.

Could you comment on this?

aineniamh commented 4 years ago

Hi @biocyberman

B.1.25 is a tricky one. Below is a snippet of the large GISAID tree with B.1.25 highlighted in green. It has pretty low support overall, but given all the sequences are from Australia and there is quite a lot of diversity nested within this lineage, I think its real. Screenshot 2020-05-11 at 17 11 55

I have this lineage flagged and am waiting on this week's data to see if more data sheds some light on this lineage. It may well be that it's not actually monophyletic with added data, and if that's the case I'll revise the definition, or potentially drop it and reassign the sequences.

There's an additional SNP at 3564GT, but due to an ambiguities within the lineage it's not called as lineage defining. iqtree still places all these sequences together despite the lack of common SNP to all. This is partly why a phylogenetic placement strategy is nice because it is more flexible than requiring a single SNP and other variation is pulling the sequences together. With the all_snps.py script, it's possible to specify the lower limit of what is lineage defining. The current default is 100% of tips within a lineage, but that doesn't work, to check for SNPs that 90% of tips have. It's a bit crude and I have plans to specifically check for ambiguous sites and dynamically lower the threshold if needed, but the current algorithm has been fine for our purposes so far.

These lineage designations are not claiming to be the "true" lineages or stable definitions. As more data is produced and incorporated each week, they are revised. We are using all the pieces of evidence available to us to try to classify in a useful manner what we think are lineages. We define lineages as potential new clusters that have begun in a geographically distinct place (or potentially are associated with other events). Every week we're reviewing this as more data is produced and we get a clearer picture. If something turns out not to be a lineage, we revise the definition.

We're also in the process of trying to set up a public forum in which lineages that fit with our naming system could be suggested and reviewed and would be happy to take suggestions if you have some yourself.

biocyberman commented 4 years ago

That's helpful explanation! Maybe because of this, pangolin doesn't assign any of our sequences to B.1.25. It is just B.1. When I used clade table (keeping SNPS reported on multiple clades) and run nextstrain, some of them could be assigned to B.1.25.

However we did some manual checking with CoV-GLUE for one of the B.1.25 sequences. We could see that the sequence is also classified B.1 and in the proximity of B.1.25. In addition, many B.1.25 on the tree are scattered around (not monophylonetic)

By the way, do CoV-GLUE and COG UK use pangolin and lineages in their back end? Just to see how we can give an unified reference for clade classification of sequences.

I am also testing a different approach to annotate the tree for now.

aineniamh commented 4 years ago

Yeah, I mean I'm not surprised that given the bootstraps, different ML estimations give different topologies. For constructing the pangolin guide tree, within each lineage we mask singleton SNPs, which should help avoid homoplasies or sequencing error affecting the lineage assignment. I've been talking with Josh@CoV-GLUE and I'm now hosting a CSV of these singleton SNPs that are being masked within the guide tree, so CoV-GLUE will be as consistent as possible with our tree, given that it's heuristic. COG UK is using pangolin and lineages.

With this weeks data, I'll re-evaluate this lineage and I'm also going to introduce a new stat to describe each lineage. Currently we have a pangolin recall rate for all of GISAID (~97%) but I'm going to introduce a per-lineage recall rate measure and this will feed into what we classify as a lineage or not.

joshsinger commented 4 years ago

Just to round off this discussion, CoV-GLUE does use the lineages resource but not pangolin. Specifically it uses the lineages CSV file. It then builds its own alignment and tree, and has its own lineage assignment method for sequences. It uses the same software for building the alignment and very similar software for building the tree. But the assignment method is quite different -- CoV-GLUE uses an "evolutionary placement algorithm" (see RAxML / pplacer), which is something I have applied in a variety of virus typing tools. So, overall CoV-GLUE may be thought of as an alternative method to pangolin for calling lineages, but one that is based on the same phylogenetic judgements and assumptions. The two methods should substantially agree; we will soon be able to test this. Having two methods rather than one could be quite useful: the two methods can cross-validate each other, and it also allows users confirm/query the assignments.

aineniamh commented 4 years ago

Yeah, I think it's a really good idea to have the two methods. The ML trees are so big at this point, and we're using the UFbootstrap from IQTree, which is pretty approximate, so confirming these lineages/ showing if they aren't consistent across multiple ML trees is really useful.

biocyberman commented 4 years ago

@joshsinger Any chance to have an remote query API (i.e. batch query via a script) or open-source it so users can setup a local instance (either with a container or a conda environment)?

joshsinger commented 4 years ago

@biocyberman GLUE itself is designed to be run locally, for HCV drug resistance I have users running it in Docker for example. However, I can't provide the means to set up a local instance of CoV-GLUE due to GISAID legal issues.

I am thinking of providing a way for users to automate remote queries. I think possibly the simplest way would for me to provide a small Java jar which could be run on the command line. It would have no dependencies except JRE 8.x or later and an internet connection. It would read in a FASTA, access CoV-GLUE over HTTP, and output to disk tab-delimited text and/or JSON files containing the CoV-GLUE lineage assignments, genome changes and other data. What do you think?

biocyberman commented 4 years ago

@biocyberman GLUE itself is designed to be run locally, for HCV drug resistance I have users running it in Docker for example. However, I can't provide the means to set up a local instance of CoV-GLUE due to GISAID legal issues.

That sounds great that it is already running in a docker container. You can decouple GISAID data and code, and make a build script of some sort so end users can prepare the GISAID they download by themselves.

I am thinking of providing a way for users to automate remote queries. I think possibly the simplest way would for me to provide a small Java jar which could be run on the command line. It would have no dependencies except JRE 8.x or later and an internet connection. It would read in a FASTA, access CoV-GLUE over HTTP, and output to disk tab-delimited text and/or JSON files containing the CoV-GLUE lineage assignments, genome changes and other data. What do you think?

This also works, but the there will be a weak link and bottle neck with internet connection, and central load on the server at your place. In addition, running locally also give peace of mind for running large number of sequences, dealing with sensitive data/metadata. So it would best with the docker container running locally.

joshsinger commented 4 years ago

You can decouple GISAID data and code, and make a build script of some sort so end users can prepare the GISAID they download by themselves.

This is possible and would have the advantages you mentioned. It's just a lot more work than the other idea!

biocyberman commented 4 years ago

Ok, I didn't see that GLUE source code is already open-soure and ready for download. That makes things much easier, at least from my perspective. I can help with preparing the docker image. The only part that will speed things up is to have the data preparation script for ncov. Can you share that?

@aineniamh Sorry for rather off-topic discussion. I will take it to GLUE repo or via email to @joshsinger if I can find the email address.

joshsinger commented 4 years ago

@aineniamh also sorry @biocyberman it's josh.singer@glasgow.ac.uk

aineniamh commented 4 years ago

No worries!

cov-lineages / lineages

Clade B.1.25 looks confusing #8