Lineage B.1.25 is not monophyletic

joshsinger commented 4 years ago

Hi, this refers to the 7th May version. One of the 5 sequences from B.1.25 is in a different bit of the tree in anonymised.aln.fasta.treefile from where it should be, grouping just outside B.1.29.

B.1.25 was also not monophyletic in my tree, computed independently from the same CSV file. A couple of other lineages are not monophyletic in my tree either: B.1.42 and A.1. Both are only slightly off, but if you are interested I can supply details, as it may indicate issues in the lineage definitions.
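To make the check concrete, here is a minimal sketch of what "monophyletic" means on a rooted tree: a set of tips is monophyletic if some clade contains exactly those tips and no others. The nested-tuple tree and the tip names (`s1`..`s5`, `t1`, `t2`, `x1`) are made up for illustration and are not the real sequences or the real treefile; on an unrooted tree the answer also depends on the rooting.

```python
# Hypothetical toy tree: tuples are internal nodes, strings are tip names.
def tips(node):
    """Return the set of tip names under a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= tips(child)
    return result


def is_monophyletic(tree, members):
    """True if some clade of the rooted tree contains exactly `members`."""
    members = set(members)
    stack = [tree]
    while stack:
        node = stack.pop()
        node_tips = tips(node)
        if node_tips == members:
            return True
        # Only descend into subtrees that still contain every member.
        if not isinstance(node, str) and members <= node_tips:
            stack.extend(node)
    return False


# Four lineage tips group together; the fifth (s5) sits next to t1/t2.
tree = (((("s1", "s2"), ("s3", "s4")), "x1"), (("t1", "t2"), "s5"))
lineage = {"s1", "s2", "s3", "s4", "s5"}

print(is_monophyletic(tree, lineage))           # all five tips
print(is_monophyletic(tree, lineage - {"s5"}))  # without the stray tip
```

With one tip placed elsewhere, no single clade holds exactly the five lineage tips, which is the situation described above.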

Also, related to the issue I previously raised (https://github.com/hCoV-2019/lineages/issues/2): are you perhaps masking out some nucleotide positions in your alignment, e.g. because they are thought to be sequencing errors? This could explain differences between our trees. If so, it would be super useful for me if these could be documented somewhere, ideally in another CSV file :) I guess it would make your method more reproducible too!

aineniamh commented 4 years ago

Hi Josh,

You're right, we are masking out some SNPs. I've written up some documentation about the process we use for building our guide tree and alignment and am hosting it here now.

I've assigned lineages based on monophyly in our large GISAID tree, which is trimmed to CDS and has some, but minimal, masking. The guide tree, however, is built on a heavily masked alignment, as all singleton SNPs within a lineage are masked out.

This masking in my pipeline wouldn't affect the lineage definitions but could explain the difference in topology between my guide tree and your one with the same sequences.

The intention of the masking is to be very conservative and exclude anything that could be caused by sequencing error in the assignment process in pangolin (i.e. the rationale is "don't believe it until we've seen it twice") and we've seen really nice improvements using this.
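As a rough illustration of the "seen it twice" rule, the sketch below masks any SNP (relative to a reference) that is carried by only one sequence within a lineage. It is not the pangolin pipeline code: the function names, the toy sequences and the use of `N` as the mask character are all assumptions, and it only considers unambiguous A/C/G/T differences in an already aligned set of sequences.

```python
from collections import Counter


def snps(reference, seq):
    """Yield (position, base) for each unambiguous difference from the reference."""
    for pos, (ref_base, base) in enumerate(zip(reference, seq)):
        if base != ref_base and base in "ACGT":
            yield pos, base


def mask_singletons(reference, lineage_seqs, mask_char="N"):
    """Mask SNPs that appear in only one sequence of the lineage."""
    # How many sequences in the lineage carry each (position, base) SNP?
    counts = Counter(
        snp for seq in lineage_seqs.values() for snp in snps(reference, seq)
    )
    masked = {}
    for name, seq in lineage_seqs.items():
        chars = list(seq)
        for pos, base in snps(reference, seq):
            if counts[(pos, base)] == 1:  # seen once only: do not trust it
                chars[pos] = mask_char
        masked[name] = "".join(chars)
    return masked


# Toy data: the A->T change at index 4 is shared by all three sequences,
# the G->C change at index 2 is private to seq3.
reference = "ACGTACGT"
lineage_seqs = {"seq1": "ACGTTCGT", "seq2": "ACGTTCGT", "seq3": "ACCTTCGT"}
masked = mask_singletons(reference, lineage_seqs)
print(masked["seq3"])
```

The shared SNP survives because it is seen more than once, while the private one is replaced by the mask character.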

Homoplasies are likely to be masked out if they're seen only once in each of several lineages, but this is all to try to provide a less ambiguous lineage assignment, as that sort of situation could lead to the sequence jumping around the tree during the iqtree step.

Hope this clears things up. I'd been meaning to write up docs since your earlier issue (#2). Happy to chat about this further if you'd like more detail.

In relation to the specific cases you've mentioned, I've looked into them and have attached a couple of images showing where they live in the big tree.

Firstly, B.1.25: despite the low internal bootstraps, which could be due to homoplasies and/or sequencing error, the metadata suggests these sequences are a lineage circulating in Victoria.

[Image: Screenshot 2020-05-11 at 17 11 55]

A similar situation applies to B.1.42. The low parent bootstrap could be due to a number of reasons, but seeing all the Danish sequences clustered together like this is very compelling from an epidemiological point of view.

[Image: Screenshot 2020-05-11 at 17 15 15]

A.1 is big, so I don't have a screenshot, but it corresponds to a lineage circulating in Washington.

A point I would like to make, however, is that these lineage designations are not claiming to be the "true" lineages. We are using all the pieces of evidence available to us to try to classify, in a useful manner, what we think are lineages. We define lineages as potential new clusters that have begun in a geographically distinct place (or potentially are associated with other events). Every week we're reviewing this as more data is produced and we get a clearer picture. If something turns out not to be a lineage, we revise the definition.

We're also in the process of trying to set up a public forum in which lineages that fit with our naming system could be suggested and reviewed, and we would be happy to take suggestions if you have some yourself.

joshsinger commented 4 years ago

Hi Áine,

This is very useful indeed, thanks. The plans for the forum sound excellent. I'll need to have a think to figure out how to reproduce the masking in CoV-GLUE. Is it at all possible that the masked SNPs could be included in an additional column in lineages.csv for the representative sequences, e.g. like this:

name,country,travel history,sample date,epiweek,lineage,representative,masked_snps
Australia/VIC87/2020,Australia,,2020-03-15,12.0,B.2.1,1,G324T;T5325A;C24563T

? (I made up the SNPs, of course :)
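For what it's worth, a column in that shape would be easy to consume. The sketch below reads the example row with Python's standard `csv` module and splits the hypothetical `masked_snps` field on semicolons. Reading `G324T` as reference base, position, alternative base is an assumption about the notation, and the SNPs themselves are the made-up ones from the example.

```python
import csv
import io
import re

# The proposed layout, with the made-up example row from above.
example = (
    "name,country,travel history,sample date,epiweek,lineage,representative,masked_snps\n"
    "Australia/VIC87/2020,Australia,,2020-03-15,12.0,B.2.1,1,G324T;T5325A;C24563T\n"
)

# Assumed notation: reference base, position, alternative base.
SNP_PATTERN = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_masked_snps(field):
    """Split 'G324T;T5325A' into (ref_base, position, alt_base) tuples."""
    parsed = []
    for token in filter(None, field.split(";")):  # tolerate an empty field
        match = SNP_PATTERN.match(token)
        if match is None:
            raise ValueError(f"unrecognised SNP: {token!r}")
        ref_base, pos, alt_base = match.groups()
        parsed.append((ref_base, int(pos), alt_base))
    return parsed


rows = list(csv.DictReader(io.StringIO(example)))
masked_by_name = {row["name"]: parse_masked_snps(row["masked_snps"]) for row in rows}
print(masked_by_name["Australia/VIC87/2020"])
```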

aineniamh commented 4 years ago

Yes, of course. I have that information, so I can add it in. I'll close this issue to let you know when I do that, but it may be tomorrow!

joshsinger commented 4 years ago

Brilliant, thank you!

aineniamh commented 4 years ago

Hi Josh, that file exists now. It's currently a separate file called singletons.csv in the data directory. In the next iteration of the data I can add the GISAID IDs back in as well, if that would be useful.

I think for the moment it makes more sense to keep this information in a separate file, as most people accessing lineages.csv won't necessarily want to know the details of the masking that takes place in the construction of the guide tree.

joshsinger commented 4 years ago

A separate file does make more sense. This is perfect, thanks.
