cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0

Lineage defining mutations #168

Closed jdeligt closed 3 years ago

jdeligt commented 3 years ago

I was wondering if you have something similar to this: https://github.com/nextstrain/ncov/blob/master/defaults/clades.tsv for pangolin?

I'm basically looking for an overview of the nucleotide changes that 'define' a certain lineage.

aineniamh commented 3 years ago

Hi @jdeligt, this is a highly requested feature! We haven't got that in place in our setup at the moment partly due to the nature of the assignment model (the decision tree rules are difficult to tease apart to give exact SNPs), but it's something we're working on for the next set of releases for pangolin as the next model will make it more feasible to provide this for each lineage.

jdeligt commented 3 years ago

Thank you for your reply. I know it's a hard problem, so it's exciting to hear this is being worked on. I'll leave this one open so that people looking for that data can find it and know the status.

aretchless commented 3 years ago

Thanks for the update Áine. There are a couple of places where the documentation refers to this sort of file, but I can't find the file. Based on your description above, I take it that those references are outdated... or am I missing something?

https://github.com/cov-lineages/pangolin/releases/tag/v2.0 "pangoLEARN contains information about the top SNPs that are most positively and negatively associated with a given lineage. The lineage recall report is also available in this repository."

https://cov-lineages.org/pangolin_docs/pangolearn.html "We have pulled out informative sites and this information is included in the data release on pangoLEARN. The top SNPs that are most positively and negatively associated with a given lineage are detailed in those files. More details on this release and its practicalities can be found here."

domenico-simone commented 3 years ago

Hi,

as a follow-up to the thread here, I am wondering whether there are plans to make the list of lineage-defining SNPs available soon. Description of genomes (especially when it comes to non-lineage-defining SNPs) and VOC/VUI investigations would surely benefit from this feature.

Thanks for your impressive work!

rambaut commented 3 years ago

Hi, the new Scorpio tool is designed for doing just that - https://github.com/cov-lineages/scorpio/

aineniamh commented 3 years ago

There's a new tool from our group by Rachel and Ben called scorpio that can fetch the defining set of mutations from a set of genomes (either relative to an outgroup or the early haplotype from Wuhan). There's also a small number of constellation files for the VOCs available at https://github.com/cov-lineages/constellations/tree/main/constellations/definitions. This has all just been developed very recently and more documentation will be written up shortly. If you're looking for a resource that can give the mutations for all lineages, outbreak.info is a really great website that has lists of SNPs per lineage.
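To illustrate what "the defining set of mutations from a set of genomes relative to a reference" could mean in practice, here is a minimal Python sketch. It is not scorpio's actual algorithm or interface: the function name `shared_mutations`, the `min_fraction` threshold, and the toy sequences are all hypothetical, and the sketch assumes the genomes are already aligned to the reference and have the same length.

```python
from collections import Counter


def shared_mutations(reference, genomes, min_fraction=0.9):
    """Return substitutions (e.g. 'A5T') carried by at least
    `min_fraction` of the genomes, relative to `reference`.

    Hypothetical helper for illustration only; assumes every genome is
    aligned to the reference (same length, 1-based coordinates).
    """
    if not genomes:
        return []
    counts = Counter()
    for seq in genomes:
        if len(seq) != len(reference):
            raise ValueError("genomes must be aligned to the reference")
        for pos, (ref_base, alt_base) in enumerate(zip(reference, seq), start=1):
            # Count only unambiguous substitutions; skip N, gaps, etc.
            if alt_base != ref_base and alt_base in "ACGT":
                counts[(pos, ref_base, alt_base)] += 1
    threshold = min_fraction * len(genomes)
    return [
        f"{ref_base}{pos}{alt_base}"
        for (pos, ref_base, alt_base), n in sorted(counts.items())
        if n >= threshold
    ]


# Toy data: all three genomes carry A5T, only one carries T8A.
reference = "ACGTACGT"
genomes = ["ACGTTCGT", "ACGTTCGA", "ACGTTCGT"]
print(shared_mutations(reference, genomes))
```

Lowering `min_fraction` would also report substitutions seen in only part of the set, which is one way to separate candidate defining mutations from sporadic ones.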

domenico-simone commented 3 years ago

Hi @rambaut and @aineniamh thanks for your replies and hints! I'll test scorpio as soon as possible, in the meantime outbreak.info will do the job. Are you planning to integrate scorpio in pangolin? :grin:

aineniamh commented 3 years ago

We are! It's currently integrated on this branch: https://github.com/cov-lineages/pangolin/tree/newscoring

We're planning to merge into the master next week after some more testing and updating of documentation!

domenico-simone commented 3 years ago

Awesome! Thank you!

cutpatel commented 3 years ago

Are there any plans to make the nucleotide changes available, as asked in the first post? This is still mostly gene-based, with the AA changes. Also asked in #126.

aineniamh commented 3 years ago

Hi @cutpatel, that isn't really a pangolin issue: we're not hosting any coordinates on this repo now, and the config files for post hoc tests were never intended as a reference, just for internal assignment.

I don't think I can resolve your issue #126 as we're not hosting that information here. Apologies! If it's helpful, here's the link to gene coordinates on GenBank: https://www.ncbi.nlm.nih.gov/nuccore/1798174254. If you need nucleotide coordinates you should be able to convert them with a relatively simple function.
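As a rough idea of what such a conversion function might look like, here is a minimal Python sketch. The function name `codon_coordinates` and the `GENE_START` table are hypothetical and not part of pangolin. The gene start positions are 1-based coordinates on the Wuhan-Hu-1 reference and should be checked against the GenBank record before use. The sketch also assumes a contiguous coding region, so it would not be correct as-is for ORF1ab past the ribosomal frameshift.

```python
# 1-based start coordinates of two genes on the Wuhan-Hu-1 reference.
# Illustrative values; verify them against the GenBank record.
GENE_START = {
    "S": 21563,
    "N": 28274,
}


def codon_coordinates(gene, aa_position, gene_start=GENE_START):
    """Map a 1-based amino acid position within a gene to the 1-based
    genome coordinates (first, last) of its codon.

    Hypothetical helper; assumes the gene is one contiguous coding
    region with no frameshift or splicing.
    """
    if aa_position < 1:
        raise ValueError("aa_position must be >= 1")
    first = gene_start[gene] + 3 * (aa_position - 1)
    return first, first + 2


# Spike residue 501 maps to the codon at 23063..23065.
print(codon_coordinates("S", 501))  # (23063, 23065)
```

A nucleotide change then falls on one of the three positions in the returned range; which one depends on the specific substitution.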

rmcolq commented 3 years ago

This issue is now stale, so I am closing it.