cov-lineages / lineages

Resources for calling and describing the circulating lineages of SARS-CoV-2

GISAID IDs for sequences in tree #9

Closed SarahNadeau closed 4 years ago

SarahNadeau commented 4 years ago

Thanks for this resource! Would it be possible to publish a mapping between the tips in the pangolin guide tree (anonymised.aln.fasta.treefile) and GISAID IDs? I'd like to be able to add sequences to the tree for a visualization of where our Swiss sequences fall in the global context. Unless I'm missing something, I think I need more info to be able to pull the relevant sequences from GISAID.

aineniamh commented 4 years ago

Unfortunately, that would be against the GISAID data sharing agreement. We anonymise the names in the guide tree and alignment specifically so we're not sharing information we're not allowed to.

Also, the guide tree sequences are heavily masked (all singleton SNPs within each lineage are masked out), so they don't truly reflect the GISAID sequences they derive from. You wouldn't necessarily want to label them with the original ID anyway.

Sorry about that. If you're interested in seeing this information, running pangolin through https://pangolin.cog-uk.io/ will let you see where your sequence lies in the global tree, and if you're interested in running some phylogenetics, I think it's fairly straightforward to apply for GISAID access. From early next week, we'll have a GISAID ID -> lineage mapping hosted on this repository, so it should be fairly straightforward to get access to the sequences yourself.
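To illustrate why masked tips no longer match their source sequences, here is a minimal sketch of one possible reading of "singleton SNP masking". It is not pangolin's actual code. It assumes a singleton is a base that differs from the lineage's most common base at a column and occurs in exactly one sequence of that lineage. The function name `mask_singletons`, the tip names, and the toy sequences are all made up for illustration.

```python
from collections import Counter, defaultdict


def mask_singletons(alignment, lineages, mask_char="N"):
    """Mask per-lineage singleton SNPs (hypothetical helper, not pangolin's).

    alignment: dict of tip name -> aligned sequence (all the same length)
    lineages:  dict of tip name -> lineage label
    """
    # Group tip names by lineage.
    groups = defaultdict(list)
    for name in alignment:
        groups[lineages[name]].append(name)

    masked = {}
    for names in groups.values():
        seqs = [list(alignment[name]) for name in names]
        for col in range(len(seqs[0])):
            counts = Counter(seq[col] for seq in seqs)
            # Most common base in this lineage at this column
            # (ties are broken by first occurrence).
            consensus = counts.most_common(1)[0][0]
            for seq in seqs:
                base = seq[col]
                # A non-consensus base seen only once is treated as a singleton.
                if base != consensus and counts[base] == 1:
                    seq[col] = mask_char
        for name, seq in zip(names, seqs):
            masked[name] = "".join(seq)
    return masked


# Toy data: tip names and sequences are invented placeholders.
alignment = {
    "1_A": "ACGT", "2_A": "ACGT", "3_A": "ATGT",
    "1_B": "GCGT", "2_B": "GCGA", "3_B": "GCGA",
}
lineages = {name: name.split("_")[1] for name in alignment}
masked = mask_singletons(alignment, lineages)
```

In this toy case the lone `T` in `3_A` and the lone `T` in `1_B` are replaced with `N`, while the `G` shared by every lineage B sequence is kept, so a masked tip can differ from the sequence it was derived from.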

SarahNadeau commented 4 years ago

I think there was a bit of a misunderstanding, all I am looking for would be a GISAID ID -> tip ID mapping. E.g., which GISAID ID does 2_B.1 correspond to? I already have a GISAID account.

When you say GISAID ID -> lineage mapping, is this what you mean?

aineniamh commented 4 years ago

> I think there was a bit of a misunderstanding, all I am looking for would be a GISAID ID -> tip ID mapping. E.g., which GISAID ID does 2_B.1 correspond to?

I understand what you're looking for, but I can't give that information. I also don't think it would be appropriate to label the pangolin output tree with that information, as the sequences have their singletons masked, so their representation in the phylogeny doesn't directly relate to the original sequence or GISAID ID.

If you're looking for a list of GISAID IDs that have the lineage assigned to them (that's what I mean by GISAID ID -> lineage mapping), we run pangolin on all of GISAID every week and are hosting that information. But this won't be limited to the sequences in the guide tree: it will include them along with all other GISAID sequences that passed QC on our end.
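As a rough sketch of how such a mapping could be used to collect the IDs for one lineage before pulling the sequences from GISAID yourself: the example below assumes the mapping is a CSV file with columns named `gisaid_id` and `lineage`. The hosted file's actual name and layout are not stated in this thread, so the column names, the helper `ids_for_lineage`, and the IDs are placeholders.

```python
import csv
import io


def ids_for_lineage(csv_text, lineage, id_col="gisaid_id", lineage_col="lineage"):
    """Return the IDs assigned to `lineage` (column names are assumptions)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[id_col] for row in reader if row[lineage_col] == lineage]


# Invented placeholder rows, not real GISAID records.
example_csv = """gisaid_id,lineage
EPI_ISL_000001,B.1
EPI_ISL_000002,A
EPI_ISL_000003,B.1
"""

b1_ids = ids_for_lineage(example_csv, "B.1")
```

The match is exact, so asking for `B.1` would not return sublineages such as `B.1.1`; whether to include those depends on the analysis.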

SarahNadeau commented 4 years ago

Ah, I see. I was somehow thinking the encryption of the sequences is a sufficient hindrance to those without GISAID access, so that it would be okay to de-anonymise the sequences. Thanks for the explanation.

aineniamh commented 4 years ago

When we asked about it, I think the anonymisation was actually the more important part. Sorry about that! I also feel it's important for us to err on the side of caution here, to keep this resource publicly available!