Include variations from processed data in nextstrain

SWISS-MODEL / covid-19-Annotations-on-Structures

Mapping sequence data onto structures for the Covid-19 Biohackathon April 2020

https://github.com/virtual-biohackathons/covid-19-bh20/wiki/Annotations-on-Structures

MIT License


Include variations from processed data in nextstrain #10

Open gtauriello opened 4 years ago

gtauriello commented 4 years ago

Goal is to have a structure-mapped version of the variations displayed in nextstrain. We envision the following required steps:

parse json data following their dev docs
map variations onto UniProtKB ACs used in SWISS-MODEL (the work done at the UCSC Genome Browser could be helpful for this)
define colors and annotation texts for variations
test using SWISS-MODEL's annotation system
properly acknowledge source of data (see also "Data" section in nextstrain's README)
followups: add possibility to filter results (e.g. only from country X or certain confidence), process into entropies, ...
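
The first step above (parsing the JSON) could be sketched roughly as follows. This assumes the Auspice v2 dataset layout, where each tree node may carry `branch_attrs.mutations` keyed by gene name (with nucleotide changes under `nuc`) and a `children` list. Those key names are an assumption to check against nextstrain's dev docs, `collect_aa_mutations` is a hypothetical helper, and the toy tree is hand-made, not real nextstrain data.

```python
from collections import defaultdict


def collect_aa_mutations(root):
    """Collect amino-acid mutations per gene from a nextstrain-style tree.

    Assumed layout (to be verified): node["branch_attrs"]["mutations"] is a
    dict like {"nuc": [...], "ORF1a": ["L2235I", ...]} and node["children"]
    is a list of child nodes.
    """
    found = defaultdict(set)
    stack = [root]  # iterative walk, so deep trees do not hit the recursion limit
    while stack:
        node = stack.pop()
        mutations = node.get("branch_attrs", {}).get("mutations", {})
        for gene, changes in mutations.items():
            if gene == "nuc":  # skip nucleotide-level changes
                continue
            found[gene].update(changes)
        stack.extend(node.get("children", []))
    return found


# Hand-made toy tree in the assumed layout (not real data).
toy_tree = {
    "name": "root",
    "children": [
        {
            "name": "tip_1",
            "branch_attrs": {"mutations": {"nuc": ["C100T"], "ORF1a": ["L2235I"]}},
        },
        {
            "name": "tip_2",
            "branch_attrs": {"mutations": {"ORF1a": ["N3833K"]}},
        },
    ],
}
result = collect_aa_mutations(toy_tree)
print(sorted(result["ORF1a"]))
```

With a downloaded dataset, the object passed in would presumably be the value under the top-level `tree` key after `json.load`.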

gtauriello commented 4 years ago

Preliminary work by @jttkim (see here) could be a great starting point for such an effort.

tomasMasson commented 4 years ago

I'll start working on the scripts to fetch variation data from Nextstrain. If you want, @gtauriello, I can create a new branch so everyone can see/review the code.

gtauriello commented 4 years ago

That's great. Thank you. Yes please do this in a new branch or start a pull request early so people can comment on your code.

D-Barradas commented 4 years ago

Hi @tomasMasson, I'm interested in the branch you will create, since I was also working on parsing the variation data from nextstrain. I got a result, but my code is very basic and could be more pythonic, so I'd really like to see other code. Also, some of the mutations I found look very strange to me, like N3833K (below). I retrieved about 50 like this, so I'm asking for a friend here: does anybody know what's with that large number?

   gene     GenBank accession    gisaid_epi_isl    mutations    author
   ORF1a    LR757998             EPI_ISL_406798    L2235I       Chen et al
   ORF1a    LR757998             EPI_ISL_406798    N3833K       Chen et al

gtauriello commented 4 years ago

@D-Barradas not sure what you mean by strange mutations. You mean because of 3833 being a large number? ORF1a (aka 'Replicase polyprotein 1a' or P0DTC1 or R1A_SARS2) is indeed a 4405 AA long polyprotein (which is cut into smaller pieces). So not too surprising.

Also please don't map mutations to ORF1a but to the longer ORF1ab (aka 'Replicase polyprotein 1ab' or P0DTD1 or R1AB_SARS2) as described in the README of this repo whenever possible. There is a small part (nsp11) at the end of ORF1a where this is ambiguous though due to a ribosomal frameshift (see here for details). There you can either map genome-level variations to both ORF1a and ORF1ab or just keep ignoring the ORF1a part since I am not aware of any relevant role of nsp11.

D-Barradas commented 4 years ago

@gtauriello thanks for answering my question. It was indeed about the number, since I was thinking in terms of smaller pieces (400 aa). Another question: nextstrain reports ORF1a and ORF1b as separate entities. Should we also ignore ORF1b just to be safe?

	ORF1a	ORF1b
end	13468	21555
seqid	config/reference.gb	config/reference.gb
start	266	13468
strand	+	+
type	CDS	CDS

gtauriello commented 4 years ago

With ignoring I just meant the part in ORF1a which differs from ORF1ab. Just to be clear...

For the naming used here with ORF1a and ORF1b, we should keep all those variations and map them to ORF1ab (P0DTD1) for both. I suppose one needs to be careful with mutations at genome position 13468, though, as they can affect 2 amino acids, and I have no idea how nextstrain handles that.
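
To make the point about genome position 13468 concrete, here is a small sketch using the ORF1a/ORF1b coordinates quoted earlier in this thread. It assumes each CDS is read in frame from its own start coordinate on the + strand; `codons_hit` is a hypothetical helper name.

```python
# CDS coordinates (1-based, inclusive) as quoted from nextstrain in this thread.
CDS = {"ORF1a": (266, 13468), "ORF1b": (13468, 21555)}


def codons_hit(genome_pos, cds=CDS):
    """Return (gene, aa_position, base_in_codon) for every CDS covering genome_pos."""
    hits = []
    for gene, (start, end) in cds.items():
        if start <= genome_pos <= end:
            offset = genome_pos - start
            hits.append((gene, offset // 3 + 1, offset % 3 + 1))
    return hits


print(codons_hit(13468))
# [('ORF1a', 4401, 3), ('ORF1b', 1, 1)]
```

Under that assumption, nucleotide 13468 is the third base of ORF1a codon 4401 and also the first base of ORF1b codon 1, which is why a single change there can touch two amino acids.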

It seems that nextstrain already maps the mutations into protein-sequence space and so with an appropriate offset you should be able to easily map ORF1b to ORF1ab. But please do add some sanity checks to make sure that the sequences match (i.e. if you map "K2160E" from ORF1b onto P0DTD1 we expect a 'K' at that position...).
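
A minimal sketch of the offset mapping with the suggested sanity check might look like this. The offset is taken as the number of codons in nextstrain's ORF1a (266..13468), which is an assumption that the residue check is meant to catch if it is wrong. `map_orf1b_to_orf1ab` is a hypothetical helper, and the example uses a made-up toy sequence rather than the real P0DTD1 sequence.

```python
import re

# Codons in nextstrain's ORF1a (266..13468), from the coordinates quoted in this
# thread. Using this as the ORF1b -> ORF1ab offset is an assumption.
ORF1A_CODONS = (13468 - 266 + 1) // 3  # 4401


def map_orf1b_to_orf1ab(mutation, orf1ab_seq, offset=ORF1A_CODONS):
    """Shift an ORF1b mutation such as 'K2160E' into ORF1ab numbering.

    Raises ValueError if the mutation cannot be parsed or if the reference
    residue does not match orf1ab_seq at the shifted position.
    """
    match = re.fullmatch(r"([A-Z*-])(\d+)([A-Z*-])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    ref, alt = match.group(1), match.group(3)
    pos = int(match.group(2)) + offset  # 1-based position in ORF1ab
    if pos > len(orf1ab_seq) or orf1ab_seq[pos - 1] != ref:
        raise ValueError(f"{mutation}: expected {ref!r} at ORF1ab position {pos}")
    return f"{ref}{pos}{alt}"


# Toy example: made-up 10-residue sequence and offset 4 (not real data).
toy_seq = "MESLVKGHFV"
print(map_orf1b_to_orf1ab("K2E", toy_seq, offset=4))  # K2E -> K6E
```

For real use, `orf1ab_seq` would be the P0DTD1 sequence, and any mutation that raises here points to either a wrong offset or a reference mismatch worth looking into.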

gtauriello commented 4 years ago

A possible followup for this could use data from the China National Center for Bioinformation as done in this related resource from UC Riverside: https://coronavirus3d.org/index.html

gtauriello commented 4 years ago

Two more comments on the above:

Unsure whether that source for mutations is illegally bypassing GISAID data sharing policies (based on discussions in the public_sequence_resource topic of the biohackathon). So we should probably use it with care. The main sources of data there are GISAID and GenBank.
Nextstrain is subsampling their phylogenetic tree (see this discussion here). So we may need another approach to get the full set of variations.

tomasMasson commented 4 years ago

I'll take a look at both points.

tomasMasson commented 4 years ago

It looks like the Nextstrain team is releasing the full dataset (12397 genomes) on their viz page (https://github.com/nextstrain/ncov/issues/364#issuecomment-622257239), with the raw data living at http://data.nextstrain.org/ncov_global.json. However, I could count only 3123 GISAID genomes (by passing the JSON data through a grep filter on the command line). Additionally, at http://cov-glue.cvr.gla.ac.uk/#/home they released a table with amino acid replacements for the GISAID sequences. The problem with this site is the lack of a download button for the data (it is an alpha version, so maybe they are going to add it later).
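
For anyone wanting to repeat the count without grep, a rough Python sketch is below. It counts distinct strings of the form `EPI_ISL_<digits>` in the raw JSON text, so it will not necessarily match a line-based grep count. `count_gisaid_ids` is a hypothetical helper, and the sample string is made up apart from the accession already quoted in this thread.

```python
import re


def count_gisaid_ids(json_text):
    """Count distinct GISAID accession IDs (EPI_ISL_<digits>) in raw JSON text."""
    return len(set(re.findall(r"EPI_ISL_\d+", json_text)))


# Made-up sample; EPI_ISL_000001 is a placeholder, not a real accession.
sample = '{"a": "EPI_ISL_406798", "b": "EPI_ISL_406798", "c": "EPI_ISL_000001"}'
print(count_gisaid_ids(sample))  # 2
```

On a downloaded copy of the dataset, the file contents would be read into a string and passed to the function in the same way.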