[WIP] Scripts to retrieve data from Nextstrain

tomasMasson commented 4 years ago

It's only a draft script, but we can use it to start discussing ideas and make some refactoring. We can also add the code from Didier. @gtauriello @D-Barradas

D-Barradas commented 4 years ago

@tomasMasson Thank you very munch. My code is brute force compared with yours. I think from ["node_attrs"] we should keep ['genbank_accession'], ['gisaid_epi_isl'] and ['author'] , and of course the id of the Children and the mutations

gtauriello commented 4 years ago

@all-contributors please add @tomasMasson for content, code

allcontributors[bot] commented 4 years ago

@gtauriello

I've put up a pull request to add @tomasMasson! :tada:

gtauriello commented 4 years ago

@all-contributors please add @D-Barradas for content

allcontributors[bot] commented 4 years ago

@gtauriello

I've put up a pull request to add @D-Barradas! :tada:

tomasMasson commented 4 years ago

First refactoring of the code. Output is now a .csv file with the following headers (Protein, Mutation, Isolate, Author and GISAID). Omitted the genbank field because not all the samples have it (and we have the UniProtKD AC). @gtauriello

schdaude commented 4 years ago

Hey, sweet parsing! Does anybody know what reference sequences nextstrain uses? I ran the parsing script locally and checked the consistency with the underlying uniprot sequences. For a mutation of the form K2160E I assume that the reference one letter code at position 2160 is K and in that bug it has been mutated to E.

That example comes from the following line in the csv: P0DTD1,K2160E,Hangzhou/ZJU-07/2020,Yao et al,EPI_ISL_416425

However, that location in P0DTD1 is a C. That seems to occur 274 out of 1220 times.

puzzled...

thats the hacky code I'm running btw:

from utils import uniprot

csv_data = open("nextstrain_data.csv", 'r').readlines()[1:]

acs = set()
for line in csv_data:
  acs.add(line.split(',')[0])

sequences = dict()
for ac in acs:
  sequences[ac] = uniprot.seq_from_ac(ac)

for line in csv_data:
  ac = line.split(',')[0]
  mutation = line.split(',')[1]
  orig = mutation[0]
  num = int(mutation[1:len(mutation)-1])
  if sequences[ac][num-1] != orig:
    print(line.strip() + "  uniprot says: " + sequences[ac][num-1])

tomasMasson commented 4 years ago

Maybe i am messing out with the gene product names. I understand that ORF1b refers to poliprotein 1ab (P0DTD1), but I might be wrong. Below is attached the node with the conflict (that's all the data available at Nextstrain: "branch_attrs": { "labels": { "aa": "ORF1b: K2160E" }, "mutations": { "ORF1b": [ "K2160E" ], "nuc": [ "T1405C", "G9802T", "A19945G", "C25267T", "C27615G" ] } }, "name": "Hangzhou/ZJU-07/2020", "node_attrs": { "age": { "value": "0" }, "author": { "author": "Yao et al", "value": "Yao et al" }, "clade_membership": { "value": "unassigned" }, "country": { "confidence": { "China": 1.0 }, "entropy": -1.000088900581841e-12, "value": "China" }, "div": 4.999414825011929, "division": { "value": "Zhejiang" }, "gisaid_epi_isl": { "value": "EPI_ISL_416425" }, "host": { "value": "Human" }, "location": { "value": "Hangzhou" }, "num_date": { "confidence": [ 2020.0915300546449, 2020.0915300546449 ], "value": 2020.0915300546449 }, "originating_lab": { "value": "State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China 310003" }, "recency": { "value": "One month ago" }, "region": { "value": "Asia" }, "sex": { "value": "Male" }, "submitting_lab": { "value": "State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China 310003" }, "url": "https://www.gisaid.org" } }

gtauriello commented 4 years ago

It's good to double-check with the sequences (I had mentioned that in the issue #10 too).

So according to their documentaion they use the GenBank entry MN908947 as reference. According to our own documentation, the MN908947 sequence is identical to the NCBI reference sequence NC_045512. But probably we need to double-check that this is actually the case and that the sequences match the UniProt sequences...

gtauriello commented 4 years ago

So my guess would be that to map ORF1b to P0DTD1 you need an offset which should correspond to the length of the ORF1a stretch (4401 AA (== (13469-266)/3) looking at their ORF1a start/end positions). So would a '+ 4401' work for the mapping? (it does for the K2160 case...)

schdaude commented 4 years ago

4401 seems to be a fantastic number, we're on the right track here... hacking this offset for all identifiers originating from ORF1b reduces the output of the code above to: P0DTD1,C5865Y,USA/NY-NYUMC40/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_419701 uniprot says: Y P0DTC8,S84L,USA/NY-NYUMC40/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_419701 uniprot says: L P0DTD1,V1997A,Netherlands/NoordBrabant_29/2020,Nieuwenhuijse et al,EPI_ISL_414538 uniprot says: A P0DTD1,F3606L,Nanchang/JX176/2020,Li et al,EPI_ISL_421261 uniprot says: L P0DTD1,S765P,USA/NY-NYUMC1/2020,Chen et al,EPI_ISL_414639 uniprot says: P P0DTD1,V739I,Belgium/DV-0324117/2020,Joan Marti-Carreras et al,EPI_ISL_420374 uniprot says: I P0DTD1,F3606L,Belgium/ULG-7019/2020,Keith et al,EPI_ISL_417025 uniprot says: L P0DTC2,P943S,Belgium/BJ-030767/2020,Joan Marti-Carreras et al,EPI_ISL_420323 uniprot says: S P0DTD1,*4379Y,Portugal/PT0090/2020,Guiomar et al et al,EPI_ISL_421493 uniprot says: Y P0DTD1,T2187N,Netherlands/NA_24/2020,Nieuwenhuijse et al,EPI_ISL_415481 uniprot says: N P0DTD1,L4715P,Belgium/SB-030990/2020,Joan Marti-Carreras et al,EPI_ISL_420346 uniprot says: P P0DTC2,G614D,Australia/VIC123/2020,Seemann et al,EPI_ISL_419731 uniprot says: D P0DTC5,M175T,Belgium/BC-03016/2020,Vanmechelen et al,EPI_ISL_415157 uniprot says: T P0DTD1,I265T,Senegal/610/2020,Dia et al,EPI_ISL_420075 uniprot says: T P0DTC3,H57Q,NanChang/JX216/2020,Li jian Xiong et al,EPI_ISL_417420 uniprot says: Q P0DTC2,G614D,NanChang/JX216/2020,Li jian Xiong et al,EPI_ISL_417420 uniprot says: D P0DTC3,H57Q,Belgium/BCM-0324160/2020,Joan Marti-Carreras et al,EPI_ISL_420417 uniprot says: Q P0DTD1,L4715P,USA/NY-NYUMC22/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_418968 uniprot says: P

gtauriello commented 4 years ago

Hmmm. Maybe we need to check with our friends at nextstrain for that...

I noticed a pattern where all those errors seem swapped (i.e. the result of the mutation seems to match the reference sequence). So I checked the "nuc" entry of the mutation (i.e. what changed in the genome) and compared with the reference genome. And alas I saw the same swapped data. So there seems something off with the input data...

I will open an issue on their github...

In the meantime I suppose we can just continue and mark the non-fitting entries somehow in the annotation? We could check if they fit the "reversed" assumption (e.g. wrong "L4715P" should be "P4715L") so that we can track if there are any other inconsistencies....

schdaude commented 4 years ago

Thanks @gtauriello for reporting to the nextstrain people! So to conclude, the mutations they report are "following the branches" of the phylogenetic tree they constructed. As far as I understand we can not only have those back mutations but also mutations x->y where none of x and y match the reference uniprot sequence. As we have a hard time to include and display phylogeny, we need to define what we want to show in the end.

The observed mutation events? That's the information we currently have in the csv. But I'm not sure how relevant this information is without the phylogeny.

All possible amino acids we ever observed at a certain position? That would give some idea of variation.

Perform a separate annotation for each sequenced virus? Rational here is that in the current data we have the mutation relative to its ancestor but not all mutations it picked up during evolution.

Other ideas?

gtauriello commented 4 years ago

Indeed. We definitely need to collect mutations on the same position. I would propose to parametrize the code so that one can have multiple annotations:

Color by the number of distinct amino acids observed at that position (and list them in the annotation)
Color by the number of mutation events observed at that position (and list the events in the annotation)

While it might be interesting to do item 1 above while grouping the annotations by some criteria like country of origin (or a list of countries, e.g. "Europe"), I feel like the philogeny data structure of nextstrain would not be the way to go and we should get in touch with the hackathon topic "pangenome" (and/or "philogeny"). An MSA with all strains seems like the better starting point there...

tomasMasson commented 4 years ago

Thank you for clarifying the issue. I'm in the Phylogeny channel, and we are working in the the sequence retrieval (~ 320 genomes deposited in NCBI) and subsequent alignment. We could use this MSA to extract positions containing mutations.

gtauriello commented 4 years ago

Sorry if my reorga of linking issues to PRs triggered too many emails...

@tomasMasson will the philogeny people only look at those 320? I mean nextstrain has 10x more genomes. Would be a waste to ignore that data. I know that GISAID is not the nicest resource to work with but maybe one can bypass it in some reasonable ways at least to make analyses available. I mean nextstrain seems to be able to do just that. Did you guys look at the the work done at the UCSC Genome Browser?

For the purpose of this issue/PR I would say that we do what we can with the nextstrain data and then move on to other data sources...

schdaude commented 4 years ago

I directly pushed some commits into the master branch from @tomasMasson that implement pretty much the two items described by @gtauriello above.

SWISS-MODEL / covid-19-Annotations-on-Structures

[WIP] Scripts to retrieve data from Nextstrain #24