SWISS-MODEL / covid-19-Annotations-on-Structures

Mapping sequence data onto structures for the Covid-19 Biohackathon April 2020
https://github.com/virtual-biohackathons/covid-19-bh20/wiki/Annotations-on-Structures
MIT License

Extract annotations from IntAct / ComplexPortal #25

Open gtauriello opened 4 years ago

gtauriello commented 4 years ago

The two EBI resources IntAct and ComplexPortal contain curated data on experimentally observed interactions between proteins.

On the EBI webpage you can find links to query IntAct online or to download the IntAct data in PSI-MI TAB format here: [ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psi25/datasets/Coronavirus.zip].

Notes:

Also: Birgit Meldal from the IntAct / ComplexPortal team is available in the Slack channel for questions and I will update this comment if we get new input and links that can be of general use.

bmeldal commented 4 years ago

I'm here!

Just a small, political, correction: there are 10 members of the IMEx consortium that curate into IntAct. MINT and IntAct itself are just 2 of them. E.g., DIP have also contributed many SARS publications in this last month.

And yes, we have also decided to annotate to the longer polyprotein in SARS-CoV and SARS-CoV-2 (e.g. R1AB, P0DTD1) except for the small protein nsp11 that is only translated from the short polyprotein of SARS-CoV-2 (R1a, P0DTC1). The long polyprotein codes for nsp12 at the ribosomal slippage site.

For Complex Portal you can find the data via our organism page: https://www.ebi.ac.uk/complexportal/complex/organisms. It also has a WS JSON endpoint, but only via queries for individual ACs. Alternatively, download the whole species file in XML via: ftp://ftp.ebi.ac.uk/pub/databases/intact/complex/current/psi30/

Any questions, please ask! Slack ID is the same as GitHub.

gtauriello commented 4 years ago

@all-contributors please add @bmeldal for ideas, content

allcontributors[bot] commented 4 years ago

@gtauriello

I've put up a pull request to add @bmeldal! :tada:

bmeldal commented 4 years ago

Thank you!

D-Barradas commented 4 years ago

@gtauriello so the annotations are only for the virus proteins, right?

bmeldal commented 4 years ago

IntAct & ComplexPortal have both, virus and human proteins. Not sure if that was your question, though ;-)

gtauriello commented 4 years ago

@D-Barradas also unsure about the question.

Personally, I would start by looking at all interactions returned in the query above (or the download) and extract any positional data you can find. The query should restrict it to coronavirus-relevant interactions. The annotation system works for any UniProtKB AC and not just the virus proteins. So you can safely have annotations mapped e.g. on structures for the human proteins involved in those interactions...

D-Barradas commented 4 years ago

Hi @gtauriello @bmeldal: sorry for the cryptic question, basically you have answered it. I have already uploaded my annotations, and in the process I found that the server does not like the PRO_ identifiers. This was the warning: Couldn't find P0DTD1-PRO_0000449623 by UniProt AC or MD5.

gtauriello commented 4 years ago

Yes, for the polyproteins you will need to do some extra mapping. Assuming you have a position within P0DTD1-PRO... (or P0DTC1-PRO...), you need to proceed as follows:

  1. Extract the start/end of those PRO_... chains from the UniProt entries for P0DTD1 and P0DTC1. You can do this either manually or by parsing the UniProt files for the "FT CHAIN" entries.
  2. You should be able to use the start in UniProt to offset your data.

As an example: say you have position 10 in P0DTD1-PRO_0000449623. From UniProt you see that PRO_0000449623 covers positions 3264-3569. That means that pos. 10 in P0DTD1-PRO_0000449623 corresponds to pos. 3273 in P0DTD1.
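The two steps and the worked example above can be sketched in Python as follows. This is only an illustration: the function names are made up here, and the sample text assumes the current UniProt flat-file layout for chain features (a `FT   CHAIN` line with `start..end`, followed by an `/id="PRO_..."` qualifier line). Only the range 3264-3569 for PRO_0000449623 is taken from the discussion.

```python
import re

# Assumed flat-file layout; the /note value is a placeholder, not real data.
SAMPLE_FT = '''\
FT   CHAIN           3264..3569
FT                   /note="illustrative entry"
FT                   /id="PRO_0000449623"
'''


def parse_chains(flat_text):
    """Return {chain_id: (start, end)} for every FT CHAIN feature found."""
    chains = {}
    current = None  # (start, end) of the CHAIN feature being read, if any
    for line in flat_text.splitlines():
        m = re.match(r"FT   CHAIN\s+(\d+)\.\.(\d+)", line)
        if m:
            current = (int(m.group(1)), int(m.group(2)))
            continue
        # A non-space character right after "FT   " starts another feature.
        if line.startswith("FT   ") and len(line) > 5 and line[5] != " ":
            current = None
            continue
        m = re.match(r'FT\s+/id="(PRO_\d+)"', line)
        if m and current is not None:
            chains[m.group(1)] = current
    return chains


def to_polyprotein(identifier, pos, chains):
    """Map a 1-indexed position in 'ACC-PRO_...' to a position in ACC."""
    acc, _, chain_id = identifier.partition("-")
    if not chain_id.startswith("PRO_"):
        return identifier, pos  # not a chain identifier, leave untouched
    start, end = chains[chain_id]
    mapped = start + pos - 1  # both coordinate systems are 1-indexed
    if not start <= mapped <= end:
        raise ValueError(f"position {pos} is outside {chain_id}")
    return acc, mapped


chains = parse_chains(SAMPLE_FT)
print(to_polyprotein("P0DTD1-PRO_0000449623", 10, chains))
```

The `- 1` is there because position 1 of the chain is the chain's start residue itself, which is why position 10 lands on 3273 and not 3274.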

Also, any position that you find in P0DTC1 should be mapped to P0DTD1 as long as it's not in "Non-structural protein 11" (i.e. not at position >= 4393 of P0DTC1). Technically you could also duplicate all those annotations, but it's easier to have them just once...
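A minimal sketch of that rule, assuming (as the comment implies) that positions below 4393 are numbered identically in both polyproteins. The function name and constant are hypothetical.

```python
NSP11_START = 4393  # first nsp11 position in P0DTC1, per the comment above


def merge_onto_long_polyprotein(acc, pos):
    """Move P0DTC1 annotations onto P0DTD1 unless they fall in nsp11."""
    if acc == "P0DTC1" and pos < NSP11_START:
        # Assumption: numbering upstream of nsp11 is shared by both entries.
        return "P0DTD1", pos
    return acc, pos


print(merge_onto_long_polyprotein("P0DTC1", 3273))
print(merge_onto_long_polyprotein("P0DTC1", 4395))
```

The second call keeps the annotation on P0DTC1 because it lies inside nsp11.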

@bmeldal I am assuming above that your positions are 1-indexed: i.e. that the first AA of a protein is at position "1" and not "0". Is that correct?

bmeldal commented 4 years ago

Morning,

Yes, that is all correct! It's a shame that UniProt doesn't allow the PRO-chain search by default but @gtauriello 's workaround is correct. And yes, chain positions are 1-indexed. We should only have used P0DTD1 except for nsp11.

gtauriello commented 4 years ago

A nice example is here (thx @D-Barradas for pointing me to it). I quickly turned it manually into an annotation (see project link here):

P0DTC2,481,487,#FF0000,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation disrupting strength (p.Asn481_Asn487delinsThrProProAlaLeuAsn)
P0DTC2,493,493,#00FF00,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation decreasing strength (p.Gln493Asn)
P0DTC2,493,493,#00FF00,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation decreasing strength (p.Gln493Tyr)
P0DTC2,501,501,#FF0000,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation disrupting strength (p.Asn501Thr)
Q9BYF1,18,633,#0000FF,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,sufficient to bind (ecd)

I will make sure that on our side we can nicely display annotations on both subunits of heteromers (currently you can see either ACE2 or spike annotations but not both at the same time).

Having a script that scans IntAct to extract a csv like above automatically (with some clever coloring logic) would be a really useful addition.
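As a rough idea of what the CSV-writing half of such a script could look like, here is a sketch that turns already-extracted features into rows of the format shown above. The color mapping simply mirrors the manual example; the fallback color, the function names, and the assumption that feature type, range and interaction AC are already available are all made up for illustration.

```python
import csv
import io

# Colors copied from the manual example above.
COLORS = {
    "mutation disrupting strength": "#FF0000",
    "mutation decreasing strength": "#00FF00",
    "sufficient to bind": "#0000FF",
}
DEFAULT_COLOR = "#808080"  # hypothetical fallback for other feature types


def annotation_row(acc, start, end, feature_type, label, interaction_ac):
    """Build one annotation row: AC, start, end, color, reference URL, text."""
    url = f"https://www.ebi.ac.uk/intact/interaction/{interaction_ac}"
    color = COLORS.get(feature_type, DEFAULT_COLOR)
    return [acc, start, end, color, url, f"{feature_type} ({label})"]


def to_csv(rows):
    """Serialize rows to CSV text, quoting fields only when needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = [
    annotation_row("P0DTC2", 493, 493, "mutation decreasing strength",
                   "p.Gln493Asn", "EBI-25496287"),
    annotation_row("Q9BYF1", 18, 633, "sufficient to bind",
                   "ecd", "EBI-25496287"),
]
print(to_csv(rows), end="")
```

Using the `csv` module rather than joining with commas keeps the output valid if a feature description ever contains a comma.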

gtauriello commented 4 years ago

As a starting point, here are some files (thx @D-Barradas): Archive.zip

It contains:

Still TODO:

gtauriello commented 4 years ago

So we ended up writing another script to extract PPIs between SARS-CoV-2 and human proteins from IntAct. The script is loosely based on the one above and is attached here: PPI-IntAct.zip

The result of it is a dedicated page on our server listing the structural coverage for all those interaction partners: https://swissmodel.expasy.org/repository/species/2697049/interactions
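For readers without the attachment, the general idea of such an extraction can be sketched as below. This is not the attached script. It assumes the standard 15-column PSI-MI TAB 2.5 layout (interactor IDs in columns 1 and 2, taxon IDs in columns 10 and 11), and the sample row is a stripped-down illustration rather than a real IntAct record.

```python
import re

SARS_COV_2 = "2697049"  # NCBI taxon ID, as in the repository URL above
HUMAN = "9606"          # NCBI taxon ID for Homo sapiens

# Illustrative 15-column row; "-" marks fields left empty in this sketch.
SAMPLE = "\t".join([
    "uniprotkb:P0DTC2", "uniprotkb:Q9BYF1", "-", "-", "-", "-", "-", "-", "-",
    "taxid:2697049", "taxid:9606", "-", "-", "intact:EBI-25496287", "-",
])


def uniprot_ac(field):
    """Return the UniProtKB identifier from an interactor ID field, if any."""
    m = re.search(r"uniprotkb:([A-Za-z0-9_-]+)", field)
    return m.group(1) if m else None


def taxids(field):
    """Return all taxon IDs listed in a taxid field."""
    return set(re.findall(r"taxid:(-?\d+)", field))


def virus_host_pairs(mitab_text):
    """Return sorted (virus AC, human AC) pairs found in MITAB text."""
    pairs = set()
    for line in mitab_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 15:
            continue  # skip header lines and malformed rows
        ac_a, ac_b = uniprot_ac(cols[0]), uniprot_ac(cols[1])
        if ac_a is None or ac_b is None:
            continue  # at least one interactor has no UniProtKB ID
        tax_a, tax_b = taxids(cols[9]), taxids(cols[10])
        if SARS_COV_2 in tax_a and HUMAN in tax_b:
            pairs.add((ac_a, ac_b))
        elif HUMAN in tax_a and SARS_COV_2 in tax_b:
            pairs.add((ac_b, ac_a))
    return sorted(pairs)


print(virus_host_pairs(SAMPLE))
```

Identifiers of the form P0DTD1-PRO_... are returned as they appear, so they would still need the chain-to-polyprotein mapping discussed earlier in this thread.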

bmeldal commented 4 years ago

There's a typo on https://swissmodel.expasy.org/repository/species/2697049

"IntAct lists interactions derived from literature curation or direct user submissions. We extracted those interactions and list the ones between SARS-CoV-2 and human host proteins with their structural coverage in a decicated interaction page." should read dedicated

Freudian slip??? I know the data is not yet saturated... ;-)

Great work!

Please remember to cite IntAct in any resulting manuscripts.

bmeldal commented 4 years ago

Feature suggestion:

On the interactions page: https://swissmodel.expasy.org/repository/species/2697049/interactions

Allow the user to collapse the list for a given protein again without having to open another one. When the list is long (e.g. spike) it becomes difficult to navigate the page.

gtauriello commented 4 years ago

Oops good point with the typo. I must have been thirsty when I wrote that... ;-) The list gets collapsed as soon as you choose another one but we can add the feature. Doesn't hurt...