geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Add available coronavirus data to the pipeline #1431

Closed kltm closed 1 year ago

kltm commented 4 years ago

Add available SARS-CoV-2 data to the pipeline

Tagging @pgaudet @cmungall

Questions:

alexsign commented 4 years ago

@cmungall Hi Chris, I have to rely on established general logic when I'm trying to complete missing data. If I get more details provided by the uniprot curators or automatic annotation team then it will get fixed automatically. I'm getting the data update potentially early next week. Once new GPI/GPAP files are ready I'll let you know.

pmasson55 commented 4 years ago

Hello Chris and Alex, it seems to me that the SARS-CoV-2 proteome is complete. Concerning SARS-CoV (where there is more information, since so far the literature on SARS-CoV-2 consists of only a few structural papers...), here is the UniProt link where you can find all the proteins from the reference strain: https://www.uniprot.org/uniprot/?query=database%3A%28type%3Aembl+AY274119%29&sort=score Hope this helps, tell me if you need more info...

thomaspd commented 4 years ago

Patrick, if I understand correctly, you do not generally separate out the GO annotations of each protein product of a polyprotein. For example, you don't have separate annotations for nsp1, but instead group them together with the other functions of the replicase polyprotein 1ab. Is that right?

alanbridge commented 4 years ago

Hi Paul,

Patrick is on vacation today so I take the liberty of answering on his behalf, correctly I hope.

We should separate out the GO annotations of each protein product of a polyprotein, so we should have separate annotations, like this:

uniprot PRO chain ID A - GO term B
uniprot PRO chain ID C - GO term D

Hope that helps, Alan

redaschi commented 4 years ago

My preference would be to avoid the MGI-style double-barreled delimiter, and have the global ID be simply UniProtKB:PRO_nnnnn (i.e. col2 of the GAF is PRO_nnnnn), and have UniProt resolve https://www.uniprot.org/uniprot/PRO_0000449647

@cmungall I agree that double-barreled delimiters are ugly. If you would not want to store the "parent" UniProt accession number you could use the PURL for the chain, e.g. http://purl.uniprot.org/annotation/PRO_0000449647 is resolved to https://www.uniprot.org/uniprot/P0CW05#PRO_0000449647 (there is again a bug in that it does not jump to the anchor, but at least you end up on the correct entry).

But UniProt curators may like/need to see the "parent" UniProt accession number as well in your editor.

redaschi commented 4 years ago

Thus far we have managed the different protein end products from a GCRP protein using identifiers that use the dash nomenclature, e.g. P0DTC2-n. I would prefer to use these here, but I don't know the uniprot rules for when these get created. Just because these are the products of post-translational cleavage, from a user point of view they are still distinct gene products made by the same gene so I don't see why it should be treated differently.

@cmungall I often wish UniProt had consistent product identifiers, but for historic reasons it has not, and for practical reasons many users like IDs with semantics, like in this case seeing whether the identifier is for an isoform (AC-n) or a proteolytic cleavage product (PRO_n).

cmungall commented 4 years ago

Thanks for the explanation @redaschi, my understanding from previous comments in this thread was that PRO_ns could not be decoupled from the parent uniprot ID. I will follow up on identifiers with a separate email

cmungall commented 4 years ago

@alanbridge / @pmasson55 :

We should separate out the GO annotations of each protein product of a polyprotein, so we should have separate annotations, like this:

uniprot PRO chain ID A - GO term B
uniprot PRO chain ID C - GO term D

This would be great. Would this be possible for automated annotations such as interpro2go as well?

Looking at:

https://www.uniprot.org/uniprot/P0DTD1

There are many PROs in the uniprot entry that are not in the GPI. See https://www.uniprot.org/uniprot/P0DTD1#ptm_processing

We would expect GPI entries for the nsps, for the helicase, the proteinase

We would also like to see annotations at the pp level. It seems in uniprot this is done for textual annotations:

https://www.uniprot.org/uniprot/P0DTD1#function

but not GO

also for subcellular:

https://www.uniprot.org/uniprot/P0DTD1#subcellular_location

but not for GO subcellular.

When we look at the GPA

curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa | grep P0DTD1

We see they are all at the P0DTD1 level and none at the pp level
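The check above can also be done programmatically. The following is a minimal sketch, assuming GPAD 1.1-style tab-separated rows in which cleavage products carry a `PRO_` suffix after the parent accession; the sample rows use object IDs that appear in this thread, but the GO terms, references, and dates are placeholders rather than real annotations.

```python
import re
from collections import Counter

# Illustrative GPAD 1.1-style rows (tab-separated). The object IDs appear in
# this thread; GO terms, references and dates are placeholders, not real data.
SAMPLE_GPAD = "\n".join([
    "!gpa-version: 1.1",
    "UniProtKB\tP0DTD1\tenables\tGO:PLACEHOLDER_1\tPMID:PLACEHOLDER\tECO:PLACEHOLDER\t\t\t20200101\tUniProt\t\t",
    "UniProtKB\tP0DTD1\tinvolved_in\tGO:PLACEHOLDER_2\tPMID:PLACEHOLDER\tECO:PLACEHOLDER\t\t\t20200101\tUniProt\t\t",
    "UniProtKB\tP0DTC2-PRO_0000449647\tenables\tGO:PLACEHOLDER_3\tPMID:PLACEHOLDER\tECO:PLACEHOLDER\t\t\t20200101\tUniProt\t\t",
])


def count_levels(gpad_text):
    """Count annotations per (parent accession, level).

    level is "chain" when column 2 names a PRO cleavage product,
    and "entry" when it is a plain UniProt accession.
    """
    counts = Counter()
    for line in gpad_text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        object_id = line.split("\t")[1]
        # Accept both "ACC-PRO_n" and "ACC:PRO_n" spellings.
        parent = re.split(r"[-:]PRO_", object_id)[0]
        level = "chain" if "PRO_" in object_id else "entry"
        counts[(parent, level)] += 1
    return counts


if __name__ == "__main__":
    for (parent, level), n in sorted(count_levels(SAMPLE_GPAD).items()):
        print(parent, level, n)
```

A parent accession that only ever shows up with level "entry" is one whose annotations have not been pushed down to individual cleavage products.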

redaschi commented 4 years ago

@cmungall UniProt curators do annotate GO at chain level; check out:

curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa | grep P0DTC2-PRO

I don't know whether for some reason they haven't done this yet for P0DTD1 (that question goes to Patrick), or whether this is a release/data sync issue.

Regarding interpro2go, you'd need to ask Rob Finn and Maria Martin, but it would definitely require an additional layer to either UniParc or InterPro or both. InterPro matches are computed on UniParc, which does not contain records for chains, and I'm not convinced that it should, because they are part of the "precursor" sequences (NB: regarding the identifiers discussion, one could argue that this would give us a uniform identifier space for all sequences, but the problem would then be that the identifier would change whenever a curator changes even a single AA, so not really useful for GOA). InterPro stores the match positions of the signatures and if one combines this info with the sequence ranges of the chains, one could determine whether a domain (and the GO terms for it) lies within a specific chain. But this is only really interesting for viruses, where, I believe, the best solution would be that UniProt generates separate entries for the proteolytic cleavage products. UniProt deviates from the 1 gene = 1 entry policy also for other special cases, and for these proteins it would really make a lot of sense.
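The idea of combining signature match positions with chain ranges amounts to an interval-containment test. Here is a minimal sketch, not how InterPro or GOA actually implement it: the two chain ranges are the ones listed for P0DTD1 in this thread, while the signature name and its coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    name: str
    start: int  # 1-based, inclusive, in polyprotein coordinates
    end: int    # 1-based, inclusive


def chains_containing(match: Span, chains: List[Span]) -> List[str]:
    """Return the chains whose range fully contains the signature match."""
    return [c.name for c in chains if c.start <= match.start and match.end <= c.end]


# Chain ranges as quoted in this thread for P0DTD1.
CHAINS = [
    Span("P0DTD1-PRO_0000449624", 3570, 3859),  # Non-structural protein 6
    Span("P0DTD1-PRO_0000449628", 4254, 4392),  # Non-structural protein 10
]

# Hypothetical signature match; the coordinates are made up for illustration.
MATCH = Span("hypothetical_signature", 4260, 4380)

if __name__ == "__main__":
    print(chains_containing(MATCH, CHAINS))
```

A match that straddles a cleavage site is contained in neither chain under this rule, so a real implementation would need a policy for partial overlaps.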

pmasson55 commented 4 years ago

Concerning P0DTD1 (polyprotein), there were no papers worth adding when I looked; there were only a few structural papers. I annotated (chain-specific) for the papers showing the role of ACE2 as receptor for the spike protein of SARS2. I'm currently updating the SARS polyprotein for GO and will update SARS-CoV-2 accordingly (By similarity).

cmungall commented 4 years ago

InterPro stores the match positions of the signatures and if one combines this info with the sequence ranges of the chains, one could determine whether a domain (and the GO terms for it) lies within a specific chain.

Yes, it wouldn't be so hard to do this. Though of course my preference is that this is done upstream of GO!

But this is only really interesting for viruses, where, I believe, the best solution would be that UniProt generates separate entries for the proteolytic cleavage products. UniProt deviates from the 1 gene = 1 entry policy also for other special cases, and for these proteins it would really make a lot of sense

Given everything I have heard in this thread, I think this could certainly make things a lot easier. These poor PRO IDs seem to have a second-class existence that causes a lot of problems; if there were first-class uniprot entries for the cleavage products then a lot of things would just work as expected.

cmungall commented 4 years ago

@pmasson55 - but this shouldn't affect the GPI file. The GPI file produced by Alex should have all possible annotatable entities, regardless of whether they have annotations or not

cmungall commented 4 years ago

@alexsign can you also make files for SARS-CoV? Or we could combine them into one coronavirus file.

alexsign commented 4 years ago

@cmungall do you want to have 16 entries on GPI file for https://www.uniprot.org/uniprot/P0DTD1 ? One for each PRO id regardless of annotations.

I think much better choice is to use UniProt API https://www.ebi.ac.uk/proteins/api/proteins/P0DTD1

cmungall commented 4 years ago

let's discuss today

alexsign commented 4 years ago

@cmungall please take a look at ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi I made additions as we discussed at the meeting. Let me know if you'd like any changes or more info in it.

cmungall commented 4 years ago

thanks Alex!

Unfortunately there are a number of problems here. @kltm we should hold off on doing a new neo load as this will confuse curators. Alex, maybe we can have a staging area for new changes so it doesn't accidentally get loaded?

I thought this new version would only introduce new cleavage products, but there are a lot more plain uniprot entries there now.

previously there was only one entry for the N nucleoprotein:

UniProtKB       P0DTC9  N       Nucleoprotein   N       protein taxon:2697049   

Now there are 7:

UniProtKB       A0A6C0N5E8      N       Nucleoprotein   N       protein taxon:2697049                   
UniProtKB       A0A6C0T6Z7      N       Nucleoprotein   N       protein taxon:2697049                   
UniProtKB       P0DTC9  N       Nucleoprotein   N       protein taxon:2697049   
UniProtKB       A0A679GC99      N       Nucleoprotein   N       protein taxon:2697049                   
UniProtKB       A0A6C0WXA2      N       Nucleoprotein   N       protein taxon:2697049                   
UniProtKB       A0A6B9VLF5      N       Nucleoprotein   N       protein taxon:2697049                   
UniProtKB       A0A6B9VNN9      N       Nucleoprotein   N       protein taxon:2697049                   

I don't think we want any of the A entries. These are confusing to a curator.

But it's good that we have the full set of cleavage products in here. However, we need the value of the 'Symbol' field to uniquely reflect the entry. Here we have 18 entries that all share the same symbol:

UniProtKB       P0DTD1  rep     Replicase polyprotein 1ab       rep|1a-1b       protein taxon:2697049                   
UniProtKB       P0DTD1-PRO_0000449626   rep     Non-structural protein 8        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449621   rep     Non-structural protein 3        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449620   rep     Non-structural protein 2        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449624   rep     Non-structural protein 6        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449629   rep     RNA-directed RNA polymerase     rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449623   rep     3C-like proteinase      rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449618   rep     Replicase polyprotein 1ab       rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449627   rep     Non-structural protein 9        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449625   rep     Non-structural protein 7        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449619   rep     Host translation inhibitor nsp1 rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449630   rep     Helicase        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449628   rep     Non-structural protein 10       rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449633   rep     2'-O-methyltransferase  rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449622   rep     Non-structural protein 4        rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449632   rep     Uridylate-specific endoribonuclease     rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                
UniProtKB       P0DTD1-PRO_0000449631   rep     Proofreading exoribonuclease    rep|1a-1b       protein taxon:2697049   UniProtKB:P0DTD1                

Here, the 2nd row should have 'nsp8' for a symbol, the 3rd row should have 'nsp3' for a symbol, etc.

Again for spike:

UniProtKB       A0A6B9V081      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6C0X2H7      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9UY34      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9UY56      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6C0RQ44      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9XJC0      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9UYI1      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6C0QGH5      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9UZU2      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9UZ41      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9UZ68      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       P0DTC2  S       Spike glycoprotein      S|2     protein taxon:2697049                   
UniProtKB       A0A679G9E9      S       Spike glycoprotein      S       protein taxon:2697049                   
UniProtKB       A0A6C0MB05      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6C0N4V2      S       Surface glycoprotein    S       protein taxon:2697049                   
UniProtKB       A0A6B9WHC1      S       Spike glycoprotein      S       protein taxon:2697049                   
UniProtKB       P0DTC2-PRO_0000449648   S       Spike protein S2        S|2     protein taxon:2697049   UniProtKB:P0DTC2                
UniProtKB       P0DTC2-PRO_0000449646   S       Spike glycoprotein      S|2     protein taxon:2697049   UniProtKB:P0DTC2                
UniProtKB       P0DTC2-PRO_0000449649   S       Spike protein S2'       S|2     protein taxon:2697049   UniProtKB:P0DTC2                
UniProtKB       P0DTC2-PRO_0000449647   S       Spike protein S1        S|2     protein taxon:2697049   UniProtKB:P0DTC2                

We have 20 entries that all have the same symbol S

The A accessions should be removed, and the cleavage products should have unique symbols such as S1, S2, S2'
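The symbol-uniqueness requirement can be checked mechanically. Below is a minimal sketch that assumes tab-separated GPI rows with the object ID in column 2 and the symbol in column 3; the rows are a small subset of the ones quoted above, re-typed with explicit tabs.

```python
from collections import defaultdict

# A few of the rows quoted above, re-typed with explicit tab separators.
GPI_ROWS = [
    "UniProtKB\tP0DTC2\tS\tSpike glycoprotein\tS|2\tprotein\ttaxon:2697049\t\t\t",
    "UniProtKB\tP0DTC2-PRO_0000449647\tS\tSpike protein S1\tS|2\tprotein\ttaxon:2697049\tUniProtKB:P0DTC2\t\t",
    "UniProtKB\tP0DTC2-PRO_0000449648\tS\tSpike protein S2\tS|2\tprotein\ttaxon:2697049\tUniProtKB:P0DTC2\t\t",
    "UniProtKB\tP0DTC9\tN\tNucleoprotein\tN\tprotein\ttaxon:2697049\t\t\t",
]


def shared_symbols(rows):
    """Map each symbol used by more than one entity to the IDs that share it."""
    by_symbol = defaultdict(list)
    for row in rows:
        if not row or row.startswith("!"):
            continue  # skip blank lines and header comments
        cols = row.split("\t")
        by_symbol[cols[2]].append(cols[1])
    return {symbol: ids for symbol, ids in by_symbol.items() if len(ids) > 1}


if __name__ == "__main__":
    for symbol, ids in shared_symbols(GPI_ROWS).items():
        print(symbol, "is shared by", ", ".join(ids))
```

An empty result would mean every annotatable entity in the file can be told apart by symbol alone.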

cmungall commented 4 years ago

It might be informative to look at what PRO have done, can you make your GPI look more like this one Alex:

curl -L -s https://proconsortium.org/download/development/pro_sars2.gpi

For example, here are the entries for S and its cleavage products:

PR      P0DTC2  S (SARS2)       spike glycoprotein (SARS-CoV-2) S (SARS2)|S glycoprotein (SARS2)|peplomer protein (SARS2)|E2 (SARS2)|surface glycoprotein (SARS2)|      protein taxon:2697049              NCBIGene:43740568       
PR      000050266       S/SigPep- (SARS2)       spike glycoprotein, signal peptide removed form (SARS-CoV-2)    S/SigPep- (SARS2)|PRO_0000449646|UniProtKB:P0DTC2, 13-1273      protein    taxon:2697049   PR:P0DTC2       NCBIGene:43740568       
PR      000050267       S1 (SARS2)      spike protein S1 (SARS-CoV-2)   S1 (SARS2)|PRO_0000449647|UniProtKB:P0DTC2, 13-685      protein taxon:2697049   PR:P0DTC2       NCBIGene:43740568  
PR      000050268       S2 (SARS2)      spike protein S2 (SARS-CoV-2)   S2 (SARS2)|PRO_0000449648|UniProtKB:P0DTC2, 686-1273    protein taxon:2697049   PR:P0DTC2       NCBIGene:43740568  
PR      000050269       S2' (SARS2)     spike protein S2' (SARS-CoV-2)  S2' (SARS2)|PRO_0000449649|UniProtKB:P0DTC2, 816-1273   protein taxon:2697049   PR:P0DTC2       NCBIGene:43740568  

At this stage I think it might be more straightforward for us to take the protein ontology GPI, convert the IDs to UniProt entries or cleavage PRO IDs
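One way such a conversion could look is sketched below. It reads the synonym column of a PRO GPI row and builds hyphenated UniProt chain IDs from the `PRO_nnn` and `UniProtKB:ACC, start-end` synonyms. Pairing the two lists by position is an assumption about how the synonyms are ordered, not something the PRO file format guarantees.

```python
import re

CHAIN_RE = re.compile(r"^PRO_\d+$")
PARENT_RE = re.compile(r"^UniProtKB:([A-Z0-9]+), \d+-\d+$")


def uniprot_chain_ids(synonyms):
    """Build 'ACC-PRO_n' IDs from a pipe-separated PRO GPI synonym field.

    Assumption: the i-th PRO_ synonym belongs to the i-th UniProtKB synonym.
    Returns an empty list when the two lists cannot be paired up.
    """
    parts = [p.strip() for p in synonyms.split("|") if p.strip()]
    chains = [p for p in parts if CHAIN_RE.match(p)]
    parents = []
    for part in parts:
        found = PARENT_RE.match(part)
        if found:
            parents.append(found.group(1))
    if len(chains) != len(parents):
        return []
    return [acc + "-" + chain for acc, chain in zip(parents, chains)]


# Synonym field of the S1 row quoted above.
S1_SYNONYMS = "S1 (SARS2)|PRO_0000449647|UniProtKB:P0DTC2, 13-685"

if __name__ == "__main__":
    print(uniprot_chain_ids(S1_SYNONYMS))
```

Rows without any PRO_ synonym, such as the top-level P0DTC2 entry, yield an empty list and would need to be mapped by their own accession instead.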

alexsign commented 4 years ago

@cmungall I removed all "A..." accessions from the file and reposted it. I'll try to implement the rest of the requests ASAP, but it needs to be coordinated with uniprot because I'm using their data to generate the file. Sorry for the delay.

alexsign commented 4 years ago

@cmungall Hi Chris, please check updated GPI file and let me know.

cmungall commented 4 years ago

This is looking a lot better!

Some remaining issues

symbols are still not unique; e.g.

$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi | grep nsp6
UniProtKB       P0DTC1-PRO_0000449640   nsp6    Non-structural protein 6        P0DTC1(3570-3859)       protein taxon:2697049   UniProtKB:P0DTC1                
UniProtKB       P0DTD1-PRO_0000449624   nsp6    Non-structural protein 6        P0DTD1(3570-3859)|rep|1a-1b     protein taxon:2697049   UniProtKB:P0DTD1 

it's not totally clear to me how a curator would choose between these two, they appear to be cleaved from the polyprotein at the same site?

Some of the pps still lack meaningful symbols, e.g

UniProtKB       P0DTC5  P0DTC5  Membrane protein                protein taxon:2697049                   
UniProtKB       P0DTC1  P0DTC1  Replicase polyprotein 1a                protein taxon:2697049                   
UniProtKB       P0DTC8  P0DTC8  Non-structural protein 8                protein taxon:2697049                   

Why not call these M, 1a, and nsp8 as is conventional?

Sometimes the symbol field contains a pipe. It's not clear if your intention is that this is to be interpreted as a separator. The cardinality of this field is 1, so it's just interpreted as a string:

UniProtKB       P0DTC1-PRO_0000449644   GFL|nsp10       Non-structural protein 10       P0DTC1(4254-4392)       protein taxon:2697049   UniProtKB:P0DTC1                
UniProtKB       P0DTD1-PRO_0000449628   GFL|nsp10       Non-structural protein 10       P0DTD1(4254-4392)|rep|1a-1b     protein taxon:2697049   UniProtKB:P0DTD1                

I would have thought nsp10 the natural name, rather than a symbol with an ugly pipe in it?

Again, it's not clear how a curator would decide between these two IDs.

alexsign commented 4 years ago

@cmungall

  1. The symbol is actually unique for the given protein here (P0DTC1 and P0DTD1); the same goes for your example 3. If you look at the UniProt entries for them you can clearly see identical names for both chains: https://www.uniprot.org/uniprot/P0DTC1 https://www.uniprot.org/uniprot/P0DTD1

  2. Totally agree, but if you look at https://www.uniprot.org/uniprot/P0DTC5 you'll see gene name for this is N/A. if that's the case, and I don't have any other alternatives, I have to reuse accession. If I start coming up with my own names, I'm sure I'll get in trouble with UniProt pretty fast ;)

  3. This comes from the UniProt data again:

DE RecName: Full=Non-structural protein 10;
DE Short=nsp10;
DE AltName: Full=Growth factor-like peptide;
DE Short=GFL;

Which one should be prioritised is something probably curators can answer. @pmasson55 we need your expertise on the point raised by @cmungall

bmeldal commented 4 years ago

My 2p, from a "user" database of UniProt:

“A... entries”:

They are not yet public, so I have to make an assumption by their AC format that these are Trembl entries. If they are Trembl ACs they should be handled the same way as any other Trembl ACs are handled where we also have a SP entry.

“Symbols”:

I believe these are the gene names/gene symbols. By convention, they are provided by the respective taxon authority, such as the HGNC for human, and probably imported or manually added by UniProt curators (@pmasson55 ?). I didn’t think UniProt had a field “PRO chain symbol”, they only give a name on the website (which I can see in the GPI). @alexsign , where did you get those logical symbols (like nsp6) from? We (at IntAct) have to add them manually (we enrich most fields in our DB for protein interactors from UniProt) so it would be good to know if we can import them, too.

I also saw the entries with “N/A” as symbol. @alexsign is right, he can’t arbitrarily add something there in the GPI, it has to come from the underlying UniProt entry. I guess those can be added by UniProt curators (@pmasson55 ?). Do you need a Helpdesk ticket for the entries with missing symbols ;-)

Finally, there are 2 replicase polyproteins in each SARS sequence, R1a and R1ab. They code for the same proteins except for nsp11 and nsp12, which are only found in one ORF, respectively. It’s because the ribosome has a tendency to slip in the nsp11 sequence range resulting in a 1aa frameshift and 2 different products.

Not sure if I've been helpful ;-)

Birgit

cmungall commented 4 years ago

Very helpful @bmeldal !

Tackling the "duplicate" issue first. So the fundamental issue here is that the uniprot datamodel forces each cleavage product to have a single parent. You can't have a single nsp1 shared by the two polyproteins.

IMHO this design decision is akin to saying a protein has a single transcript as parent.

But I assume it's hard to fix this. So the question is how does a curator choose which nsp1 or nsp2 etc to use? I think whether it comes from 1a or 1ab is irrelevant the majority of the time?

Do they annotate both?

Do we pick one as 'canonical/reference'? E.g the one from the longer/shorter pp?

I like what @nataled has done in PRO(tein ontology), we have a single entry for each nsp1-10, and these map to two UniProt-PRO IDs:

PR      000050279       rep/Clv:nsp10 (SARS2)   non-structural protein 10 (SARS-CoV-2)  rep/Clv:nsp10 (SARS2)|growth factor-like peptide (SARS2)|GFL (SARS2)|nsp10 (SARS2)|PRO_0000449644|PRO_0000449628|UniProtKB:P0DTC1, 4254-4392|UniProtKB:P0DTD1, 4254-4392   protein taxon:2697049           NCBIGene:43740578       

bmeldal commented 4 years ago

Very helpful @bmeldal !

You are welcome.

Tackling the "duplicate" issue first. So the fundamental issue here is that the uniprot datamodel forces each cleavage product to have a single parent. You can't have a single nsp1 shared by the two polyproteins.

Correct. My guess is that this is a rare case - maybe restricted to viruses (I don't know enough viral genomes in detail to generalise this slippage phenomenon).

IMHO this design decision is akin to saying a protein has a single transcript as parent.

Well, the UniProt model "pretends" that each chain of an identical "pair" has a single, unique transcript when in fact it comes from the same transcript. We have plenty of inverse cases where we have identical proteins coded by different genes with different UniProt entries (they cause us a different problem ;-) ). Ideally, I think, we would have all PRO chains for the replicase transcripts in the same canonical entry. I don't know how the decision was made to create 2 UniProt entries for what is just one gene product. (In other cases, they merge such entries into one...)

Biology is bloody difficult to express logically!

But I assume it's hard to fix this. So the question is how does a curator choose which nsp1 or nsp2 etc to use? I think whether it comes from 1a or 1ab is irrelevant the majority of the time?

Do they annotate both?

Do we pick one as 'canonical/reference'? E.g the one from the longer/shorter pp?

I think Uniprot, as the reference resource, annotate to both entries where applicable.

We have to make a systematic decision. In IntAct, we decided to mainly annotate to the long product (R1ab) as then we can capture all but one PRO chain (nsp11) under one canonical entry. It's obviously not ideal but if we annotated to both entries where appropriate (nsp1-10) we would duplicate all these interactions. So far, I have not seen complexes involving nsp11 so all Complex Portal entries are to the long form.

I think PDBe have used the same system of annotating to the long form where possible.

I like what @nataled has done in PRO(tein ontology), we have a single entry for each nsp1-10, and these map to two UniProt-PRO IDs:

PR      000050279       rep/Clv:nsp10 (SARS2)   non-structural protein 10 (SARS-CoV-2)  rep/Clv:nsp10 (SARS2)|growth factor-like peptide (SARS2)|GFL (SARS2)|nsp10 (SARS2)|PRO_0000449644|PRO_0000449628|UniProtKB:P0DTC1, 4254-4392|UniProtKB:P0DTD1, 4254-4392   protein taxon:2697049           NCBIGene:43740578       

It works for PRO ontology because they are a proteoform-centric ontology and not a gene product-centric encyclopaedia. How long do we have to discuss the merits of either approach ;-)

pmasson55 commented 4 years ago

Hi Chris, Alex and Birgit,

So, first point: the two polyproteins. This is indeed an unusual case that concerns some viruses. They tend to use ribosomal frameshifting in order to make a few replication-related proteins. We have decided to annotate both forms the same way (for the chains that are identical) and put the publications in both entries, in SwissProt. Concerning conventional GO, we also annotated both entries the same way with the publications, for SARS and SARS2, so each polyprotein (R1A and R1AB) has the same info for the identical chains. The idea is that if someone looks at one of the two entries, they should have access to all the corresponding information. To resolve this issue, we plan in UniProt/SwissProt to split the polyproteins in order to have one accession number for one cleavage product. It's an ongoing project that takes time since it concerns all SwissProt entries, not only viruses... Now for the GO-CAM, I would only use one of the entries, probably the longest, R1AB, which possesses all the replication proteins...

Now the second point concerning the gene names, it should be fixed by us if the information is not present: I will go through all entries for SARS and SARS-2 and make sure they all have a proper gene name. For example, the membrane protein P0DTC5 should have M as gene name. I'll fix that.

Concerning point 3, we have:

DE RecName: Full=Non-structural protein 10;
DE Short=nsp10;
DE AltName: Full=Growth factor-like peptide;
DE Short=GFL;

that gives

UniProtKB P0DTC1-PRO_0000449644 GFL|nsp10

It seems that it took both short names with a weird symbol in between. If it's not too complicated I would just use the first short name, which in that case is nsp10.

Hope that was clear,
Patrick
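The "first short name" rule suggested here could be sketched as follows. This is only an illustration of the rule, not the code that generates the GPI file: it assumes the DE lines for a single chain have already been isolated from the flat file, and it simply returns the first `Short=` value it meets.

```python
# DE lines for one chain, as quoted in this thread (layout simplified).
DE_BLOCK = """\
DE   RecName: Full=Non-structural protein 10;
DE            Short=nsp10;
DE   AltName: Full=Growth factor-like peptide;
DE            Short=GFL;
"""


def first_short_name(de_block):
    """Return the first Short= value in a block of DE lines, or None."""
    for line in de_block.splitlines():
        body = line[2:].strip() if line.startswith("DE") else line.strip()
        # Drop an optional category prefix such as "RecName:" or "AltName:".
        for prefix in ("RecName:", "AltName:"):
            if body.startswith(prefix):
                body = body[len(prefix):].strip()
        if body.startswith("Short="):
            value = body[len("Short="):]
            # Strip any trailing evidence tag in braces and the final ";".
            return value.split("{")[0].strip().rstrip(";").strip()
    return None


if __name__ == "__main__":
    print(first_short_name(DE_BLOCK))
```

Because the recommended name comes first in a DE block, this picks the RecName short name when there is one and falls back to an AltName short name otherwise.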

bmeldal commented 4 years ago

Thanks, Patrick.

we plan in UniProt/SwissProt to split the polyproteins in order to have one accession number for one cleavage product.

Does that mean that the R1a and R1ab entries get demerged and each nsp PRO chain gets one unique, canonical entry? Happy days for us detangling it again! Please give us a heads up when this happens ;-)

Now the second point concerning the gene names, it should be fixed by us if the information is not present: I will go through all entries

Thank you!

DE Short=nsp10;

I forgot about this line as it doesn't appear on the website. I saw it in the flat file that you released in April - and forgot again once I could use the website...

I agree, just use the first short name.

Caveat: the nsp-style short name is not always the first/recommended short name, sometimes it's the alternative short name or even an alternative FULL name (see: nsp12-nsp16 for P0DTD1). Makes it a bit confusing as the nsp-style is very easy to read and remember for human users. But I digress...

Viruses are fickle things...

redaschi commented 4 years ago

Hi Birgit, the demerge of the polyproteins is a big piece of work for which we have no timeline yet, but rest assured that IntAct will be among the first to hear about it ;-) The polyprotein itself will keep its CHAIN annotations, so IntAct could transition to ACs when convenient. We will link the entries somehow (e.g. add to each FT CHAIN an xref to the AC that describes that protein in detail - the way we link has not been discussed yet, this is just one possibility).

kltm commented 4 years ago

@alexsign Apologies for the long and confusing thread (we should probably start splitting things out of here). I just wanted to follow up on https://github.com/geneontology/go-site/issues/1431#issuecomment-611799993 Would it be possible to get the GAF (ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf) available as a .gz, like the other data products we get from you?

cmungall commented 4 years ago

Since this ticket has morphed from an issue to a (v useful) repository of information and discussion about IDs, I wanted to point out via @chris-grove that with the Alliance we have made a BGI file for SARS-CoV-2:

http://tazendra.caltech.edu/~azurebrd/var/work/chris/coronavirus_biogrid.json

this is "gene" centric, and has single entries for nsps etc.

With my Alliance hat on, we want to be able to project GO annotations from whatever GO chooses as the annotation unit. This will be unreliable if we don't have 1:1 mappings. E.g. if we do the conventional thing of mapping by UniProt accession then annotations from one cleavage product/"gene" will transfer to others on the same pp.
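The failure mode described here can be shown with a toy example. In this sketch the two chain IDs are the polymerase and helicase products of P0DTD1 listed earlier in the thread; the gene keys `nsp12` and `nsp13` and the GO term strings are hypothetical placeholders, and neither function reflects actual Alliance code.

```python
# Placeholder annotations on two cleavage products of the same polyprotein.
ANNOTATIONS = [
    ("P0DTD1-PRO_0000449629", "GO:PLACEHOLDER_POLYMERASE"),
    ("P0DTD1-PRO_0000449630", "GO:PLACEHOLDER_HELICASE"),
]

# Hypothetical 1:1 mapping from "gene"-centric entries to cleavage products.
GENE_TO_CHAIN = {
    "nsp12": "P0DTD1-PRO_0000449629",
    "nsp13": "P0DTD1-PRO_0000449630",
}


def parent_accession(object_id):
    """Strip the -PRO_n suffix, leaving the parent UniProt accession."""
    return object_id.split("-PRO_")[0]


def project_by_chain(annotations, gene_to_chain):
    """Project annotations using the full chain ID (1:1 mapping)."""
    return {
        gene: [term for obj, term in annotations if obj == chain]
        for gene, chain in gene_to_chain.items()
    }


def project_by_accession(annotations, gene_to_chain):
    """Project annotations using only the parent accession (lossy mapping)."""
    return {
        gene: [
            term
            for obj, term in annotations
            if parent_accession(obj) == parent_accession(chain)
        ]
        for gene, chain in gene_to_chain.items()
    }


if __name__ == "__main__":
    print(project_by_chain(ANNOTATIONS, GENE_TO_CHAIN))
    print(project_by_accession(ANNOTATIONS, GENE_TO_CHAIN))
```

With the accession-only mapping both entries receive both terms, which is exactly the cross-transfer the 1:1 mapping avoids.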

alexsign commented 4 years ago

@kltm The following files are available now:

ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi.gz
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa.gz
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf.gz

kltm commented 4 years ago

@alexsign Great--thank you! I'll try to get these bolted in and start testing quickly.

pmasson55 commented 4 years ago

@alexsign, @cmungall Hi, so here is the list of unique identifiers that can be used for Noctua, as we discussed earlier.

FROM R1AB_SARS2 (P0DTD1):

nsp1 P0DTD1:PRO_0000449619
nsp2 P0DTD1:PRO_0000449620
nsp3 P0DTD1:PRO_0000449621
nsp4 P0DTD1:PRO_0000449622
nsp5 P0DTD1:PRO_0000449623
nsp6 P0DTD1:PRO_0000449624
nsp7 P0DTD1:PRO_0000449625
nsp8 P0DTD1:PRO_0000449626
nsp9 P0DTD1:PRO_0000449627
nsp10 P0DTD1:PRO_0000449628
nsp12 (Pol) P0DTD1:PRO_0000449629
nsp13 (Hel) P0DTD1:PRO_0000449630
nsp14 (exoN) P0DTD1:PRO_0000449631
nsp15 P0DTD1:PRO_0000449632
nsp16 P0DTD1:PRO_0000449633

and FROM R1A_SARS2 (P0DTC1), unique:

nsp11 P0DTC1:PRO_0000449645

That should solve the duplicate problem concerning the polyprotein issue.
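The selection rule behind this list (take the chain from the long polyprotein P0DTD1 when both polyproteins provide it, otherwise keep the only one available) could be sketched like this. The IDs are taken from this thread and written with the hyphen separator used in the GPI file; the input is a simplified list of (ID, symbol) pairs rather than a real GPI parse.

```python
PREFERRED_PARENT = "P0DTD1"  # the long polyprotein, R1AB

# (object ID, symbol) pairs, using IDs that appear in this thread.
CANDIDATES = [
    ("P0DTC1-PRO_0000449640", "nsp6"),
    ("P0DTD1-PRO_0000449624", "nsp6"),
    ("P0DTC1-PRO_0000449644", "nsp10"),
    ("P0DTD1-PRO_0000449628", "nsp10"),
    ("P0DTC1-PRO_0000449645", "nsp11"),  # only present on the short polyprotein
]


def pick_reference(candidates, preferred=PREFERRED_PARENT):
    """Choose one object ID per symbol, preferring the given parent accession."""
    chosen = {}
    for object_id, symbol in candidates:
        parent = object_id.split("-PRO_")[0]
        current = chosen.get(symbol)
        if current is None:
            chosen[symbol] = object_id
        elif parent == preferred and not current.startswith(preferred + "-"):
            chosen[symbol] = object_id  # upgrade to the preferred parent
    return chosen


if __name__ == "__main__":
    for symbol, object_id in pick_reference(CANDIDATES).items():
        print(symbol, object_id)
```

Grouping by symbol only works once the symbols themselves are clean, so this step depends on the short-name fixes discussed above.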

bmeldal commented 4 years ago

I thought we were going to use hyphens as separators, e.g. P0DTD1-PRO_0000449633?

cmungall commented 4 years ago

We are :-) this list is just for Alex to select which member of the duplicate pair to select

I would prefer this done by a field in uniprot, e.g. is_reference field, rather than a secret file, and this is coordinated with what you use, but one step at a time!

alexsign commented 4 years ago

@cmungall @pmasson55 Can you guys please take a look at this: https://www.uniprot.org/uniprot/P0DTD1.txt because this is the data I have access to, and let me know how I can derive the names above without literally typing the data into the GPI file.

pmasson55 commented 4 years ago

@alexsign, @cmungall Hello, I'll try to answer your question, Alex, hoping I understood correctly. From the entries, the best way would be to use the short name from the DE (description) lines to match it with the FT chain ID. Now I see that the P0DTD1 entry should be fixed to be able to do that. For example, nsp12 is the polymerase and we have two short names (both are fine actually):

DE RecName: Full=RNA-directed RNA polymerase;
DE Short=Pol;
DE Short=RdRp;

In that case I could pass the second one as an alternative name if that helps; then we would have pol instead of nsp12, which is fine since it's a polymerase... For nsp13 and nsp14, the short recnames are hel and ExoN, which are also fine, since they are the helicase and the exonuclease. Having them like that in Noctua is fine... The only issue would be nsp15 and nsp16, which have no short names, but I can easily add them (they are now designated as AltName: Full=nsp15). Alex, would that be a good and simple way for you to retrieve them?

cmungall commented 4 years ago

That would be great if you could add them

Will we also get nspX as synonyms? I note we are missing a lot of these

alexsign commented 4 years ago

@pmasson55 I had a chat with the UniProt production team. I should be able to pull any changes incorporated into next week's SwissProt freeze into the GPI file.

pmasson55 commented 4 years ago

@alexsign, @cmungall

In fact, the entry you showed us (P0DTD1.txt) corresponds to the old version of R1AB_SARS2. To fetch the latest version, go to https://www.uniprot.org/ and follow the red coronavirus link at the top right of the page. Here is the link: https://covid-19.uniprot.org/uniprotkb/P0DTD1 There, all the nsps are mentioned as short names. Tell me if that is enough for you to make your files. Thx and have a nice weekend.

bmeldal commented 4 years ago

FYI: We had a helpdesk message today from a user asking for resolvable PRO chain IDs. I replied that it's on its way but that we have no ETA yet. Please let me know when you think the UniProt site will resolve them.

redaschi commented 4 years ago

What do you mean by "resolvable PRO chain IDs"? Can you pls copy/paste the exact user question? Thanks!

bmeldal commented 4 years ago

Hi @redaschi

I was trying to find the part of the discussion where we debated

P0DTD1:PRO_0000449619 vs P0DTD1#PRO_0000449619 vs P0DTD1-PRO_0000449619

... can't find it, must have been in the email thread...

But what we decided was that P0DTD1-PRO_0000449619 should be a resolvable ID when people search in UniProt. It's what IntAct have been using for >15 years (ask Sandra for the history). But if you hit the website with it, it either finds nothing or finds only the canonical entry.

The user comment is:

"IntAct uses the following type of Ids/URLs: http://www.uniprot.org/uniprot/P0DTC1-PRO_0000449645

This is not a valid UniProt URL.

The part after the "-" is ignored, e.g., you can put anything there: http://www.uniprot.org/uniprot/P0DTC1-PRO_0000123456

A slightly better, but still unsatisfactory solution is to use this kind of link, which includes an anchor to the protein chain: http://www.uniprot.org/uniprot/P0DTC1#PRO_0000449645

Can you please work with UniProt to provide proper accession numbers and URLs so we can use these URLs in automated workflows?"

I was only trying to provide everyone with more info: users do need these IDs to be searchable :)

redaschi commented 4 years ago

hi birgit, i appreciate your informing us of user requests :) the user correctly points out that the URL he found at IntAct does not work because it is not a valid UniProt URL. The IntAct website should use the URLs with an anchor (where i hope a developer will take mercy on me and finally fix that bug). unfortunately, the user does not explain why that URL is 'unsatisfactory' for her workflow. could you direct her to the uniprot helpdesk, please? thanks! nicole

cmungall commented 4 years ago

Can we get some guidance on the appropriate style of prefixed identifier to use here? In this ticket we've gone around having bare chain IDs (e.g. UniProtKB:PRO_nnnnn), double-barreled IDs with every combination of hash, dash, colon...

redaschi commented 4 years ago

hi chris, i requested a new CURIE prefix for uniprot chain identifiers at identifiers.org. they are currently processing it. hopefully you can soon use, e.g. "uniprot.chain:PRO_0000016681" and they'll resolve it to http://purl.uniprot.org/annotation/PRO_0000016681, which is the uniprot URI for chains, and uniprot.org resolves that to the correct web page (nb: the anchor problem seems to be a browser issue). i hope that helps.
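A minimal sketch of the CURIE-to-PURL expansion described above. The prefix map is a local assumption that mirrors the PURL pattern quoted in this comment (the identifiers.org registration was still being processed), and `expand_curie` is an illustrative name, not an existing API.

```python
# Local prefix-to-URI map. The uniprot.chain entry follows the PURL pattern
# quoted above; it is an assumption here, not a lookup against identifiers.org.
PREFIX_MAP = {
    "uniprot.chain": "http://purl.uniprot.org/annotation/",
}


def expand_curie(curie):
    """Expand 'prefix:local_id' into a URI using PREFIX_MAP."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id or prefix not in PREFIX_MAP:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return PREFIX_MAP[prefix] + local_id


uri = expand_curie("uniprot.chain:PRO_0000016681")
```

An unknown prefix raises `ValueError` rather than guessing a URL, so a registry gap is visible instead of silently producing a broken link.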

cmungall commented 4 years ago

OK, this is useful, it seemed earlier that we needed to keep the chain ID affixed to the parent accession, e.g. UniProtKB:Pnnnn-PRO_nnnn, but I prefer not to have composite IDs, so this is good.

It would be good to sync any changes with IntAct so that we all refer to these in the same way

Note in GO we use prefixed IDs of the form DB:LocalID. For example, UniProtKB:P08069

It seems that URL resolution will not work if we use UniProtKB:PRO_0000016681, so we'll have to add a new prefix to our registry. For consistency with identifiers.org we could just go with uniprot.chain as the prefix (conventionally we capitalize, and use underscores rather than dots, but there is nothing preventing us going with uniprot.chain). So the prefixed ID would be uniprot.chain:PRO_0000016681. Alex, would this work on your end?

alexsign commented 4 years ago

@redaschi @cmungall Actually, I don't like it at all. It will be inconsistent with the other UniProtKB prefixes in https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml (search for UniProtKB). It will also require special handling on both sides of the GO annotation pipeline (import/export) and a change to the curator tool Protein2GO, which is not going to be easy due to lack of resources. If the main concern is the link back to the website, then a simple replacement of P12345-PRO with P12345#PRO solves it, does it not? Plus, UniProtKB IDs for isoforms like P12345-2 already have "-" in the annotation files. This is certainly not my decision to make. But, at the very least, for consistency I would suggest using a UniProtKB-PRO or UniProtKB-CHAIN prefix.

Another thing to consider is whether the UniProt PRO IDs will stay unique.
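A minimal sketch of the dash-to-hash replacement suggested above, assuming the anchor URL form quoted earlier in this thread. The accession pattern is deliberately loose and `anchor_url` is an illustrative name. Matching on "-PRO_" rather than on the first dash keeps isoform IDs such as P12345-2 from being misread as chain IDs.

```python
import re

# Accession (optionally with an isoform suffix such as "-2"), then "-PRO_"
# followed by digits. The accession part is intentionally not strict.
COMPOSITE_RE = re.compile(r"^(?P<ac>[A-Z0-9]+(?:-\d+)?)-(?P<chain>PRO_\d+)$")


def anchor_url(composite_id, base="http://www.uniprot.org/uniprot/"):
    """Turn 'AC-PRO_nnn' into 'base + AC#PRO_nnn'."""
    m = COMPOSITE_RE.match(composite_id)
    if m is None:
        raise ValueError(f"not an AC-PRO composite ID: {composite_id!r}")
    return f"{base}{m['ac']}#{m['chain']}"


url = anchor_url("P0DTC1-PRO_0000449645")
```

A plain isoform ID without a chain part is rejected instead of being rewritten.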

redaschi commented 4 years ago

hi alex, chris,

i had requested the CURIE "uniprot.chain:PRO_..." from identifiers.org because i thought it could be of general use (seeing that they already had "uniprot.isoform") and that it may help to solve your problem. if it does not, the GOA project can of course make its own CURIE/xref, with e.g. a UniProtKB-Chain prefix (more similar to the prefix style you seem to have at http://amigo.geneontology.org/xrefs). you just have to take care of the resolution yourselves. as alex pointed out, it is not difficult to build a valid URL (or uniprot PURL) from AC-PRO. what you cannot do is go back from the uniprot PURL to AC-PRO.

the uniprot PRO "uniqueness" is a very interesting question: they are currently unique in the sense that each PRO is only in one uniprot entry. but as i have learned in this thread from patrick, curators (have to) assign 2 different PRO to the same protein when they create different entries for the (precursor) products of ribosomal frameshifting. this makes no sense and leads to the problem that external dbs like GOA and IntAct do not know which of the 2 PROs they should annotate. so ideally these protein chains should have only one PRO, but that means that we would have the same PRO in 2 different uniprot entries, i.e. as long as you annotate to AC-PRO instead of PRO, your problem remains the same. of course, ideally i want to make separate uniprot entries for each chain of a viral polyprotein and then you could annotate to the AC instead. but meanwhile, we'll have to find a way to muddle through with what we got.

cmungall commented 4 years ago

I'll leave it to you to decide what the best form of ID is - I just want it to be consistent, within GO, and preferably with other databases like IntAct!

Note that currently the GPI is using a dash:

$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi | grep nsp3
UniProtKB       P0DTC1-PRO_0000449637   PL-PRO|nsp3     Non-structural protein 3        P0DTC1(819-2763)        protein taxon:2697049   UniProtKB:P0DTC1                
UniProtKB       P0DTD1-PRO_0000449621   PL-PRO|nsp3     Non-structural protein 3        P0DTD1(819-2763)|rep|1a-1b      protein taxon:2697049   UniProtKB:P0DTD1     

but the GAF is using a colon:

$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf | grep PRO_0000449637
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0004197      GO_REF:0000024  ISS     UniProtKB:P0C6U8:PRO_0000338257 F       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0004197      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 F       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0019785      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 F       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039502      GO_REF:0000024  ISS     UniProtKB:P0C6U8:PRO_0000338257 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039502      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039547      GO_REF:0000024  ISS     UniProtKB:P0C6U8:PRO_0000338257 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039548      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039579      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039644      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039714      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 C       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0039722      GO_REF:0000024  ISS     UniProtKB:P0C6U8:PRO_0000338257 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0070536      GO_REF:0000024  ISS     UniProtKB:P0C6U8:PRO_0000338257 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt
UniProtKB       P0DTC1  P0DTC1:PRO_0000449637           GO:0071108      GO_REF:0000024  ISS     UniProtKB:P0C6X7:PRO_0000037311 P       Replicase polyprotein 1a                protein taxon:2697049   20200506        UniProt

note also the GAF is missing the symbols (nsp3, etc)
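A minimal sketch of one way to detect that the GPI and GAF refer to the same chain despite the different separators, assuming the composite IDs have been pulled out of the rows shown above. The regular expression accepts the three separators discussed in this thread (dash, colon, hash); `normalize_chain_id` is an illustrative name and the accession pattern is intentionally loose.

```python
import re

# Parent accession, one of the separators seen in this thread, then the chain ID.
CHAIN_ID_RE = re.compile(r"^(?P<ac>[A-Z0-9]+)[-:#](?P<chain>PRO_\d+)$")


def normalize_chain_id(composite_id, sep="-"):
    """Rewrite a composite chain ID so that it uses a single separator."""
    m = CHAIN_ID_RE.match(composite_id)
    if m is None:
        raise ValueError(f"unrecognized composite ID: {composite_id!r}")
    return f"{m['ac']}{sep}{m['chain']}"


gpi_id = "P0DTC1-PRO_0000449637"  # as written in the GPI output above
gaf_id = "P0DTC1:PRO_0000449637"  # as written in the GAF output above

same_entity = normalize_chain_id(gpi_id) == normalize_chain_id(gaf_id)
```

Normalizing before comparison lets a consistency check pass while the separator question is still open; it does not replace agreeing on one form in the files themselves.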