Closed kltm closed 1 year ago
@cmungall Hi Chris, I have to rely on established general logic when I'm trying to complete missing data. If I get more details from the UniProt curators or the automatic annotation team then it will get fixed automatically. I'm potentially getting the data update early next week. Once the new GPI/GPAD files are ready I'll let you know.
Hello Chris and Alex, It seems to me that the SARS-COV2 proteome is complete. Concerning SARS-COV (where there is more information, since so far the literature on SARS_COV2 consists of only a few structural papers...), here is the UniProt link where you can find all the proteins from the reference strain: https://www.uniprot.org/uniprot/?query=database%3A%28type%3Aembl+AY274119%29&sort=score Hope this helps, tell me if you need more info...
Patrick, if I understand correctly, you do not generally separate out the GO annotations of each protein product of a polyprotein. For example, you don't have separate annotations for nsp1, but instead group them together with the other functions of the replicase polyprotein 1ab. Is that right?
Hi Paul,
Patrick is on vacation today so I take the liberty of answering on his behalf, correctly I hope.
We should separate out the GO annotations of each protein product of a polyprotein, so we should have separate annotations, like this:
uniprot PRO chain ID A - GO term B
uniprot PRO chain ID C - GO term D
Hope that helps, Alan
My preference would be to avoid the MGI-style double-barreled delimiter and have the global ID be simply UniProtKB:PRO_nnnnn (i.e. col2 of the GAF is PRO_nnnnn), and to have UniProt resolve https://www.uniprot.org/uniprot/PRO_0000449647
@cmungall I agree that double-barreled delimiters are ugly. If you would not want to store the "parent" UniProt accession number you could use the PURL for the chain, e.g. http://purl.uniprot.org/annotation/PRO_0000449647 is resolved to https://www.uniprot.org/uniprot/P0CW05#PRO_0000449647 (there is again a bug in that it does not jump to the anchor, but at least you end up on the correct entry).
But UniProt curators may like/need to see the "parent" UniProt accession number as well in your editor.
Thus far we have managed the different protein end products from a GCRP protein using identifiers that use the dash nomenclature, e.g. P0DTC2-n. I would prefer to use these here, but I don't know the uniprot rules for when these get created. Just because these are the products of post-translational cleavage, from a user point of view they are still distinct gene products made by the same gene so I don't see why it should be treated differently.
@cmungall I often wish UniProt had consistent product identifiers, but for historic reasons it has not, and for practical reasons many users like IDs with semantics, like in this case seeing whether the identifier is for an isoform (AC-n) or a proteolytic cleavage product (PRO_n).
Thanks for the explanation @redaschi, my understanding from previous comments in this thread was that PRO_ns could not be decoupled from the parent uniprot ID. I will follow up on identifiers with a separate email
@alanbridge / @pmasson55 :
We should separate out the GO annotations of each protein product of a polyprotein, so we should have separate annotations, like this:
uniprot PRO chain ID A - GO term B
uniprot PRO chain ID C - GO term D
This would be great. Would this be possible for automated annotations such as interpro2go as well?
Looking at:
https://www.uniprot.org/uniprot/P0DTD1
There are many PROs in the uniprot entry that are not in the GPI. See https://www.uniprot.org/uniprot/P0DTD1#ptm_processing
We would expect GPI entries for the nsps, for the helicase, the proteinase.
We would also like to see annotations at the pp level. It seems in uniprot this is done for textual annotations:
https://www.uniprot.org/uniprot/P0DTD1#function
but not GO
also for subcellular:
https://www.uniprot.org/uniprot/P0DTD1#subcellular_location
but not for GO subcellular.
When we look at the GPA
curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa | grep P0DTD1
We see they are all at the P0DTD1 level and none at the pp level
@cmungall UniProt curators do annotate GO terms at chain level, check out
curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa | grep P0DTC2-PRO
I don't know whether for some reason they haven't done so yet for P0DTD1 (that question goes to Patrick), or whether this is a release/data sync issue.
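The two grep calls above amount to asking how many annotations sit on the whole entry versus on a cleavage chain. A minimal Python sketch of that tally, assuming tab-separated GPAD lines with the object ID in column 2 and the hyphenated `ACCESSION-PRO_nnnnnnnnnn` form implied by the grep pattern. The sample rows (GO IDs, references, evidence codes) are invented for illustration and are not taken from the real file:

```python
from collections import Counter

def count_annotation_levels(gpad_lines):
    """Count annotations per parent accession, split into entry level vs chain level."""
    counts = Counter()
    for line in gpad_lines:
        if not line.strip() or line.startswith("!"):  # skip blanks and header comments
            continue
        cols = line.rstrip("\n").split("\t")
        object_id = cols[1]
        # "P0DTC2-PRO_0000449647" -> parent "P0DTC2"; a bare accession has no "-PRO_" part
        parent, sep, _chain = object_id.partition("-PRO_")
        level = "chain" if sep else "entry"
        counts[(parent, level)] += 1
    return counts

# Hypothetical rows, for illustration only
sample = [
    "!gpa-version: 1.1",
    "UniProtKB\tP0DTD1\tenables\tGO:0003723\tPMID:0000000\tECO:0000256",
    "UniProtKB\tP0DTC2-PRO_0000449647\tenables\tGO:0046789\tPMID:0000000\tECO:0000314",
]
counts = count_annotation_levels(sample)
print(counts)
```

Run against the real GPA, a parent with only `entry` counts would be one where nothing has been annotated at chain level yet.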
Regarding interpro2go, you'd need to ask Rob Finn and Maria Martin, but it would definitely require an additional layer to either UniParc or InterPro or both. InterPro matches are computed on UniParc, which does not contain records for chains, and I'm not convinced that it should, because they are part of the "precursor" sequences (NB: regarding the identifiers discussion, one could argue that this would give us a uniform identifier space for all sequences, but the problem would then be that the identifier would change whenever a curator changes even a single AA, so not really useful for GOA). InterPro stores the match positions of the signatures and if one combines this info with the sequence ranges of the chains, one could determine whether a domain (and the GO terms for it) lies within a specific chain. But this is only really interesting for viruses, where, I believe, the best solution would be that UniProt generates separate entries for the proteolytic cleavage products. UniProt deviates from the 1 gene = 1 entry policy also for other special cases, and for these proteins it would really make a lot of sense.
Concerning P0DTD1 (polyprotein), there were no papers worth adding when I looked, only a few structural papers. I annotated (chain specific) for the papers showing the role of ACE2 as receptor for the spike protein of SARS2. I'm currently updating the SARS polyprotein for GO and will update SARS-COV2 accordingly (By similarity).
InterPro stores the match positions of the signatures and if one combines this info with the sequence ranges of the chains, one could determine whether a domain (and the GO terms for it) lies within a specific chain.
Yes, it wouldn't be so hard to do this. Though of course my preference is that this is done upstream of GO!
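The range check described here is plain interval containment. A minimal sketch, assuming 1-based inclusive coordinates on the precursor sequence; the spike chain ranges are the ones listed in the PRO GPI rows quoted later in this thread, while the signature match is hypothetical:

```python
from typing import NamedTuple

class Span(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

def chains_containing(match: Span, chains: list[Span]) -> list[str]:
    """Return the names of chains whose range fully contains the signature match."""
    return [c.name for c in chains if c.start <= match.start and match.end <= c.end]

chains = [
    Span("P0DTC2-PRO_0000449646", 13, 1273),   # signal peptide removed form
    Span("P0DTC2-PRO_0000449647", 13, 685),    # S1
    Span("P0DTC2-PRO_0000449648", 686, 1273),  # S2
    Span("P0DTC2-PRO_0000449649", 816, 1273),  # S2'
]
match = Span("hypothetical_signature", 700, 900)  # invented match positions
hits = chains_containing(match, chains)
print(hits)
```

Because chains can nest, one match may fall inside several chains; whether its GO terms should go to all of them or only to the smallest one is a policy decision this sketch does not make.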
But this is only really interesting for viruses, where, I believe, the best solution would be that UniProt generates separate entries for the proteolytic cleavage products. UniProt deviates from the 1 gene = 1 entry policy also for other special cases, and for these proteins it would really make a lot of sense
Given everything I have heard in this thread, I think this could certainly make things a lot easier. These poor PRO IDs seem to have a second-class existence that causes a lot of problems; if there were first-class UniProt entries for the cleavage products then a lot of things would just work as expected.
@pmasson55 - but this shouldn't affect the GPI file. The GPI file produced by Alex should have all possible annotatable entities, regardless of whether they have annotations or not
@alexsign can you also make files for SARS-CoV? Or we could combine them into one coronavirus file
@cmungall do you want to have 16 entries on GPI file for https://www.uniprot.org/uniprot/P0DTD1 ? One for each PRO id regardless of annotations.
I think a much better choice is to use the UniProt API https://www.ebi.ac.uk/proteins/api/proteins/P0DTD1
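For reference, a rough sketch of what pulling the chains from that endpoint could look like. The JSON field names used here (`accession`, `features`, `type`, `ftId`, `description`) are an assumption about the response shape and should be checked against a live response; the `example` dict is a hand-built stand-in so the sketch runs offline:

```python
import json
import urllib.request

API_URL = "https://www.ebi.ac.uk/proteins/api/proteins/{accession}"

def fetch_entry(accession):
    """Fetch one entry as JSON (needs network access; not called below)."""
    req = urllib.request.Request(
        API_URL.format(accession=accession),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def chain_ids(entry):
    """List (hyphenated chain ID, description) for every CHAIN feature of an entry."""
    accession = entry["accession"]
    return [
        (f"{accession}-{feat['ftId']}", feat.get("description", ""))
        for feat in entry.get("features", [])
        if feat.get("type") == "CHAIN" and "ftId" in feat
    ]

# Hand-built stand-in for a response (field names assumed, see above)
example = {
    "accession": "P0DTD1",
    "features": [
        {"type": "CHAIN", "ftId": "PRO_0000449619",
         "description": "Host translation inhibitor nsp1"},
        {"type": "DOMAIN", "description": "hypothetical non-chain feature"},
    ],
}
print(chain_ids(example))
```

This only lists what an entry offers; it does not replace a GPI file as the agreed set of annotatable entities.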
let's discuss today
@cmungall please take a look at ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi I made additions as we discussed at the meeting. Let me know if you'd like any changes or more info in it.
thanks Alex!
Unfortunately there are a number of problems here. @kltm we should hold off on doing a new neo load as this will confuse curators. Alex, maybe we can have a staging area for new changes so they don't accidentally get loaded?
I thought this new version would only introduce new cleavage products, but there are a lot more plain uniprot entries there now.
previously there was only one entry for the N nucleoprotein:
UniProtKB P0DTC9 N Nucleoprotein N protein taxon:2697049
Now there are 7:
UniProtKB A0A6C0N5E8 N Nucleoprotein N protein taxon:2697049
UniProtKB A0A6C0T6Z7 N Nucleoprotein N protein taxon:2697049
UniProtKB P0DTC9 N Nucleoprotein N protein taxon:2697049
UniProtKB A0A679GC99 N Nucleoprotein N protein taxon:2697049
UniProtKB A0A6C0WXA2 N Nucleoprotein N protein taxon:2697049
UniProtKB A0A6B9VLF5 N Nucleoprotein N protein taxon:2697049
UniProtKB A0A6B9VNN9 N Nucleoprotein N protein taxon:2697049
I don't think we want any of the A entries. These are confusing to a curator.
But it's good that we have the full set of cleavage products in here. However, we need the value of the 'Symbol' field to uniquely reflect the entry. Here we have 18 entries that all share the same symbol:
UniProtKB P0DTD1 rep Replicase polyprotein 1ab rep|1a-1b protein taxon:2697049
UniProtKB P0DTD1-PRO_0000449626 rep Non-structural protein 8 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449621 rep Non-structural protein 3 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449620 rep Non-structural protein 2 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449624 rep Non-structural protein 6 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449629 rep RNA-directed RNA polymerase rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449623 rep 3C-like proteinase rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449618 rep Replicase polyprotein 1ab rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449627 rep Non-structural protein 9 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449625 rep Non-structural protein 7 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449619 rep Host translation inhibitor nsp1 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449630 rep Helicase rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449628 rep Non-structural protein 10 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449633 rep 2'-O-methyltransferase rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449622 rep Non-structural protein 4 rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449632 rep Uridylate-specific endoribonuclease rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
UniProtKB P0DTD1-PRO_0000449631 rep Proofreading exoribonuclease rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
Here, the 2nd row should have 'nsp8' for a symbol, the 3rd row should have 'nsp3' for a symbol, etc.
Again for spike:
UniProtKB A0A6B9V081 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6C0X2H7 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9UY34 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9UY56 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6C0RQ44 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9XJC0 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9UYI1 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6C0QGH5 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9UZU2 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9UZ41 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9UZ68 S Surface glycoprotein S protein taxon:2697049
UniProtKB P0DTC2 S Spike glycoprotein S|2 protein taxon:2697049
UniProtKB A0A679G9E9 S Spike glycoprotein S protein taxon:2697049
UniProtKB A0A6C0MB05 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6C0N4V2 S Surface glycoprotein S protein taxon:2697049
UniProtKB A0A6B9WHC1 S Spike glycoprotein S protein taxon:2697049
UniProtKB P0DTC2-PRO_0000449648 S Spike protein S2 S|2 protein taxon:2697049 UniProtKB:P0DTC2
UniProtKB P0DTC2-PRO_0000449646 S Spike glycoprotein S|2 protein taxon:2697049 UniProtKB:P0DTC2
UniProtKB P0DTC2-PRO_0000449649 S Spike protein S2' S|2 protein taxon:2697049 UniProtKB:P0DTC2
UniProtKB P0DTC2-PRO_0000449647 S Spike protein S1 S|2 protein taxon:2697049 UniProtKB:P0DTC2
We have 20 entries that all have the same symbol S
The A accessions should be removed, and the cleavage products should have unique symbols such as S1, S2, S2'
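A quick way to check the symbol-uniqueness requirement is to group IDs by symbol and report any symbol used more than once. A minimal sketch, assuming a tab-separated GPI with the object ID in column 2 and the symbol in column 3; the rows are taken from the listing above with tabs restored under that assumption:

```python
from collections import defaultdict

def duplicate_symbols(gpi_lines):
    """Map each symbol used by more than one entry to the IDs that share it."""
    by_symbol = defaultdict(list)
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):  # skip blanks and header comments
            continue
        cols = line.rstrip("\n").split("\t")
        object_id, symbol = cols[1], cols[2]
        by_symbol[symbol].append(object_id)
    return {s: ids for s, ids in by_symbol.items() if len(ids) > 1}

rows = [
    "UniProtKB\tP0DTC2\tS\tSpike glycoprotein\tS|2\tprotein\ttaxon:2697049\t",
    "UniProtKB\tP0DTC2-PRO_0000449647\tS\tSpike protein S1\tS|2\tprotein\ttaxon:2697049\tUniProtKB:P0DTC2",
    "UniProtKB\tP0DTC9\tN\tNucleoprotein\tN\tprotein\ttaxon:2697049\t",
]
dups = duplicate_symbols(rows)
print(dups)
```

An empty result would mean every symbol identifies exactly one entry.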
It might be informative to look at what PRO have done, can you make your GPI look more like this one Alex:
curl -L -s https://proconsortium.org/download/development/pro_sars2.gpi
For example, here are the entries for S and its cleavage products:
PR P0DTC2 S (SARS2) spike glycoprotein (SARS-CoV-2) S (SARS2)|S glycoprotein (SARS2)|peplomer protein (SARS2)|E2 (SARS2)|surface glycoprotein (SARS2)| protein taxon:2697049 NCBIGene:43740568
PR 000050266 S/SigPep- (SARS2) spike glycoprotein, signal peptide removed form (SARS-CoV-2) S/SigPep- (SARS2)|PRO_0000449646|UniProtKB:P0DTC2, 13-1273 protein taxon:2697049 PR:P0DTC2 NCBIGene:43740568
PR 000050267 S1 (SARS2) spike protein S1 (SARS-CoV-2) S1 (SARS2)|PRO_0000449647|UniProtKB:P0DTC2, 13-685 protein taxon:2697049 PR:P0DTC2 NCBIGene:43740568
PR 000050268 S2 (SARS2) spike protein S2 (SARS-CoV-2) S2 (SARS2)|PRO_0000449648|UniProtKB:P0DTC2, 686-1273 protein taxon:2697049 PR:P0DTC2 NCBIGene:43740568
PR 000050269 S2' (SARS2) spike protein S2' (SARS-CoV-2) S2' (SARS2)|PRO_0000449649|UniProtKB:P0DTC2, 816-1273 protein taxon:2697049 PR:P0DTC2 NCBIGene:43740568
At this stage I think it might be more straightforward for us to take the protein ontology GPI, convert the IDs to UniProt entries or cleavage PRO IDs
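A sketch of that conversion, working only from the synonym column of the PRO GPI rows shown above. It assumes the `PRO_nnnnnnnnnn` synonyms and the `UniProtKB:ACC, start-end` synonyms appear in matching order, which holds for the rows quoted here but is not a documented guarantee:

```python
import re

CHAIN_RE = re.compile(r"^PRO_\d{10}$")
PARENT_RE = re.compile(r"^UniProtKB:([A-Z0-9]+), (\d+)-(\d+)$")

def uniprot_chain_ids(synonym_field):
    """Build ACCESSION-PRO_... IDs from the pipe-separated synonym column of a PRO GPI row."""
    chain_ids, parents = [], []
    for syn in synonym_field.split("|"):
        syn = syn.strip()
        if CHAIN_RE.match(syn):
            chain_ids.append(syn)
            continue
        m = PARENT_RE.match(syn)
        if m:
            parents.append(m.group(1))
    if len(chain_ids) != len(parents):
        # Pairing by position only makes sense when the counts agree
        raise ValueError(f"cannot pair chains with parents in {synonym_field!r}")
    return [f"{parent}-{chain}" for parent, chain in zip(parents, chain_ids)]

s1_synonyms = "S1 (SARS2)|PRO_0000449647|UniProtKB:P0DTC2, 13-685"
print(uniprot_chain_ids(s1_synonyms))
```

Rows without any `PRO_` synonym, such as the full-length spike entry, return an empty list and would need the plain accession instead.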
@cmungall I removed all "A..." accessions from the file and reposted it. I'll try to implement the rest of the requests ASAP, but it needs to be coordinated with UniProt because I'm using their data to generate the file. Sorry for the delay.
@cmungall Hi Chris, please check updated GPI file and let me know.
This is looking a lot better!
Some remaining issues
symbols are still not unique; e.g.
$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi | grep nsp6
UniProtKB P0DTC1-PRO_0000449640 nsp6 Non-structural protein 6 P0DTC1(3570-3859) protein taxon:2697049 UniProtKB:P0DTC1
UniProtKB P0DTD1-PRO_0000449624 nsp6 Non-structural protein 6 P0DTD1(3570-3859)|rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
it's not totally clear to me how a curator would choose between these two, they appear to be cleaved from the polyprotein at the same site?
Some of the pps still lack meaningful symbols, e.g
UniProtKB P0DTC5 P0DTC5 Membrane protein protein taxon:2697049
UniProtKB P0DTC1 P0DTC1 Replicase polyprotein 1a protein taxon:2697049
UniProtKB P0DTC8 P0DTC8 Non-structural protein 8 protein taxon:2697049
Why not call these M, 1a, and nsp8 as is conventional?
Sometimes the symbol field contains a pipe. It's not clear if your intention is that this is to be interpreted as a separator. The cardinality of this field is 1, so it's just interpreted as a string:
UniProtKB P0DTC1-PRO_0000449644 GFL|nsp10 Non-structural protein 10 P0DTC1(4254-4392) protein taxon:2697049 UniProtKB:P0DTC1
UniProtKB P0DTD1-PRO_0000449628 GFL|nsp10 Non-structural protein 10 P0DTD1(4254-4392)|rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
I would have thought nsp10 the natural name, rather than a symbol with an ugly pipe in it?
Again, it's not clear how a curator would decide between these two IDs.
@cmungall
The symbol is actually unique for the given protein here (P0DTC1 and P0DTD1); the same goes for your example 3. If you look at the UniProt entries for them you can clearly see identical names for both chains: https://www.uniprot.org/uniprot/P0DTC1 https://www.uniprot.org/uniprot/P0DTD1
Totally agree, but if you look at https://www.uniprot.org/uniprot/P0DTC5 you'll see the gene name for this is N/A. If that's the case, and I don't have any other alternatives, I have to reuse the accession. If I start coming up with my own names, I'm sure I'll get in trouble with UniProt pretty fast ;)
This comes from the UniProt data again:
DE RecName: Full=Non-structural protein 10;
DE Short=nsp10;
DE AltName: Full=Growth factor-like peptide;
DE Short=GFL;
Which one should be prioritised is something probably curators can answer. @pmasson55 we need your expertise on the point raised by @cmungall
My 2p, from a "user" database of UniProt:
“A... entries”:
They are not yet public, so I have to make an assumption by their AC format that these are Trembl entries. If they are Trembl ACs they should be handled the same way as any other Trembl ACs are handled where we also have a SP entry.
“Symbols”:
I believe these are the gene names/gene symbols. By convention, they are provided by the respective taxon authority, such as the HGNC for human, and probably imported or manually added by UniProt curators (@pmasson55 ?). I didn’t think UniProt had a field “PRO chain symbol”, they only give a name on the website (which I can see in the GPI). @alexsign , where did you get those logical symbols (like nsp6) from? We (at IntAct) have to add them manually (we enrich most fields in our DB for protein interactors from UniProt) so it would be good to know if we can import them, too.
I also saw the entries with “N/A” as symbol. @alexsign is right, he can’t arbitrarily add something there in the GPI, it has to come from the underlying UniProt entry. I guess those can be added by UniProt curators (@pmasson55 ?). Do you need a Helpdesk ticket for the entries with missing symbols ;-)
Finally, there are 2 replicase polyproteins in each SARS sequence, R1a and R1ab. They code for the same proteins except for nsp11 and nsp12, which are only found in one ORF, respectively. It’s because the ribosome has a tendency to slip in the nsp11 sequence range resulting in a 1aa frameshift and 2 different products.
Not sure if I've been helpful ;-)
Birgit
Very helpful @bmeldal !
Tackling the "duplicate" issue first. So the fundamental issue here is that the uniprot datamodel forces each cleavage product to have a single parent. You can't have a single nsp1 shared by the two polyproteins.
IMHO this design decision is akin to saying a protein has a single transcript as parent.
But I assume it's hard to fix this. So the question is how does a curator choose which nsp1 or nsp2 etc to use? I think whether it comes from 1a or 1ab is irrelevant the majority of the time?
Do they annotate both?
Do we pick one as 'canonical/reference'? E.g the one from the longer/shorter pp?
I like what @nataled has done in PRO(tein ontology), we have a single entry for each nsp1-10, and these map to two UniProt-PRO IDs:
PR 000050279 rep/Clv:nsp10 (SARS2) non-structural protein 10 (SARS-CoV-2) rep/Clv:nsp10 (SARS2)|growth factor-like peptide (SARS2)|GFL (SARS2)|nsp10 (SARS2)|PRO_0000449644|PRO_0000449628|UniProtKB:P0DTC1, 4254-4392|UniProtKB:P0DTD1, 4254-4392 protein taxon:2697049 NCBIGene:43740578
Very helpful @bmeldal !
You are welcome.
Tackling the "duplicate" issue first. So the fundamental issue here is that the uniprot datamodel forces each cleavage product to have a single parent. You can't have a single nsp1 shared by the two polyproteins.
Correct. My guess is that this is a rare case - maybe restricted to viruses (I don't know enough viral genomes in detail to generalise this slippage phenomenon).
IMHO this design decision is akin to saying a protein has a single transcript as parent.
Well, the UniProt model "pretends" that each chain of an identical "pair" has a single, unique transcript when in fact it comes from the same transcript. We have plenty of inverse cases where we have identical proteins coded by different genes with different UniProt entries (they cause us a different problem ;-) ). Ideally, I think! we would have all PRO chains for the replicase transcripts in the same canonical entry. I don't know how the decision was made to create 2 UniProt entries for what is just one gene product. (In other cases, they merge such entries into one...)
Biology is bloody difficult to express logically!
But I assume it's hard to fix this. So the question is how does a curator choose which nsp1 or nsp2 etc to use? I think whether it comes from 1a or 1ab is irrelevant the majority of the time?
Do they annotate both?
Do we pick one as 'canonical/reference'? E.g the one from the longer/shorter pp?
I think Uniprot, as the reference resource, annotate to both entries where applicable.
We have to make a systematic decision. In IntAct, we decided to mainly annotate to the long product (R1ab) as then we can capture all but one PRO chain (nsp11) under one canonical entry. It's obviously not ideal but if we annotated to both entries where appropriate (nsp1-10) we would duplicate all these interactions. So far, I have not seen complexes involving nsp11 so all Complex Portal entries are to the long form.
I think PDBe have used the same system of annotating to the long form where possible.
I like what @nataled has done in PRO(tein ontology), we have a single entry for each nsp1-10, and these map to two UniProt-PRO IDs:
PR 000050279 rep/Clv:nsp10 (SARS2) non-structural protein 10 (SARS-CoV-2) rep/Clv:nsp10 (SARS2)|growth factor-like peptide (SARS2)|GFL (SARS2)|nsp10 (SARS2)|PRO_0000449644|PRO_0000449628|UniProtKB:P0DTC1, 4254-4392|UniProtKB:P0DTD1, 4254-4392 protein taxon:2697049 NCBIGene:43740578
It works for PRO ontology because they are a proteoform-centric ontology and not a gene product-centric encyclopaedia. How long do we have to discuss the merits of either approach ;-)
Hi Chris, Alex and Birgit,
So, first point: the two polyproteins. This is indeed an unusual case, concerning some viruses. They tend to use ribosomal frameshifting in order to make a few replication-related proteins. We have decided to annotate both forms the same way (for the chains that are identical) and put the publications in both entries, in SwissProt. Concerning conventional GO, we also annotated both entries the same way with the publications, for SARS and SARS2, so each polyprotein (R1A and R1AB) has the same info for the identical chains. The idea is that if someone looks at one of the two entries, they should have access to all the corresponding information. To resolve this issue, we plan in UniProt/SwissProt to split the polyproteins in order to have one accession number for one cleavage product. It's an ongoing project that takes time since it concerns all SwissProt entries, not only viruses... Now for the GO-CAM, I would only use one of the entries, probably the longest, R1AB, which possesses all the replication proteins...
Now the second point concerning the gene names, it should be fixed by us if the information is not present: I will go through all entries for SARS and SARS-2 and make sure they all have a proper gene name. For example, the membrane protein P0DTC5 should have M as gene name. I'll fix that.
Concerning point 3, we have:
DE RecName: Full=Non-structural protein 10;
DE Short=nsp10;
DE AltName: Full=Growth factor-like peptide;
DE Short=GFL;
which gives
UniProtKB P0DTC1-PRO_0000449644 GFL|nsp10
It seems that it took both short names with a weird symbol in between. If it's not too complicated I would just use the first short name, which in that case is nsp10.
Hope that was clear,
Patrick
Thanks, Patrick.
we plan in UniProt/SwissProt to split the polyproteins in order to have one accession number for one cleavage product.
Does that mean that the R1a and R1ab entries get demerged and each nsp PRO chain gets one unique, canonical entry? Happy days for us detangling it again! Please give us a heads up when this happens ;-)
Now the second point concerning the gene names, it should be fixed by us if the information is not present: I will go through all entries
Thank you!
DE Short=nsp10;
I forgot about this line as it doesn't appear on the website. I saw it in the flat file that you released in April - and forgot again once I could use the website...
I agree, just use the first short name.
Caveat: the nsp-style short name is not always the first/recommended short name, sometimes it's the alternative short name or even an alternative FULL name (see: nsp12-nsp16 for P0DTD1). Makes it a bit confusing as the nsp-style is very easy to read and remember for human users. But I digress...
Viruses are fickle things...
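To make the "first short name, but watch for nsp-style names elsewhere" rule concrete, here is a minimal sketch that works on the DE lines of one chain. The `pick_symbol` helper and its `prefer_nsp` switch are hypothetical, and the nsp15 lines are reconstructed from names mentioned in this thread rather than copied from the flat file:

```python
import re

NAME_RE = re.compile(r"(Full|Short)=([^;{]+)")   # stops at ';' or an evidence tag '{'
NSP_RE = re.compile(r"^nsp\d+$", re.IGNORECASE)

def pick_symbol(de_lines, prefer_nsp=True):
    """Pick a symbol for one chain from its DE lines.

    With prefer_nsp, any nsp-style name wins, whether it is a Short or a Full name;
    otherwise (or if none exists) the first Short name is used. Returns None if
    neither is available.
    """
    names = [(kind, value.strip())
             for line in de_lines
             for kind, value in NAME_RE.findall(line)]
    if prefer_nsp:
        for _kind, value in names:
            if NSP_RE.match(value):
                return value
    for kind, value in names:
        if kind == "Short":
            return value
    return None

nsp10_de = [
    "DE RecName: Full=Non-structural protein 10;",
    "DE Short=nsp10;",
    "DE AltName: Full=Growth factor-like peptide;",
    "DE Short=GFL;",
]
nsp15_de = [  # reconstructed for illustration
    "DE RecName: Full=Uridylate-specific endoribonuclease;",
    "DE AltName: Full=nsp15;",
]
print(pick_symbol(nsp10_de), pick_symbol(nsp15_de))
```

Either rule yields nsp10 rather than GFL|nsp10 for the first block; only the nsp-preferring rule finds anything for the second.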
Hi Birgit, the demerge of the polyproteins is a big piece of work for which we have no timeline yet, but rest assured that IntAct will be among the first to hear about it ;-) The polyprotein itself will keep its CHAIN annotations, so IntAct could transition to ACs when convenient. We will link the entries somehow (e.g. add to each FT CHAIN an xref to the AC that describes that protein in detail - the way we link has not been discussed yet, this is just one possibility).
@alexsign Apologies for the long and confusing thread (we should probably start splitting things out of here). I just wanted to follow up on https://github.com/geneontology/go-site/issues/1431#issuecomment-611799993 Would it be possible to get the GAF (ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf) available as a .gz, like the other data products we get from you?
Since this ticket has morphed from an issue to a (v useful) repository of information and discussion about IDs, I wanted to point out via @chris-grove that with the Alliance we have made a BGI file for SARS-CoV-2:
http://tazendra.caltech.edu/~azurebrd/var/work/chris/coronavirus_biogrid.json
this is "gene" centric, and has single entries for nsps etc.
With my Alliance hat on, we want to be able to project GO annotations from whatever GO chooses as the annotation unit. This will be unreliable if we don't have 1:1 mappings. E.g. if we do the conventional thing of mapping by uniprot accession then annotations from one cleavage product/"gene" will transfer to others on the same pp.
@kltm The following files are available now:
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi.gz
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa.gz
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf.gz
@alexsign Great--thank you! I'll try to get these bolted in and start testing quickly.
@alexsign, @cmungall Hi, So here is the list of unique identifiers that can be used for Noctua as we discussed earlier:
FROM R1AB_SARS2 (P0DTD1):
nsp1 P0DTD1:PRO_0000449619
nsp2 P0DTD1:PRO_0000449620
nsp3 P0DTD1:PRO_0000449621
nsp4 P0DTD1:PRO_0000449622
nsp5 P0DTD1:PRO_0000449623
nsp6 P0DTD1:PRO_0000449624
nsp7 P0DTD1:PRO_0000449625
nsp8 P0DTD1:PRO_0000449626
nsp9 P0DTD1:PRO_0000449627
nsp10 P0DTD1:PRO_0000449628
nsp12 (Pol) P0DTD1:PRO_0000449629
nsp13 (Hel) P0DTD1:PRO_0000449630
nsp14 (exoN) P0DTD1:PRO_0000449631
nsp15 P0DTD1:PRO_0000449632
nsp16 P0DTD1:PRO_0000449633
and FROM R1A_SARS2 (P0DTC1):
unique nsp11 P0DTC1:PRO_0000449645
That should solve the duplicate problem concerning the polyprotein issue.
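The selection rule behind this list (take the chain from the long polyprotein where one exists, fall back to the other parent otherwise) can be sketched as below. The function name and the `preferred_parent` parameter are hypothetical, the IDs come from this thread, and the hyphenated form is used for the identifiers:

```python
def pick_reference_chains(symbol_to_ids, preferred_parent="P0DTD1"):
    """For each symbol, keep the chain on the preferred polyprotein if there is one."""
    chosen = {}
    for symbol, ids in symbol_to_ids.items():
        preferred = [i for i in ids if i.startswith(preferred_parent + "-")]
        # Fall back to a deterministic choice among the remaining candidates
        chosen[symbol] = preferred[0] if preferred else sorted(ids)[0]
    return chosen

candidates = {
    "nsp6": ["P0DTC1-PRO_0000449640", "P0DTD1-PRO_0000449624"],
    "nsp10": ["P0DTC1-PRO_0000449644", "P0DTD1-PRO_0000449628"],
    "nsp11": ["P0DTC1-PRO_0000449645"],  # only present on R1A
}
reference = pick_reference_chains(candidates)
print(reference)
```

Hard-coding the preferred parent is exactly the kind of side-channel knowledge that a reference flag in the source data would make unnecessary.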
I thought we were going to use hyphens as separators, e.g. P0DTD1-PRO_0000449633?
We are :-) this list is just for Alex to select which member of the duplicate pair to select
I would prefer this done by a field in uniprot, e.g. is_reference field, rather than a secret file, and this is coordinated with what you use, but one step at a time!
@cmungall @pmasson55 Can you guys please take a look at this: https://www.uniprot.org/uniprot/P0DTD1.txt because this is the data I have access to, and let me know how I can come up with the names above without literally typing data into the GPI file.
@alexsign, @cmungall Hello, I'll try to answer your question Alex, hoping I understood correctly. From the entries, the best way would be to use the short name from the DE (description line) to match it with the FT chain ID. Now I see that the P0DTD1 entry should be fixed to be able to do that. For example, the nsp12 is the polymerase and we have two short names (both are fine actually) DE RecName: Full=RNA-directed RNA polymerase; DE Short=Pol; DE Short=RdRp; In that case I could pass the second one as alternative name if that helps, then we would have pol instead of nsp12 which is fine since it's a polymerase... For nsp13 and nsp14, the short recnames are hel and ExoN which are also fine, since it's the helicase and the exonuclease. Having them like that in Noctua is fine... The only issue would be nsp15 and nsp16 that have no shortnames, but I can easily add them (they are now designated as AltName: Full=nsp15). Alex, would that be a good and simple way for you to retrieve them?
That would be great if you could add them
Will we also get nspX as synonyms? I note we are missing a lot of these
@pmasson55 I had a chat with the UniProt production team. Any changes incorporated into next week's SwissProt freeze I should be able to pull into the GPI file.
@alexsign, @cmungall
In fact the entry you showed us (P0DTD1.txt) corresponds to the old version of R1AB_SARS2. To fetch the latest version, go to https://www.uniprot.org/ and follow the red coronavirus link at the top right of the page. Here is the link: https://covid-19.uniprot.org/uniprotkb/P0DTD1 There, all the nsps are mentioned as short names. Tell me if that works for making your files. Thx and have a nice weekend.
FYI: We had a helpdesk message today from a user asking for resolvable PRO chain IDs. I replied that it's on its way but that we have no ETA yet. Please let me know when you think the UniProt site will resolve them.
What do you mean by "resolvable PRO chain IDs"? Can you pls copy/paste the exact user question? Thanks!
Hi @redaschi
I was trying to find the part of the discussion where we debated
P0DTD1:PRO_0000449619
vs P0DTD1#PRO_0000449619
vs P0DTD1-PRO_0000449619
... can't find it, must have been in the email thread...
But what we decided was that P0DTD1-PRO_0000449619
should be a resolvable ID when people search in UniProt. It's what IntAct have been using for >15 years (ask Sandra for the history). But if you hit the website with it, it either finds nothing or returns only the canonical entry.
The user comment is:
"IntAct uses the following type of Ids/URLs: http://www.uniprot.org/uniprot/P0DTC1-PRO_0000449645
This is not a valid UniProt URL.
The part after the "-" is ignored, e.g., you can put anything there: http://www.uniprot.org/uniprot/P0DTC1-PRO_0000123456
A slightly better, but still unsatisfactory solution is to use this kind of link, which includes an anchor to the protein chain: http://www.uniprot.org/uniprot/P0DTC1#PRO_0000449645
Can you please work with UniProt to provide proper accession numbers and URLs so we can use these URLs in automated workflows?"
I was only trying to give everyone more info showing that users do need these IDs to be searchable :)
hi birgit, i appreciate your informing us of user requests :) the user correctly points out that the URL he found at IntAct does not work because it is not a valid UniProt URL. The IntAct website should use the URLs with an anchor (where i hope a developer will take mercy on me and finally fix that bug). unfortunately, the user does not explain why that URL is 'unsatisfactory' for her workflow. could you direct her to the uniprot helpdesk, please? thanks! nicole
Can we get some guidance on the appropriate style of prefixed identifier to use here? In this ticket we've gone around having bare chain IDs (e.g. UniProtKB:PRO_nnnnn), double-barreled IDs with every combination of hash, dash, colon...
hi chris, i requested a new CURIE prefix for uniprot chain identifiers at identifiers.org. they are currently processing it. hopefully you can soon use, e.g. "uniprot.chain:PRO_0000016681" and they'll resolve it to http://purl.uniprot.org/annotation/PRO_0000016681, which is the uniprot URI for chains, and uniprot.org resolves that to the correct web page (nb: the anchor problem seems to be a browser issue). i hope that helps.
OK, this is useful, it seemed earlier that we needed to keep the chain ID affixed to the parent accession, e.g. UniProtKB:Pnnnn-PRO_nnnn, but I prefer not to have composite IDs, so this is good.
It would be good to sync any changes with IntAct so that we all refer to these in the same way
Note in GO we use prefixed IDs of the form DB:LocalID. For example, UniProtKB:P08069
It seems that URL resolution will not work if we use UniProtKB:PRO_0000016681, so we'll have to add a new prefix to our registry. For consistency with identifiers.org we could just go with uniprot.chain as the prefix (conventionally we capitalize, and use underscores rather than dots, but there is nothing preventing us going with uniprot.chain). So the prefixed ID would be uniprot.chain:PRO_0000016681. Alex, would this work on your end?
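As a rough sketch of what adding such a prefix would mean in practice, a DB:LocalID identifier can be expanded to a URL with a simple prefix map. The `PREFIX_MAP` entries and the `expand_curie` helper are assumptions for illustration only: the uniprot.chain prefix was still being processed by identifiers.org at this point, and the real GO registry (db-xrefs.yaml) has its own format.

```python
# Hypothetical prefix map; the URL prefixes are the ones quoted in this thread.
PREFIX_MAP = {
    "uniprot.chain": "http://purl.uniprot.org/annotation/",
    "UniProtKB": "https://www.uniprot.org/uniprot/",
}

def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand a prefixed ID of the form DB:LocalID into a URL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a prefixed ID: {curie!r}")
    if prefix not in prefix_map:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local_id

print(expand_curie("uniprot.chain:PRO_0000016681"))
print(expand_curie("UniProtKB:P08069"))
```

With a dedicated prefix, the chain ID expands without knowing the parent accession, which is the reason a bare UniProtKB:PRO_0000016681 would not resolve against the entry URL.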
@redaschi @cmungall Actually, I don't like it at all. It will be inconsistent with the other UniProtKB prefixes in https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml (search for UniProtKB). It will also require special handling on both sides of the GO annotation pipeline (import/export) and a change to the curator tool Protein2GO, which is not going to be easy due to lack of resources. If the main concern is the link back to the website, then a simple replacement of P12345-PRO with P12345#PRO solves it, does it not? Plus, UniProtKB IDs for isoforms like P12345-2 already have "-" in the annotation files. This is certainly not my decision to make. But, at the very least, for consistency I would suggest using a UniProtKB-PRO or UniProtKB-CHAIN prefix.
Another thing to consider is whether the UniProt PRO IDs will stay unique.
hi alex, chris,
i had requested the CURIE "uniprot.chain:PRO_..." from identifiers.org because i thought it could be of general use (seeing that they already had "uniprot.isoform") and that it may help to solve your problem. if it does not, the GOA project can of course make its own CURIE/xref, with e.g. a UniProtKB-Chain prefix (more similar to the prefix style you seem to have at http://amigo.geneontology.org/xrefs). you just have to take care of the resolution yourselves. as alex pointed out, it is not difficult to build a valid URL (or uniprot PURL) from AC-PRO. what you cannot do is go back from the uniprot PURL to AC-PRO.
the uniprot PRO "uniqueness" is a very interesting question: they are currently unique in the sense that each PRO is only in one uniprot entry. but as i have learned in this thread from patrick, curators (have to) assign 2 different PRO to the same protein when they create different entries for the (precursor) products of ribosomal frameshifting. this makes no sense and leads to the problem that external dbs like GOA and IntAct do not know which of the 2 PROs they should annotate. so ideally these protein chains should have only one PRO, but that means that we would have the same PRO in 2 different uniprot entries, i.e. as long as you annotate to AC-PRO instead of PRO, your problem remains the same. of course, ideally i want to make separate uniprot entries for each chain of a viral polyprotein and then you could annotate to the AC instead. but meanwhile, we'll have to find a way to muddle through with what we got.
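To make the one-way nature of this concrete, here is a small sketch that builds both the anchored entry URL and the chain PURL from an AC-PRO identifier. The helper names are hypothetical, and the accession pattern is deliberately loose rather than the official UniProt accession grammar.

```python
import re

# Loose pattern for AC-PRO composite IDs (assumption, not the official grammar).
AC_PRO = re.compile(r"^(?P<ac>[A-Z0-9]+)-(?P<chain>PRO_\d+)$")

def _split(ac_pro):
    match = AC_PRO.match(ac_pro)
    if match is None:
        raise ValueError(f"not an AC-PRO identifier: {ac_pro!r}")
    return match.group("ac"), match.group("chain")

def anchor_url(ac_pro):
    """Entry URL with the chain as an anchor, e.g. P12345#PRO_..."""
    ac, chain = _split(ac_pro)
    return f"https://www.uniprot.org/uniprot/{ac}#{chain}"

def chain_purl(ac_pro):
    """UniProt PURL for the chain; note the accession is dropped."""
    _, chain = _split(ac_pro)
    return f"http://purl.uniprot.org/annotation/{chain}"

print(anchor_url("P0DTC1-PRO_0000449645"))
print(chain_purl("P0DTC1-PRO_0000449645"))
```

The PURL carries only the PRO part, so nothing in the string itself lets you recover the parent accession; that direction needs a lookup against UniProt.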
I'll leave it to you to decide what the best form of ID is - I just want it to be consistent, within GO, and preferably with other databases like IntAct!
Note that currently the GPI is using a dash:
$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi | grep nsp3
UniProtKB P0DTC1-PRO_0000449637 PL-PRO|nsp3 Non-structural protein 3 P0DTC1(819-2763) protein taxon:2697049 UniProtKB:P0DTC1
UniProtKB P0DTD1-PRO_0000449621 PL-PRO|nsp3 Non-structural protein 3 P0DTD1(819-2763)|rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1
but the GAF is using a colon:
$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf | grep PRO_0000449637
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0004197 GO_REF:0000024 ISS UniProtKB:P0C6U8:PRO_0000338257 F Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0004197 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 F Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0019785 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 F Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039502 GO_REF:0000024 ISS UniProtKB:P0C6U8:PRO_0000338257 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039502 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039547 GO_REF:0000024 ISS UniProtKB:P0C6U8:PRO_0000338257 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039548 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039579 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039644 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039714 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 C Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0039722 GO_REF:0000024 ISS UniProtKB:P0C6U8:PRO_0000338257 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0070536 GO_REF:0000024 ISS UniProtKB:P0C6U8:PRO_0000338257 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
UniProtKB P0DTC1 P0DTC1:PRO_0000449637 GO:0071108 GO_REF:0000024 ISS UniProtKB:P0C6X7:PRO_0000037311 P Replicase polyprotein 1a protein taxon:2697049 20200506 UniProt
note also the GAF is missing the symbols (nsp3, etc)
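Until the two files agree, a consumer could normalize the composite IDs before comparing them. The sketch below is only an illustration under assumptions: `normalize_chain_id` is a hypothetical helper, the accession pattern is loose, and choosing the dash as the target delimiter is arbitrary rather than a decision from this thread.

```python
import re

# Accepts dash, colon, or hash between accession and chain, with an
# optional leading "UniProtKB:" as seen in the GAF with/from column.
COMPOSITE = re.compile(
    r"^(?:UniProtKB:)?(?P<ac>[A-Z0-9]+)[-:#](?P<chain>PRO_\d+)$"
)

def normalize_chain_id(identifier, delimiter="-"):
    """Rewrite AC<sep>PRO IDs to one delimiter; leave other IDs untouched."""
    match = COMPOSITE.match(identifier)
    if match is None:
        return identifier  # e.g. plain accessions or isoforms like P12345-2
    return f"{match.group('ac')}{delimiter}{match.group('chain')}"

gpi_id = "P0DTC1-PRO_0000449637"   # as in the GPI above
gaf_id = "P0DTC1:PRO_0000449637"   # as in the GAF above
print(normalize_chain_id(gpi_id) == normalize_chain_id(gaf_id))
```

Isoform IDs such as P12345-2 do not match the pattern and pass through unchanged, so the dash in those is not mistaken for a chain delimiter.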
Add available SARS-CoV-2 data to the pipeline
Tagging @pgaudet @cmungall
Questions: