Closed kltm closed 1 year ago
@kltm you want a GPI file with only the one viral species ?
We previously discussed just the SARS-CoV-2 genome, but we could extend to the other coronavirus genomes. But let's do the SARS-CoV-2 genome first
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_sars-cov-2.gpi ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi
Great! @thomaspd indicated we may need the isoforms also - is this something that will require further upstream protein sequence curation in uniprot?
@cmungall I don't think we have this data in uniprot private release yet.
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa
@cmungall Your neo changes in https://github.com/geneontology/neo/pull/55 are failing with:
18:18:26 make: *** No rule to make target 'target/neo-goa_sars-cov-2.obo', needed by 'all_obo'. Stop.
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf
@cmungall I'm thinking now that the issues are around here:
datasets.json: trigger
wget http://s3.amazonaws.com/go-public/metadata/datasets.json -O $@ && touch $@
and
Makefile-gafs: datasets.json
./build-neo-makefile.py -i $< > $@.tmp && mv $@.tmp $@
Given this, without starting the rewrite of the
@kltm - recall that my changes to the Makefile hardcoded the URLs for the virus
from @lpalbou on gitter (better to report it here, where it won't get lost, than on gitter). FAO @alexsign
looking at the GPAD of covid: ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa we have two annotations for the same gene:
UniProtKB P0DTC2 part_of GO:0055036 GO_REF:0000044 ECO:0000322 UniProtKB-SubCell:SL-0275 20200321 UniProt go_evidence=IEA
UniProtKB P0DTC2:PRO_0000449647 enables GO:0005515 PMID:32132184 ECO:0000353 UniProtKB:Q9BYF1 20200320 UniProt go_evidence=IPI
But one is specified with a PRO ID and the other is not. Is this legitimate? Does it mean the second one should be treated as a different isoform, maybe?
My answer: no this looks like a bug.
@cmungall Yes, but the generation of the neo-* targets seems to be entirely done through the datasets.json metadata ball, which seems to be the origin of the error when doing the full build.
@cmungall it's a legit manual annotation done by Patrick.Masson@isb-sib.ch at SIB.
Hi,
That's not a bug, and you might see several of these discrepancies. This is the problem we have with viral polyproteins; see for example the coronavirus entry https://www.uniprot.org/uniprot/P0C6X7.txt These polyproteins are cleaved once synthesized, generating 10 to 15 viral proteins. Since these are post-translational cleavage products, they are represented by a single accession number (AC:P0C6X7 in this case). The problem is that if we use GO with this accession (as we did at the beginning of GO annotation), we end up with polyproteins that carry all the annotations of the 15 viral products, which is not meaningful. Taking cellular component as an example: if half of the proteins are cytoplasmic and half are nuclear, then the polyprotein entry will have both cytoplasm and nucleus annotations, which does not tell us much.
Instead, we started using PRO IDs to tag specific components of the polyprotein, so we can assign terms much more precisely. In addition, we can still assign terms to the full polyprotein (using just AC:P0C6X7) when necessary, since we sometimes have information about the uncleaved polyprotein (function or localization before cleavage). So you can have both: the accession number alone, and the accession number with a PRO ID.
I hope this is clearer now; I'll follow up if you have more questions.
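To make the distinction concrete, here is a minimal Python sketch that splits a GPAD object ID into the parent accession and an optional chain ID, so that annotations on the whole polyprotein and on an individual cleavage product are grouped separately. The helper name `split_object_id` and the reduced two-field records are assumptions for illustration; the IDs and GO terms are the ones from the two annotations quoted earlier in this thread.

```python
from collections import defaultdict

def split_object_id(object_id):
    """Split 'P0DTC2:PRO_0000449647' into ('P0DTC2', 'PRO_0000449647').

    A plain accession such as 'P0DTC2' yields ('P0DTC2', None).
    """
    accession, sep, chain = object_id.partition(":")
    return accession, (chain if sep else None)

# (object ID, GO term) pairs taken from the two GPAD lines quoted in this thread.
annotations = [
    ("P0DTC2", "GO:0055036"),
    ("P0DTC2:PRO_0000449647", "GO:0005515"),
]

# Group annotations by (accession, chain) so the two entities stay distinct.
by_product = defaultdict(list)
for object_id, go_term in annotations:
    by_product[split_object_id(object_id)].append(go_term)

for (accession, chain), terms in by_product.items():
    label = accession if chain is None else f"{accession} chain {chain}"
    print(label, terms)
```

Grouping on the pair rather than on the accession alone keeps the chain-level annotation from being merged into the full-length entry, which is the problem described above.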
It would be important to be able to capture that.
Important: everyone reading this thread should be aware that UniProt PRO IDs have nothing to do with the PRO ontology.
Thus far we have managed the different protein end products from a GCRP protein using identifiers that use the dash nomenclature, e.g. P0DTC2-n. I would prefer to use these here, but I don't know the uniprot rules for when these get created. Even though these are the products of post-translational cleavage, from a user's point of view they are still distinct gene products made by the same gene, so I don't see why they should be treated differently.
We can extend this to use the uniprot chain/pro IDs, but there are challenges
First, these don't seem to be resolvable. Using our existing registered prefixes the prefixed ID UniProtKB:P0DTC2:PRO_0000449647 would be resolved as:
https://www.uniprot.org/uniprot/P0DTC2:PRO_0000449647
but this is a 404
I would prefer to avoid double-barreled prefixes. Should the prefixed ID not be UniProtKB:PRO_0000449647?
but this also fails to resolve:
https://www.uniprot.org/uniprot/PRO_0000449647
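The resolution step discussed here can be sketched as a simple prefix expansion: the local ID is substituted into a URL template registered for the prefix. The template below is inferred from the URLs quoted in this comment; `PREFIX_URLS` and `expand_curie` are hypothetical names, not the actual GO resolver code.

```python
# Hypothetical prefix registry; the template is inferred from the URLs in this thread.
PREFIX_URLS = {"UniProtKB": "https://www.uniprot.org/uniprot/{local_id}"}

def expand_curie(curie):
    """Expand 'PREFIX:local_id' to a URL, splitting on the FIRST colon only."""
    prefix, _, local_id = curie.partition(":")
    return PREFIX_URLS[prefix].format(local_id=local_id)

# Double-barreled form: the local ID itself contains a colon.
print(expand_curie("UniProtKB:P0DTC2:PRO_0000449647"))
# Single-barreled form proposed above.
print(expand_curie("UniProtKB:PRO_0000449647"))
```

The first call only yields the intended URL because the split is on the first colon; a consumer that splits on the last colon would read `UniProtKB:P0DTC2` as the prefix, which is one practical reason to avoid a colon inside the local ID.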
There is also the problem that because some organisms use the PRO ontology, even if we use well-behaved prefixes, we will cause massive confusion in our community by using the uniprot PRO IDs and the PRO ontology at the same time.
Also remember every distinct entry in the gpa should be in the gpi, I don't see the PRO ID in the gpi...
I agree that it would be ideal if there were a way to resolve the UniProt PRO IDs, and we can ask the UniProt team (Maria, maybe?) how that might be done. But even if it's not done yet, it would be very helpful to have a GPI file that has the UniProt PRO ID, and lists the parent ID of the UniProt polyprotein record. From what Patrick had told me, the PRO ID (the chain within the polyprotein) has a name associated with it, so we could also get that in column 4 of the GPI file:
FT CHAIN 1..180
FT /note="Host translation inhibitor nsp1"
FT /evidence="ECO:0000250"
FT /id="PRO_0000037309"
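As a rough sketch of how the chain name and PRO ID could be pulled out of a record for the GPI, the Python below parses CHAIN features from flat-file style FT lines and attaches a parent ID. Several things here are assumptions for illustration: the column layout of the sample text, the function name `parse_chains`, the handling of single-line qualifiers only, and the pairing of this chain with P0C6X7 (the polyprotein mentioned earlier in the thread).

```python
import re

# The feature lines quoted above, laid out as in a UniProt flat file (assumed layout).
FT_TEXT = '''\
FT   CHAIN           1..180
FT                   /note="Host translation inhibitor nsp1"
FT                   /evidence="ECO:0000250"
FT                   /id="PRO_0000037309"
'''

# Matches single-line qualifiers such as: FT   /note="..."
QUALIFIER = re.compile(r'^FT\s+/(\w+)="(.*)"$')

def parse_chains(ft_text, parent_accession):
    """Return one dict per CHAIN feature, with its qualifiers and parent ID."""
    chains, current = [], None
    for line in ft_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "FT" and fields[1] == "CHAIN":
            current = {"range": fields[2], "parent": f"UniProtKB:{parent_accession}"}
            chains.append(current)
        elif len(fields) >= 2 and fields[0] == "FT" and not fields[1].startswith("/"):
            current = None  # a different feature type starts here
        elif current is not None:
            match = QUALIFIER.match(line)
            if match:
                current[match.group(1)] = match.group(2)
    return chains

for chain in parse_chains(FT_TEXT, "P0C6X7"):
    # 'note' would feed GPI column 4 (name); 'parent' the parent object ID column.
    print(chain["id"], chain["note"], chain["parent"], sep="\t")
```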
@alexsign, would adding a line to the GPI file for each chain be doable?
My preference would be to avoid the MGI-style double-barreled delimiter and have the global ID be simply UniProtKB:PRO_nnnnn (i.e. col2 of the GAF is PRO_nnnnn), and to have UniProt resolve https://www.uniprot.org/uniprot/PRO_0000449647
@alexsign any comments on these suggestions ?
@cmungall Hi Chris, annotations to UniProt "PRO" IDs (chains, polypeptides and so on) are not new; we have been producing and publishing this kind of data for a while. For my part, I see no issue with including an extra line in the GPI file if it will really help. I assume you need something like this:
UniProtKB P0DTC2 S Spike glycoprotein S|2 protein taxon:2697049
UniProtKB P0DTC2:PRO_0000449647 S Spike glycoprotein S|2 protein taxon:2697049
Now, about the links on the UniProt website: please keep in mind this is PRE-release data, so even a simple link like https://www.uniprot.org/uniprot/P0DTC2 will not work. The data is simply not there in the current public release of the website. The next public release of the UniProt website, which will include the sars-cov-2 data, is on April 22nd; the link should work after that date. The UniProt consortium understands the importance of this data and has created a data portal for everyone to use. Please try the following link to see it: https://covid-19.uniprot.org/uniprotkb/P0DTC2
If links are an absolute must, you have to strip the :PRO... IDs from the link (this is how it's done in QuickGO) or replace the ":" with a "#" symbol. For the time being you have to use https://covid-19.uniprot.org/uniprotkb/ before the actual ID instead of https://www.uniprot.org/uniprot/.
Both links
https://covid-19.uniprot.org/uniprotkb/P0DTC2
and
https://covid-19.uniprot.org/uniprotkb/P0DTC2#PRO_0000449647
work the same way right now.
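The two link-building options described above (strip the chain ID, or turn the ":" into "#") can be sketched in Python as follows. The base URL is the portal address given in this comment; the function name `portal_link` and its `keep_chain` flag are hypothetical.

```python
# Portal base URL as given in this comment (valid for the pre-release period).
PORTAL_BASE = "https://covid-19.uniprot.org/uniprotkb/"

def portal_link(object_id, keep_chain=True):
    """Build a portal link from an ID such as 'P0DTC2:PRO_0000449647'.

    keep_chain=True  -> replace ':' with '#' so the chain ID becomes a fragment.
    keep_chain=False -> strip the chain ID entirely (the QuickGO-style option).
    """
    accession, sep, chain = object_id.partition(":")
    if sep and keep_chain:
        return f"{PORTAL_BASE}{accession}#{chain}"
    return PORTAL_BASE + accession

print(portal_link("P0DTC2:PRO_0000449647"))
print(portal_link("P0DTC2:PRO_0000449647", keep_chain=False))
```

Because the part after "#" is a URL fragment, both forms request the same page from the server, which is consistent with the two links behaving identically.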
I have an issue open with our web development team to make the P0DTC2#PRO_0000449647 link scroll to the PRO ID information as well.
I understand it's not ideal, and I am open to alternative suggestions.
Making links like
https://www.uniprot.org/uniprot/P0DTC2:PRO_0000449647
work might be possible in the future as well, but it needs to be discussed with the web developers.
Unfortunately, your preferred link https://www.uniprot.org/uniprot/PRO_0000449647 might never work for UniProt because "PRO" IDs are part of the protein record and not a separate entity, and not indexed as such. They simply provide you with extra information.
We can discuss changes we can make from both sides to make this data public ASAP without changing too much in our pipelines. The UniProtKB syntax from db-xrefs.yaml for now is:
idsyntax: ([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z]([0-9][A-Z][A-Z0-9]{2}){1,2}[0-9])((-[0-9]+)|:PRO_[0-9]{10}|:VAR_[0-9]{6}){0,1}
I'd recommend against the use of ':' within the local identifier. Is it possible to use something else, e.g. a dash or underscore?
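A quick way to see which ID forms this syntax accepts is to run candidate local IDs through the pattern. The sketch below uses my reading of the idsyntax, restoring the `[A-NR-Z](...)` group and the underscore in `PRO_` that appear to have been lost in the rendering of this comment; treat that reconstruction, and the helper name `is_valid`, as assumptions rather than the authoritative db-xrefs.yaml entry.

```python
import re

# Reconstructed UniProtKB idsyntax (assumption; see note above).
UNIPROTKB_ID = re.compile(
    r"([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z]([0-9][A-Z][A-Z0-9]{2}){1,2}[0-9])"
    r"((-[0-9]+)|:PRO_[0-9]{10}|:VAR_[0-9]{6}){0,1}"
)

def is_valid(local_id):
    """True if the whole local ID matches the reconstructed syntax."""
    return UNIPROTKB_ID.fullmatch(local_id) is not None

candidates = [
    "P0DTC2",                 # plain accession
    "P0DTC2-1",               # dash form with a numeric suffix
    "P0DTC2:PRO_0000449647",  # colon-separated chain ID
    "P0DTC2-PRO_0000449647",  # dash-separated chain ID
    "PRO_0000449647",         # bare chain ID
]
for candidate in candidates:
    print(candidate, is_valid(candidate))
```

Under this pattern only the first three forms validate, so switching the separator away from ":" or using a bare PRO ID would also require a change to the registered syntax.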
Are the viral SARS/SARS2 genomes available for GO-CAM? I tried UniProtKB:P0DTC2 and it seems that it doesn't recognize it.
Update:
We want to annotate two genomes. @alexsign, can we also have a GPI for SARS-CoV, as well as for SARS-CoV-2?
polyproteins: while we don't like annotating to the chain IDs we will do this as a temporary measure, until these are made into bona-fide entries. Can you go ahead and add this to the GPIs @alexsign? Thanks! Use whatever separator is easiest but my preference is still for an alternative to :
@cmungall The reference proteome data for SARS-CoV is publicly available at ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/55962.H_SARS_coronavirus.goa As for GPI and GPAD, would you like the reference proteome of everything for tax_id:694009?
Not so long ago I created ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_virus_bacteria.gpi for @pgaudet; running grep "taxon:694009" uniprot_reviewed_virus_bacteria.gpi will do the trick, unless you want combined SARS-CoV* files posted somewhere.
As for the separator, are there any objections to "#"? I'll investigate what can be done here with the GOA pipeline, our collaborators and our users.
@kltm: @pmasson55 is asking when the entries in the GPI will be available for annotation
@cmungall @pmasson55 These should now be available in the autocomplete for Noctua.
@cmungall @pmasson55 These should now be available in the autocomplete for Noctua.
Since when ? Patrick tried today and it wasn’t available.
I just checked and I think there are several issues:
Only 5 of the 13 IDs provided are available in the autocomplete:
P0DTD2 is the label. See the gpi.
$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi | grep P0DTD2 | cut -f3
P0DTD2
P0DTC7 is not working for instance.
I see it
We got them in late last week. All entities provided in the GPI are in the system, but the reality of how some of that reflects in the system appears to be a little lacking. If you put scov2 into the search, they should all be coming up. Some of the entities have labels like "6", which is not super helpful. They can be seen here http://noctua-amigo.berkeleybop.org/amigo/search/ontology by searching for scov2 and removing the "go" filter.
Correct, everything is working as expected, the entries in the GPI file are available for search.
I agree some of the symbols/labels are a bit suboptimal
@alexsign why is '6' used as the symbol? I believe the more conventional symbol is nsp6. This is what NCBI uses: https://www.ncbi.nlm.nih.gov/protein/YP_009725302.1
This also aligns with what we use on http://www.geneontology.xyz/covid-19.html
Here is a suggestion - how would we feel about using PR for identifiers, at least as an interim measure?
They have a GPI here:
https://proconsortium.org/download/development/pro_sars2.gpi
Also obo:
https://proconsortium.org/download/development/pro_sars2.obo
@cmungall did you test on noctua.geneontology.org? I tried on both Safari and Chrome on two different computers and my iPad, with no result:
Noctua Form:
Graph Editor:
The autocomplete with P0DT still only yields 5 results.
P0DTD2 is the label. See the gpi.
Meant the name "Protein 9b" in this case. "Scov2" was not in the GPI.
If you put scov2 into the search, they should all be coming up
I can find 10 / 11 IDs on noctua.geneontology.org using "Scov2", so @pmasson55 you can use that workaround for the moment to annotate most virus genes. @kltm :
As a side note, those IDs don't seem to be loaded in golr-aux.geneontology.io and golr.geneontology.io:
If that's the case, it's unclear to me why we have a partial release, and I don't think there is anything documenting that at the moment.
Some of the entities have labels like "6", which is not super helpful.
Agreed. @alexsign @pmasson55 any chance of getting a better label? It would improve the search.
Let's avoid overloading this ticket with unrelated issues. I chose the suffix. Note that all entries in neo have a species suffix; this is not new.
Sorry for brevity in reviews most of today
The use of species suffixes to constrain search on gps is documented here: http://wiki.geneontology.org/index.php/Noctua#3.a._Enter_gene_product_or_macromolecular_complex_to_be_annotated
did you test on noctua.geneontology.org?
yep!
I tried on both Safari and Chrome on two different computers and my iPad and no result:
I think there is a straightforward explanation. The behavior of AC in Form and GE differ. (I do not know why the GE AC code was not reused.)
Form does not search on the localId (the part of the ID after the ':', i.e col2 of the GPI). You have to AC on the symbol, which works:
In some cases, the GPI uses a localId like P0DTD2 as a symbol, so of course these show in AC.
GE searches on localId
To summarize, the system is working with the sars-cov-2 gpi in exactly the same way it works with other gpis.
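The difference between the two autocompletes, as described above, can be illustrated with a toy model: one lookup matches on the symbol, the other on the localId. This is not Noctua code; the function names, the prefix-matching rule and the omission of species suffixes are all simplifying assumptions. The three entries are taken from the sars-cov-2 GPI.

```python
# (localId, symbol, name) entries taken from the sars-cov-2 GPI.
ENTRIES = [
    {"local_id": "P0DTD2", "symbol": "P0DTD2", "name": "Protein 9b"},
    {"local_id": "P0DTC7", "symbol": "7a", "name": "Protein 7a"},
    {"local_id": "P0DTC6", "symbol": "6", "name": "Non-structural protein 6"},
]

def form_autocomplete(query):
    """Toy model of the Form: matches on the symbol only."""
    q = query.lower()
    return [e["local_id"] for e in ENTRIES if e["symbol"].lower().startswith(q)]

def graph_editor_autocomplete(query):
    """Toy model of the Graph Editor: matches on the localId."""
    q = query.lower()
    return [e["local_id"] for e in ENTRIES if e["local_id"].lower().startswith(q)]

print(form_autocomplete("P0DT"))          # only entries whose symbol is an accession
print(graph_editor_autocomplete("P0DT"))  # every entry whose localId starts with P0DT
```

In this model P0DTD2 is found by both lookups only because its symbol happens to be its accession, which matches the behaviour reported in this thread.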
Things can certainly be improved from a user-perspective. This includes making more standard and searchable labels (e.g nsp6, see my prev comments; also make "ORF9b" be the label instead of "P0DTD2"). Other improvements are outside the scope of this ticket (standardizing AC behavior across Noctua, making all AC search on localId).
Sorry this has become one of those tickets with multiple hard to follow threads.
This is addressed to @alexsign: I suggest using - as the internal separator in the localId, as this is what IntAct uses. For example, the nsp11 protein has the local ID P0DTC1-PRO_0000449645 here:
<interactor id="3674141">
<names>
<shortLabel>nsp11_wcpv</shortLabel>
<fullName>Non-structural protein 11</fullName>
</names>
<xref>
<primaryRef db="uniprotkb" dbAc="MI:0486" id="P0DTC1-PRO_0000449645" refType="identity" refTypeAc="MI:0356"/>
<secondaryRef db="intact" dbAc="MI:0469" id="EBI-25496252" refType="chain-parent" refTypeAc="MI:0951"/>
<secondaryRef db="intact" dbAc="MI:0469" id="EBI-25475882" refType="identity" refTypeAc="MI:0356"/>
</xref>
There seems to be a little confusion with the Noctua GOlr load contents and how those are getting exposed. Hopefully we can start sorting some of that out, but for the time being, we have a new release of NEO that's been loaded into the Noctua GOlr that should have the local ids available as synonyms, which should hopefully correct some of the behaviors above. Tagging @lpalbou @cmungall
@cmungall @lpalbou 6 is what I have in the database. You can also see it here: https://covid-19.uniprot.org/uniprotkb/P0DTC6
@cmungall ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa updated with ACCESSION-PRO... identifiers
Thanks! Can you also fill in the parent field (see PRO file for reference)
On Wed, Apr 8, 2020 at 7:28 AM Alex Ignatchenko notifications@github.com wrote:
@cmungall https://github.com/cmungall ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa updated with ACCESSION-PRO... identifiers
@cmungall the uniprot_sars-cov-2.gpi updated with parent_object_id
Thanks! Can you remove the self-parents?
On Wed, Apr 8, 2020 at 8:28 AM Alex Ignatchenko notifications@github.com wrote:
@cmungall https://github.com/cmungall the uniprot_sars-cov-2.gpi updated with parent_object_id
@cmungall Not sure I understand correctly. Do you want to remove parent_object_id from records with no PRO... IDs?
Yes, e.g this one:
UniProtKB P0DTC1 P0DTC1 Replicase polyprotein 1a protein taxon:2697049 UniProtKB:P0DTC1
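The requested clean-up amounts to blanking the parent column whenever it points back at the record itself. A minimal Python sketch, assuming tab-delimited GPI rows with the parent object ID in column 8 (the function name `drop_self_parent` and the tab layout of the sample rows are assumptions; the values come from the rows quoted in this thread):

```python
def drop_self_parent(gpi_line):
    """Blank the parent column (col 8) when it equals the row's own DB:ID."""
    cols = gpi_line.rstrip("\n").split("\t")
    if len(cols) >= 8 and cols[7] == f"{cols[0]}:{cols[1]}":
        cols[7] = ""
    return "\t".join(cols)

# Full-length entry that lists itself as its parent (the case to be fixed).
self_parent = "UniProtKB\tP0DTC1\tP0DTC1\tReplicase polyprotein 1a\t\tprotein\ttaxon:2697049\tUniProtKB:P0DTC1"
# Chain entry whose parent is the full-length record (must be left alone).
chain = "UniProtKB\tP0DTC2-PRO_0000449647\tS\tSpike glycoprotein\tS|2\tprotein\ttaxon:2697049\tUniProtKB:P0DTC2"

for row in (self_parent, chain):
    print(drop_self_parent(row).split("\t")[7] or "(no parent)")
```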
On Wed, Apr 8, 2020 at 9:16 AM Alex Ignatchenko notifications@github.com wrote:
@cmungall https://github.com/cmungall not sure if I understand right. do you want to remove parent_objectid from record with no PRO... ids?
@cmungall the file is updated now
@alexsign Apologies, but could you make the GAF (ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf) available as a .gz, like the other data products we get from you?
Reviewing where we are at here.
This is what uniprot is providing for sars-cov-2:
UniProtKB P0DTC6 6 Non-structural protein 6 6 protein taxon:2697049
UniProtKB A0A663DJA2 ORF10 ORF10 protein ORF10 protein taxon:2697049
UniProtKB P0DTC9 N Nucleoprotein N protein taxon:2697049
UniProtKB P0DTD3 ORF14 Uncharacterized protein 14 ORF14 protein taxon:2697049
UniProtKB P0DTD2 P0DTD2 Protein 9b protein taxon:2697049
UniProtKB P0DTC7 7a Protein 7a 7a protein taxon:2697049
UniProtKB P0DTC2 S Spike glycoprotein S|2 protein taxon:2697049
UniProtKB P0DTC4 E Envelope small membrane protein E|4 protein taxon:2697049
UniProtKB P0DTD1 rep Replicase polyprotein 1ab rep|1a-1b protein taxon:2697049
UniProtKB P0DTC5 P0DTC5 Membrane protein protein taxon:2697049
UniProtKB P0DTC1 P0DTC1 Replicase polyprotein 1a protein taxon:2697049
UniProtKB P0DTD8 P0DTD8 Protein non-structural 7b protein taxon:2697049
UniProtKB P0DTC8 P0DTC8 Non-structural protein 8 protein taxon:2697049
UniProtKB P0DTC3 3a Protein 3a 3a protein taxon:2697049
UniProtKB P0DTC2-PRO_0000449647 S Spike glycoprotein S|2 protein taxon:2697049 UniProtKB:P0DTC2
These are all available for autocomplete in Noctua
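One way to list the entries whose symbols are hard to search for is to flag rows where the symbol is just the accession or a bare number. This is only a sketch: the rule in `is_weak_symbol` is my assumption about what counts as an unhelpful label, based on the "6" and "P0DTD2" examples raised earlier in this thread, and the pairs are a subset of the rows listed above.

```python
# (local ID, symbol) pairs copied from a subset of the GPI rows listed above.
ROWS = [
    ("P0DTC6", "6"),
    ("P0DTC9", "N"),
    ("P0DTD2", "P0DTD2"),
    ("P0DTC7", "7a"),
    ("P0DTC5", "P0DTC5"),
    ("P0DTC2", "S"),
]

def is_weak_symbol(local_id, symbol):
    """Flag symbols that repeat the accession or consist only of digits."""
    return symbol == local_id or symbol.isdigit()

weak = [local_id for local_id, symbol in ROWS if is_weak_symbol(local_id, symbol)]
print(weak)
```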
Some things that can be improved:
Also, @alexsign can we have this for SARS-CoV as well?
Add available SARS-CoV-2 data to the pipeline
Tagging @pgaudet @cmungall
Questions: