Add available coronavirus data to the pipeline

kltm commented 4 years ago

Add available SARS-CoV-2 data to the pipeline

[x] @alexsign to produce GPI file
[x] add to NEO (change config line in Makefile)
[x] @alexsign to produce GAF/GPAD (this will be mostly interpro2go etc to start with)
[ ] Add GAF/GPAD to yaml, so can be loaded into amigo, added to release files
[ ] Patrick/ViralZone will do a GO-CAM for SARS-CoV-2
[ ] This should naturally flow into GO-CAM site. @lpalbou look into a way to highlight
[ ] @alexsign to load GPADs emanating from GO-CAMs back into GOA
[ ] UPDATE 2020-05-13: do the same for SARS-CoV

Tagging @pgaudet @cmungall

Questions:

@alexsign is it easy for you to give us the GPI in advance of the proteins going in to uniprot main release? If not, it is trivial for us to parse the xml from ftp://ftp.uniprot.org/pub/databases/uniprot/pre_release/
@alexsign will the GPI include all of the isoforms? Judging from coronavirus.xml on the EBI FTP site, at the moment there are only accessions for the GCRP proteins.

pgaudet commented 4 years ago

@kltm you want a GPI file with only the one viral species ?

cmungall commented 4 years ago

We previously discussed just the SARS-CoV-2 genome, but we could extend to the other coronavirus genomes. But let's do the SARS-CoV-2 genome first

alexsign commented 4 years ago

ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_sars-cov-2.gpi
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi

cmungall commented 4 years ago

Great! @thomaspd indicated we may need the isoforms also - is this something that will require further upstream protein sequence curation in uniprot?

alexsign commented 4 years ago

@cmungall I don't think we have this data in uniprot private release yet.

alexsign commented 4 years ago

ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa

kltm commented 4 years ago

@cmungall Your neo changes in https://github.com/geneontology/neo/pull/55 are failing with:

18:18:26  make: *** No rule to make target 'target/neo-goa_sars-cov-2.obo', needed by 'all_obo'.  Stop.

alexsign commented 4 years ago

ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf

kltm commented 4 years ago

@cmungall I'm thinking now that the issues are around here:

datasets.json: trigger
    wget http://s3.amazonaws.com/go-public/metadata/datasets.json -O $@ && touch $@

and

Makefile-gafs: datasets.json
    ./build-neo-makefile.py -i $< > $@.tmp && mv $@.tmp $@

Given this, without starting the rewrite of the

[ ] remove your changes from the Makefile
[ ] ~~get datasets.json into the main pipeline (maybe under different name)~~
[ ] ~~point this Makefile at the new correct upstream~~
[ ] add the COVID-19 GPI metadata
[ ] rerun (and generate new Makefile-gafs) once upstream dataset is updated

cmungall commented 4 years ago

@kltm - recall that my changes to the Makefile hardcoded the URLs for the virus

cmungall commented 4 years ago

From @lpalbou on gitter (better to report here, where it won't get lost, than on gitter). FAO @alexsign

looking at the GPAD of covid: ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa we have two annotations for the same gene:

UniProtKB    P0DTC2    part_of    GO:0055036    GO_REF:0000044    ECO:0000322    UniProtKB-SubCell:SL-0275        20200321    UniProt        go_evidence=IEA
UniProtKB    P0DTC2:PRO_0000449647    enables    GO:0005515    PMID:32132184    ECO:0000353    UniProtKB:Q9BYF1        20200320    UniProt        go_evidence=IPI

But one is specified with a PRO ID and the other is not. Is it legit? Does it mean the second one should be treated as a different isoform, maybe?

My answer: no this looks like a bug.

kltm commented 4 years ago

@cmungall Yes, but the generation of the neo-* targets seems to be entirely done through the datasets.json metadata ball, which seems to be the origin of the error when doing the full build.
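One way to read the failure: if the per-dataset rules are generated from the metadata, then a target that is hardcoded elsewhere but has no metadata record never gets a rule. A minimal Python sketch of that idea follows. The record layout (`{"id": ...}`), the function names and `example_dataset` are assumptions for illustration only; the real datasets.json schema and the logic of build-neo-makefile.py are not shown in this thread.

```python
def neo_obo_targets(datasets):
    """Build one OBO target path per dataset record (hypothetical schema: {'id': ...})."""
    return [f"target/neo-{record['id']}.obo" for record in datasets]


def targets_without_rules(required, datasets):
    """Return the required targets that no dataset record would generate a rule for."""
    generated = set(neo_obo_targets(datasets))
    return [target for target in required if target not in generated]


# 'example_dataset' is a made-up id; the sars-cov-2 target name comes from the make error.
metadata = [{"id": "example_dataset"}]
required = ["target/neo-example_dataset.obo", "target/neo-goa_sars-cov-2.obo"]
print(targets_without_rules(required, metadata))
```

Under these assumptions, adding a record for the new dataset to the metadata is what makes the missing rule appear.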

alexsign commented 4 years ago

@cmungall it's a legit manual annotation done by Patrick.Masson@isb-sib.ch at SIB.

pmasson55 commented 4 years ago

Hi,

That's not a bug, and you might see several of these discrepancies. This is the problem we have with viral polyproteins. Take the coronavirus example: https://www.uniprot.org/uniprot/P0C6X7.txt These polyproteins are cleaved once synthesized, leading to the generation of 10 to 15 viral proteins. Since they are post-translational cleavage products, they are represented with only one accession number (AC:P0C6X7 in this case).

The problem is that if we use GO with this accession (and we did that at the beginning of GO annotation), you end up with polyproteins that carry all the annotations of the 15 viral products, which doesn't mean anything in the end. Taking the cellular component example: if half of the proteins are cytoplasmic and half are nuclear, then the polyprotein entry will have both cytoplasm and nucleus annotations, which doesn't tell you much.

Instead, we started using the PRO ID to tag specific components of the polyprotein, so we can assign terms much more precisely. In addition, we can still assign terms to the full polyprotein (using just AC:P0C6X7) if necessary, as we sometimes have information about the uncleaved polyprotein (function or localization before cleavage). So you can have both: just the accession number, and the accession number with a PRO ID. Hope this is clearer now; I'll follow up if you have more questions.
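To make the distinction concrete, here is a small Python sketch that keeps annotations to the whole polyprotein apart from annotations to one of its chains. It assumes tab-separated GPAD rows with the object ID in column 2 and the GO ID in column 4, and the `:` separator used in the two rows quoted earlier in this thread; the function names are illustrative, not part of any GO tool.

```python
from collections import defaultdict


def split_object_id(local_id):
    """Split 'P0DTC2:PRO_0000449647' into ('P0DTC2', 'PRO_0000449647').

    A bare accession yields (accession, None), i.e. the uncleaved protein.
    """
    accession, sep, chain = local_id.partition(":")
    return accession, (chain if sep else None)


def group_by_entity(gpad_lines):
    """Map (accession, chain_id) to the list of GO IDs annotated to it."""
    grouped = defaultdict(list)
    for line in gpad_lines:
        cols = line.rstrip("\n").split("\t")
        grouped[split_object_id(cols[1])].append(cols[3])
    return dict(grouped)


rows = [
    "UniProtKB\tP0DTC2\tpart_of\tGO:0055036\tGO_REF:0000044\tECO:0000322",
    "UniProtKB\tP0DTC2:PRO_0000449647\tenables\tGO:0005515\tPMID:32132184\tECO:0000353",
]
annotations = group_by_entity(rows)
```

Here the key `("P0DTC2", None)` holds the annotation to the full entry, while `("P0DTC2", "PRO_0000449647")` holds the one made to the chain.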

pgaudet commented 4 years ago

It would be important to be able to capture that.

cmungall commented 4 years ago

Important: everyone reading this thread should be aware that the UniProt PRO IDs have nothing to do with the PRO ontology.

Thus far we have managed the different protein end products from a GCRP protein using identifiers that use the dash nomenclature, e.g. P0DTC2-n. I would prefer to use these here, but I don't know the UniProt rules for when these get created. Even though these are the products of post-translational cleavage, from a user's point of view they are still distinct gene products made by the same gene, so I don't see why they should be treated differently.

We can extend this to use the uniprot chain/pro IDs, but there are challenges

First, these don't seem to be resolvable. Using our existing registered prefixes, the prefixed ID UniProtKB:P0DTC2:PRO_0000449647 would be resolved as:

https://www.uniprot.org/uniprot/P0DTC2:PRO_0000449647

but this is a 404

I would prefer to avoid double-barreled prefixes. Should the prefixed ID not be UniProtKB:PRO_0000449647?

but this also fails to resolve:

https://www.uniprot.org/uniprot/PRO_0000449647

There is also the problem that, because some organisms use the PRO ontology, even if we use well-behaved prefixes we will cause massive confusion in our community by using the UniProt PRO IDs and the PRO ontology at the same time.
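The resolution step described in this comment can be sketched in a few lines of Python. The template dictionary is a stand-in for the real prefix registry and only mirrors the URLs quoted in this comment; `expand_curie` is a hypothetical helper, not an existing GO function.

```python
# Prefix -> URL template. A stand-in for the real registry; the UniProtKB
# entry simply mirrors the URLs quoted in the comment.
URL_TEMPLATES = {"UniProtKB": "https://www.uniprot.org/uniprot/{local_id}"}


def expand_curie(curie):
    """Expand 'PREFIX:LOCAL' into a URL.

    Only the FIRST colon separates prefix from local ID, so a second colon
    (as in a double-barreled ID) ends up inside the URL path.
    """
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in URL_TEMPLATES:
        raise ValueError(f"cannot expand {curie!r}")
    return URL_TEMPLATES[prefix].format(local_id=local_id)


print(expand_curie("UniProtKB:P0DTC2:PRO_0000449647"))
print(expand_curie("UniProtKB:PRO_0000449647"))
```

Both expansions are mechanically well-formed; whether the resulting URLs resolve depends entirely on the UniProt website.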

cmungall commented 4 years ago

Also remember every distinct entry in the gpa should be in the gpi, I don't see the PRO ID in the gpi...

thomaspd commented 4 years ago

I agree that it would be ideal if there were a way to resolve the UniProt PRO IDs, and we can ask the UniProt team (Maria, maybe?) how that might be done. But even if it's not done yet, it would be very helpful to have a GPI file that has the UniProt PRO ID, and lists the parent ID of the UniProt polyprotein record. From what Patrick had told me, the PRO ID (the chain within the polyprotein) has a name associated with it, so we could also get that in column 4 of the GPI file:

FT CHAIN 1..180
FT /note="Host translation inhibitor nsp1"
FT /evidence="ECO:0000250"
FT /id="PRO_0000037309"

@alexsign, would adding a line to the GPI file for each chain be doable?
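As a rough illustration of what "a line per chain" could look like, the Python sketch below pulls the `/note` and `/id` qualifiers out of CHAIN features and builds one tab-separated row per chain, with the chain name in column 4 and the parent record in the last column. The eight-column layout, the blank symbol, the `PARENT_ACC` and `taxon:0` placeholders and both function names are assumptions made for the example; this is not the GOA pipeline.

```python
import re

QUALIFIER = re.compile(r'/(note|id)="([^"]*)"')


def chains_from_features(ft_lines):
    """Collect the /note and /id qualifiers of every CHAIN feature.

    Assumes each qualifier fits on one line; wrapped qualifiers are not handled.
    """
    chains, current = [], None
    for line in ft_lines:
        body = line[2:].strip() if line.startswith("FT") else line.strip()
        if body.startswith("CHAIN"):
            current = {}
            chains.append(current)
        elif body and not body.startswith("/"):
            current = None  # a different feature type begins here
        elif current is not None:
            match = QUALIFIER.match(body)
            if match:
                current[match.group(1)] = match.group(2)
    return chains


def chain_gpi_rows(parent_accession, chains, taxon, separator=":"):
    """Build one tab-separated row per chain that has an /id qualifier."""
    rows = []
    for chain in chains:
        if "id" not in chain:
            continue
        columns = [
            "UniProtKB",
            f"{parent_accession}{separator}{chain['id']}",
            "",  # symbol: left blank here, to be supplied by the data provider
            chain.get("note", ""),  # chain name goes in column 4
            "",  # synonyms
            "protein",
            taxon,
            f"UniProtKB:{parent_accession}",  # parent polyprotein record
        ]
        rows.append("\t".join(columns))
    return rows


features = [
    "FT CHAIN 1..180",
    'FT /note="Host translation inhibitor nsp1"',
    'FT /evidence="ECO:0000250"',
    'FT /id="PRO_0000037309"',
]
# PARENT_ACC and taxon:0 are placeholders, not real identifiers.
rows = chain_gpi_rows("PARENT_ACC", chains_from_features(features), "taxon:0")
```

The separator is a parameter because the thread has not yet settled on one.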

cmungall commented 4 years ago

My preference would be to avoid the MGI-style double-barreled delimiter, and have the global ID be simply UniProtKB:PRO_nnnnn (i.e. col2 of the GAF is PRO_nnnnn), and have UniProt resolve https://www.uniprot.org/uniprot/PRO_0000449647

pgaudet commented 4 years ago

@alexsign any comments on these suggestions ?

alexsign commented 4 years ago

@cmungall Hi Chris, annotations to the UniProt "PRO" IDs (chains, poly-peptides and so on) are not a new thing. We have been producing and publishing this kind of data for a while. For my part, I see no issue with including an extra line in the GPI file if this will really help. I assume you need something like this:

UniProtKB P0DTC2 S Spike glycoprotein S|2 protein taxon:2697049
UniProtKB P0DTC2:PRO_0000449647 S Spike glycoprotein S|2 protein taxon:2697049

Now about the links on the UniProt website. Please keep in mind this is PRE-release data, so even a simple link like https://www.uniprot.org/uniprot/P0DTC2 will not work. The data is simply not there in the current public release of the website. The next UniProt website public release, which will have sars-cov-2 data, is on April 22nd. The link should work after this date. The UniProt consortium understands the importance of this data and created a data portal for everyone to use. Please try the following link to see it: https://covid-19.uniprot.org/uniprotkb/P0DTC2

If links are an absolute must, you have to strip the :PRO... IDs from the link (this is how it's done in QuickGO), or replace the ":" with a "#" symbol. For the time being you have to use https://covid-19.uniprot.org/uniprotkb/ before the actual ID instead of https://www.uniprot.org/uniprot/. Both links https://covid-19.uniprot.org/uniprotkb/P0DTC2 and https://covid-19.uniprot.org/uniprotkb/P0DTC2#PRO_0000449647 work the same way right now.
I have an issue open with our web development team to make the P0DTC2#PRO_0000449647 link scroll to the PRO ID information as well. I understand it's not ideal, and I am open to alternative suggestions.
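The two link-building options just described (strip the chain ID, or turn the ":" into "#") can be sketched as a small Python helper. The base URL is the pre-release portal quoted in this comment; `portal_link` and its `keep_chain` flag are names made up for this example.

```python
# Pre-release portal base URL, as quoted in the comment.
PORTAL_BASE = "https://covid-19.uniprot.org/uniprotkb/"


def portal_link(local_id, keep_chain=False):
    """Build a portal link for 'ACCESSION' or 'ACCESSION:PRO_...'.

    keep_chain=False strips the chain ID; keep_chain=True keeps it as a
    '#' fragment, which at the time of writing loads the same page.
    """
    accession, sep, chain = local_id.partition(":")
    if keep_chain and sep:
        return f"{PORTAL_BASE}{accession}#{chain}"
    return f"{PORTAL_BASE}{accession}"


print(portal_link("P0DTC2:PRO_0000449647"))
print(portal_link("P0DTC2:PRO_0000449647", keep_chain=True))
```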

Making links like https://www.uniprot.org/uniprot/P0DTC2:PRO_0000449647 work might be possible in the future as well, but it needs to be discussed with the web developers.

Unfortunately, your preferred link https://www.uniprot.org/uniprot/PRO_0000449647 might never work for UniProt because "PRO" IDs are part of the protein record and not a separate entity, and not indexed as such. They simply provide you with extra information.

We can discuss changes we can make on both sides to make this data public ASAP without changing too much in our pipelines. The UniProtKB syntax from db-xrefs.yaml for now is:

idsyntax: ([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z]([0-9][A-Z][A-Z0-9]{2}){1,2}[0-9])((-[0-9]+)|:PRO_[0-9]{10}|:VAR_[0-9]{6}){0,1}
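A quick way to see what such a pattern accepts is to run the IDs from this thread through it. The Python sketch below assumes the pattern reads with a bracketed character class `[A-NR-Z]` followed by a repeated group, and with an underscore in `PRO_`; that is a reading of the quoted syntax, not a verified copy of db-xrefs.yaml, and `is_valid_local_id` is a hypothetical helper.

```python
import re

# Assumed reading of the idsyntax pattern (see the note above).
UNIPROT_LOCAL_ID = re.compile(
    r"([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z]([0-9][A-Z][A-Z0-9]{2}){1,2}[0-9])"
    r"((-[0-9]+)|:PRO_[0-9]{10}|:VAR_[0-9]{6}){0,1}"
)


def is_valid_local_id(local_id):
    """True when the whole local ID matches the pattern."""
    return UNIPROT_LOCAL_ID.fullmatch(local_id) is not None


for candidate in [
    "P0DTC2",
    "P0DTC2:PRO_0000449647",
    "P0DTC2-PRO_0000449647",
    "PRO_0000449647",
]:
    print(candidate, is_valid_local_id(candidate))
```

Under this reading, switching the separator from ":" to "-" or dropping the accession would both require a change to the registered syntax.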

cmungall commented 4 years ago

I'd recommend against the use of ':' within the local identifier. Is it possible to use something else, e.g. a dash or an underscore?

pmasson55 commented 4 years ago

Are the viral SARS/SARS2 genomes available for GO-CAM? I tried UniProtKB:P0DTC2 and it seems that it doesn't recognize it.

cmungall commented 4 years ago

Update:

we want to annotate two genomes, @alexsign can we also include a GPI for SARS-CoV, as well as SARS-CoV-2.

polyproteins: while we don't like annotating to the chain IDs we will do this as a temporary measure, until these are made into bona-fide entries. Can you go ahead and add this to the GPIs @alexsign? Thanks! Use whatever separator is easiest but my preference is still for an alternative to :

alexsign commented 4 years ago

@cmungall The reference proteome data for SARS-CoV is publicly available here: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/55962.H_SARS_coronavirus.goa As for the GPI and GPAD, would you like the reference proteome or everything for tax_id:694009?

Not so long ago I created the following for @pgaudet: ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_virus_bacteria.gpi Running grep "taxon:694009" uniprot_reviewed_virus_bacteria.gpi will do the trick, unless you want combined SARS-CoV* files posted somewhere.
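For anyone scripting this rather than using grep, a minimal Python sketch follows. It compares whole tab-separated columns, so a longer taxon ID that merely starts with the same digits is not picked up by accident. It assumes tab-separated GPI rows and "!" header lines; `rows_for_taxon` and the sample rows are made up for the example.

```python
def rows_for_taxon(gpi_lines, taxon):
    """Keep GPI rows in which one whole column equals the given taxon string."""
    selected = []
    for line in gpi_lines:
        if line.startswith("!"):
            continue  # header or comment line
        if taxon in line.rstrip("\n").split("\t"):
            selected.append(line)
    return selected


# Illustrative rows only; X1 and X2 are not real accessions.
sample = [
    "!gpi-version: 1.2",
    "UniProtKB\tX1\ta\tname a\t\tprotein\ttaxon:694009\t",
    "UniProtKB\tX2\tb\tname b\t\tprotein\ttaxon:6940091\t",
]
print(rows_for_taxon(sample, "taxon:694009"))
```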

As for the separator, are there any objections to "#"? I'll investigate what can be done here with the GOA pipeline, our collaborators and users.

cmungall commented 4 years ago

@kltm: @pmasson55 is asking when the entries in the GPI will be available for annotation

kltm commented 4 years ago

@cmungall @pmasson55 These should now be available in the autocomplete for Noctua.

lpalbou commented 4 years ago

> @cmungall @pmasson55 These should now be available in the autocomplete for Noctua.

Since when ? Patrick tried today and it wasn’t available.

lpalbou commented 4 years ago

I just checked and I think there are several issues:

- the search only works by ID (e.g. P0DTD2), not by label
- not all IDs provided in the covid GPI have been ingested! P0DTC7 is not working, for instance.

Only 5 of the 13 IDs provided are available in the autocomplete:

cmungall commented 4 years ago

P0DTD2 is the label. See the gpi.

$ curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi  | grep P0DTD2 | cut -f3
P0DTD2

cmungall commented 4 years ago

> P0DTC7 is not working for instance.

I see it

kltm commented 4 years ago

We got them in late last week. All entities provided in the GPI are in the system, but the reality of how some of that reflects in the system appears to be a little lacking. If you put scov2 into the search, they should all be coming up. Some of the entities have labels like "6", which is not super helpful. They can be seen here http://noctua-amigo.berkeleybop.org/amigo/search/ontology by searching for scov2 and removing the "go" filter.

cmungall commented 4 years ago

Correct, everything is working as expected, the entries in the GPI file are available for search.

I agree some of the symbols/labels are a bit suboptimal

@alexsign why is '6' used as the symbol? I believe the more conventional symbol is nsp6. This is what NCBI uses: https://www.ncbi.nlm.nih.gov/protein/YP_009725302.1

This also aligns with what we use on http://www.geneontology.xyz/covid-19.html

cmungall commented 4 years ago

Here is a suggestion - how would we feel about using PR for identifiers, at least as an interim measure?

They have a GPI here:

https://proconsortium.org/download/development/pro_sars2.gpi

Also obo:

https://proconsortium.org/download/development/pro_sars2.obo

lpalbou commented 4 years ago

@cmungall did you test on noctua.geneontology.org? I tried on both Safari and Chrome on two different computers and my iPad, with no result:

Noctua Form:

Graph Editor:

The autocomplete with P0DT still only yields 5 results.

> P0DTD2 is the label. See the gpi.

Meant the name "Protein 9b" in this case. "Scov2" was not in the GPI.

> If you put scov2 into the search, they should all be coming up

I can find 10 / 11 IDs on noctua.geneontology.org using "Scov2", so @pmasson55 you can use that workaround for the moment to annotate most virus genes. @kltm :

we really need to investigate golr search queries, it should definitely find the ids
I have asked @tmushayahama to increase the results shown in the noctua form autocomplete to 50 (I am guessing that's why the 11th is not showing): https://github.com/geneontology/noctua-form/issues/87
@kltm could you also increase the limit of results shown to 50 in noctua graph ? It's been a long standing issue affecting a lot of curators that could not find their gene. It's also probably why you are only showing 10/11 IDs using "Scov2" keyword
also, where did that "Scov2" come from, and how would a user know about it? It's not in the GPI
even with that trick, P0DTC7 is still missing, at least on noctua.geneontology.org

As a side note, those IDs don't seem to be loaded in golr-aux.geneontology.io and golr.geneontology.io:

golr.geneontology.io

golr-aux.geneontology.io

If that's the case, it's unclear to me why we would have a partial release, and I don't think there is anything documenting that at the moment.

> Some of the entities have labels like "6", which is not super helpful.

Agreed. @alexsign @pmasson55 any chance of getting a better label? It would improve the search.

cmungall commented 4 years ago

Let's avoid overloading this ticket with unrelated issues. I chose the suffix. Note that all entries in neo have a species suffix; this is not new.

Sorry for brevity in reviews most of today

cmungall commented 4 years ago

The use of species suffixes to constrain search on gps is documented here: http://wiki.geneontology.org/index.php/Noctua#3.a._Enter_gene_product_or_macromolecular_complex_to_be_annotated

cmungall commented 4 years ago

> did you test on noctua.geneontology.org?

yep!

> I tried on both Safari and Chrome on two different computers and my iPad and no result:

I think there is a straightforward explanation. The behavior of AC in Form and GE differ. (I do not know why the GE AC code was not reused.)

Form does not search on the localId (the part of the ID after the ':', i.e col2 of the GPI). You have to AC on the symbol, which works:

In some cases, the GPI uses a localId like P0DTD2 as a symbol, so of course these show in AC.

GE searches on localId

To summarize, the system is working with the sars-cov-2 gpi in exactly the same way it works with other gpis.

Things can certainly be improved from a user-perspective. This includes making more standard and searchable labels (e.g nsp6, see my prev comments; also make "ORF9b" be the label instead of "P0DTD2"). Other improvements are outside the scope of this ticket (standardizing AC behavior across Noctua, making all AC search on localId).
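The difference between the two autocompletes can be illustrated with a toy model in Python. It is only a sketch of the behavior described in this comment (Form matches on the symbol, the graph editor on the localId), not the actual GOlr query logic. The three entries reuse accessions and symbols from the sars-cov-2 GPI; the function names are invented here.

```python
ENTRIES = [
    {"id": "UniProtKB:P0DTD2", "symbol": "P0DTD2"},
    {"id": "UniProtKB:P0DTC6", "symbol": "6"},
    {"id": "UniProtKB:P0DTC7", "symbol": "7a"},
]


def local_id(curie):
    """Return the part of the ID after the first ':'."""
    return curie.split(":", 1)[1]


def form_autocomplete(query, entries):
    """Toy model of the Form: prefix match on the symbol only."""
    q = query.lower()
    return [e["id"] for e in entries if e["symbol"].lower().startswith(q)]


def graph_editor_autocomplete(query, entries):
    """Toy model of the graph editor: prefix match on the localId only."""
    q = query.lower()
    return [e["id"] for e in entries if local_id(e["id"]).lower().startswith(q)]


print(form_autocomplete("P0DT", ENTRIES))
print(graph_editor_autocomplete("P0DT", ENTRIES))
```

In this model, typing an accession prefix into the Form only finds entries whose symbol happens to be the accession, which matches the partial results reported above.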

cmungall commented 4 years ago

Sorry this has become one of those tickets with multiple hard to follow threads.

This is addressed at @alexsign

Can we go ahead with including the polyproteins in the GPI, using the uniprot PRO chain IDs
I would like to use - as the internal separator in the localId, as this is what IntAct uses.

For example, the nsp11 protein has the local ID P0DTC1-PRO_0000449645 here:

           <interactor id="3674141">
                <names>
                    <shortLabel>nsp11_wcpv</shortLabel>
                    <fullName>Non-structural protein 11</fullName>
                </names>
                <xref>
                    <primaryRef db="uniprotkb" dbAc="MI:0486" id="P0DTC1-PRO_0000449645" refType="identity" refTypeAc="MI:0356"/>
                    <secondaryRef db="intact" dbAc="MI:0469" id="EBI-25496252" refType="chain-parent" refTypeAc="MI:0951"/>
                    <secondaryRef db="intact" dbAc="MI:0469" id="EBI-25475882" refType="identity" refTypeAc="MI:0356"/>
                </xref>

kltm commented 4 years ago

There seems to be a little confusion with the Noctua GOlr load contents and how those are getting exposed. Hopefully we can start sorting some of that out, but for the time being, we have a new release of NEO that's been loaded into the Noctua GOlr that should have the local ids available as synonyms, which should hopefully correct some of the behaviors above. Tagging @lpalbou @cmungall

alexsign commented 4 years ago

@cmungall @lpalbou 6 is what I have in the database. You can also see it here: https://covid-19.uniprot.org/uniprotkb/P0DTC6

alexsign commented 4 years ago

@cmungall ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi and ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa are updated with ACCESSION-PRO... identifiers

cmungall commented 4 years ago

Thanks! Can you also fill in the parent field (see PRO file for reference)

On Wed, Apr 8, 2020 at 7:28 AM Alex Ignatchenko notifications@github.com wrote:

> @cmungall ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa updated with ACCESSION-PRO... identifiers


alexsign commented 4 years ago

@cmungall the uniprot_sars-cov-2.gpi updated with parent_object_id

cmungall commented 4 years ago

Thanks! Can you remove the self-parents?

On Wed, Apr 8, 2020 at 8:28 AM Alex Ignatchenko notifications@github.com wrote:

> @cmungall the uniprot_sars-cov-2.gpi updated with parent_object_id


alexsign commented 4 years ago

@cmungall not sure I understand correctly. Do you want to remove parent_object_id from records with no PRO... IDs?

cmungall commented 4 years ago

Yes, e.g this one:

UniProtKB P0DTC1 P0DTC1 Replicase polyprotein 1a protein taxon:2697049 UniProtKB:P0DTC1
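In other words, a record should only carry a parent when the parent is a different entity. A minimal Python sketch of that cleanup follows. It assumes tab-separated rows with the parent in the eighth column, as in the rows shown in this thread; `PARENT_COL` and `drop_self_parent` are names invented for the example.

```python
PARENT_COL = 7  # assumed 0-based position of the parent column


def drop_self_parent(row):
    """Blank the parent column when it just repeats the row's own DB:ID."""
    cols = row.split("\t")
    if len(cols) > PARENT_COL and cols[PARENT_COL] == f"{cols[0]}:{cols[1]}":
        cols[PARENT_COL] = ""
    return "\t".join(cols)


polyprotein = "UniProtKB\tP0DTC1\tP0DTC1\tReplicase polyprotein 1a\t\tprotein\ttaxon:2697049\tUniProtKB:P0DTC1"
chain = "UniProtKB\tP0DTC2-PRO_0000449647\tS\tSpike glycoprotein\tS|2\tprotein\ttaxon:2697049\tUniProtKB:P0DTC2"

print(drop_self_parent(polyprotein))  # parent column emptied
print(drop_self_parent(chain))        # unchanged: the parent is a different entity
```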

On Wed, Apr 8, 2020 at 9:16 AM Alex Ignatchenko notifications@github.com wrote:

> @cmungall not sure if I understand right. do you want to remove parent_object_id from record with no PRO... ids?


alexsign commented 4 years ago

@cmungall the file is updated now

kltm commented 4 years ago

@alexsign Apologies, but could you make the GAF (ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf) available as a .gz, like the other data products we get from you?

cmungall commented 4 years ago

Reviewing where we are at here.

This is what uniprot is providing for sars-cov-2:

UniProtKB       P0DTC6  6       Non-structural protein 6        6       protein taxon:2697049                   
UniProtKB       A0A663DJA2      ORF10   ORF10 protein   ORF10   protein taxon:2697049                   
UniProtKB       P0DTC9  N       Nucleoprotein   N       protein taxon:2697049   
UniProtKB       P0DTD3  ORF14   Uncharacterized protein 14      ORF14   protein taxon:2697049                   
UniProtKB       P0DTD2  P0DTD2  Protein 9b              protein taxon:2697049   
UniProtKB       P0DTC7  7a      Protein 7a      7a      protein taxon:2697049   
UniProtKB       P0DTC2  S       Spike glycoprotein      S|2     protein taxon:2697049                   
UniProtKB       P0DTC4  E       Envelope small membrane protein E|4     protein taxon:2697049                   
UniProtKB       P0DTD1  rep     Replicase polyprotein 1ab       rep|1a-1b       protein taxon:2697049                   
UniProtKB       P0DTC5  P0DTC5  Membrane protein                protein taxon:2697049                   
UniProtKB       P0DTC1  P0DTC1  Replicase polyprotein 1a                protein taxon:2697049                   
UniProtKB       P0DTD8  P0DTD8  Protein non-structural 7b               protein taxon:2697049                   
UniProtKB       P0DTC8  P0DTC8  Non-structural protein 8                protein taxon:2697049                   
UniProtKB       P0DTC3  3a      Protein 3a      3a      protein taxon:2697049   
UniProtKB       P0DTC2-PRO_0000449647   S       Spike glycoprotein      S|2     protein taxon:2697049   UniProtKB:P0DTC2

These are all available for autocomplete in Noctua

Some things that can be improved:

Do not use the accession in the symbol field. For example, for "Membrane protein", replace col3 P0DTC5 with "M" or "M protein"
Use more standard symbols in column 3. Instead of "6", use "nsp6"
Ideally symbol should be unique. We have two entries labeled "S"
@pmasson55 Is this complete? I see only one chain ID in there, P0DTC2-PRO_0000449647.

Also, @alexsign can we have this for SARS-CoV as well?
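Two of the checks listed above (symbol equal to the accession, and non-unique symbols) are easy to automate. The Python sketch below works on (localId, symbol) pairs taken from the rows quoted in this comment; `lint_gpi` and its messages are hypothetical, and judging whether a symbol such as "6" is conventional still needs a curator.

```python
from collections import Counter


def lint_gpi(rows):
    """Flag symbols that repeat the local ID or are shared by several rows.

    rows is a list of (local_id, symbol) pairs.
    """
    counts = Counter(symbol for _, symbol in rows)
    problems = []
    for local_id, symbol in rows:
        if symbol == local_id:
            problems.append((local_id, "symbol repeats the accession"))
        if counts[symbol] > 1:
            problems.append((local_id, "symbol is not unique"))
    return problems


rows = [
    ("P0DTC5", "P0DTC5"),
    ("P0DTC2", "S"),
    ("P0DTC2-PRO_0000449647", "S"),
    ("P0DTC6", "6"),
]
for local_id, message in lint_gpi(rows):
    print(local_id, message)
```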

geneontology / go-site

Add available coronavirus data to the pipeline #1431