geneontology / neo

noctua entity ontology

ID / name collisions in ecocyc and goa_sars-cov-2 against uniprot_reviewed_virus_bacteria causing problem in NEO pipeline #80

Closed kltm closed 2 years ago

kltm commented 2 years ago

The NEO pipeline is now failing with the following error:

"multiple name tags not allowed"

Originating error: 10:49:24 Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:P17846 id( UniProtKB:P17846)synonym( cysI RELATED)synonym( cysI BROAD)synonym( P17846 RELATED)synonym( b2763 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct)name( cysI NCBITaxon:83333)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)synonym( cysI BROAD[NCBITaxon:83333 ])name( cysI ecocyc)synonym( sulfite reductase hemoprotein subunit ecocyc EXACT)synonym( JW2733 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:83333)is_a( CHEBI:33695))

In gene_association.ecocyc.gz, the triggering line seems to be:

UniProtKB   P17846  cysI    part_of GO:0009337  PMID:21873635   IBA PANTHER:PTN001353165|SGD:S000003898|UniProtKB:P17846    C   "sulfite reductase, hemoprotein subunit"   cysI|b2763|ECK2758  protein taxon:83333  20200213    GO_Central      PR:P17846
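
For reference, a row like this can be pulled out of the gzipped GAF by matching on the DB Object ID column. This is only a minimal sketch, assuming a standard tab-separated GAF in which column 2 holds the accession and header/comment lines start with "!"; the function name gaf_rows_for_id is made up for illustration.

```shell
# Print every GAF row whose DB Object ID (column 2) equals the given accession.
# Hypothetical helper; assumes a gzipped, tab-separated GAF.
# Usage: gaf_rows_for_id gene_association.ecocyc.gz P17846
gaf_rows_for_id() {
  # gzip -dc decompresses to stdout; lines starting with "!" are GAF comments.
  gzip -dc "$1" | awk -F'\t' -v acc="$2" '!/^!/ && $2 == acc'
}
```

Matching on the exact column rather than grepping for the accession avoids hits where the same ID only appears in the With/From column of another row.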

Tagging @pgaudet @vanaukenk @balhoff

pgaudet commented 2 years ago

@kltm I thought you were filtering IBAs coming from upstreams?

Is this for Ecocyc to fix, or for PAINT?

kltm commented 2 years ago

@pgaudet Filtering IBAs is what happens for the "main" GO pipeline as part of applying the GO rules, not for this NEO pipeline, which essentially just takes a set of files, OBO-ifies them, and turns them into an ontology for the autocomplete to run off of--there are no rules or filters run on this input. NEO is all annotatable entities, so we likely do not want to filter things for violations in the same way that the "main" pipeline does.

Since this was introduced recently and seems to be an actual issue, I feel that this is something that we'd want the upstream to take care of (unsure if this would be in their processing or in PAINT). If necessary, we could start trying to filter things, but I'd be rather uneasy about that.

Alternatively, if ecocyc had a GPI available, we could switch over to that (essentially what we're doing by wringing out the GAF).

kltm commented 2 years ago

@pgaudet Just wanted to follow up on this in a little more detail. The actual issue here is an identifier collision and what owltools does with the OBO, not anything in the GAF directly, so whether or not the annotation is an IBA doesn't really matter. The apparently problematic stanza in the ecocyc OBO (not really the culprit, see below) is:

[Term]
id: UniProtKB:P17846
name: cysI ecocyc
synonym: "sulfite reductase hemoprotein subunit ecocyc" EXACT []
synonym: "cysI" BROAD [NCBITaxon:83333]
is_a: CHEBI:33695 ! information biomacromolecule
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct
relationship: in_taxon NCBITaxon:83333

What I think might actually be going on here is that there is a conflict with incompatible tags (name) appearing in neo-uniprot_reviewed_virus_bacteria.obo as well:

id: UniProtKB:P17846
name: cysI NCBITaxon:83333
synonym: "cysI" BROAD []
synonym: "cysI" RELATED []
synonym: "JW2733" RELATED []
synonym: "b2763" RELATED []
synonym: "P17846" RELATED []
is_a: CHEBI:36080 ! protein
relationship: in_taxon NCBITaxon:83333
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein
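
To compare the name tags that different OBO files carry for a single identifier, something like the following could be used. It is a rough sketch, not part of the NEO pipeline: the helper name names_for_id is hypothetical, and it assumes one tag per line with the name line appearing after the id line within a stanza.

```shell
# Print "FILE: name line" for the stanza with the given id in each OBO file.
# Hypothetical helper.
# Usage: names_for_id UniProtKB:P17846 neo-ecocyc.obo neo-uniprot_reviewed_virus_bacteria.obo
names_for_id() {
  target_id="$1"; shift
  for obo_file in "$@"; do
    awk -v id="$target_id" -v f="$obo_file" '
      /^\[/ || /^$/      { in_stanza = 0 }        # stanza header or blank line ends a stanza
      $0 == ("id: " id)  { in_stanza = 1; next }  # entering the stanza we care about
      /^id: /            { in_stanza = 0 }        # any other id line starts a different stanza
      in_stanza && /^name: / { print f ": " $0 }
    ' "$obo_file"
  done
}
```

If the same identifier yields different name lines from two files, merging those files would give that frame two name tags.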

From this, I think the solution is to drop one of the two files. Doing a little experimentation (below), I think that there may be thousands of other issues in using both of these files at the same time that we just haven't had the chance to run into yet.

Find identifier intersection:

grep "id: " neo-uniprot_reviewed_virus_bacteria.obo | sort > /tmp/ids_rev.txt
grep "id: " neo-ecocyc.obo | sort > /tmp/ids_eco.txt
comm -12 /tmp/ids_eco.txt  /tmp/ids_rev.txt | wc -l
3895

kltm commented 2 years ago

Also tagging in @cmungall, as we are now getting back into looking at (removal|inclusion|filtering|merging) of sources. Assuming that I'm reading this right, this may just be an extension of #77.

cmungall commented 2 years ago

let's just drop the ecocyc GPI from neo

kltm commented 2 years ago

Removed (noting that ecocyc was a GAF, not a GPI). Now testing.

kltm commented 2 years ago

@cmungall I think you can see where this is going...

grep "id: " neo-goa_sars-cov-2.obo | sort > /tmp/ids_cov.txt
comm -12 /tmp/ids_rev.txt /tmp/ids_cov.txt | wc -l
14

The shared IDs between neo-goa_sars-cov-2.obo and neo-uniprot_reviewed_virus_bacteria.obo are:

id: UniProtKB:P0DTC1
id: UniProtKB:P0DTC2
id: UniProtKB:P0DTC3
id: UniProtKB:P0DTC4
id: UniProtKB:P0DTC5
id: UniProtKB:P0DTC6
id: UniProtKB:P0DTC7
id: UniProtKB:P0DTC8
id: UniProtKB:P0DTC9
id: UniProtKB:P0DTD1
id: UniProtKB:P0DTD2
id: UniProtKB:P0DTD3
id: UniProtKB:P0DTD8
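
A listing like the one above can be produced by printing the intersection itself rather than counting it. The sketch below is a variation on the commands used earlier, not the exact commands that were run: the helper name shared_ids is hypothetical, and the pattern is anchored to the start of the line so that other tags containing "id: " (alt_id, for instance, if present) are not picked up.

```shell
# Print the id lines that occur in both OBO files, one per line.
# Hypothetical helper.
# Usage: shared_ids neo-goa_sars-cov-2.obo neo-uniprot_reviewed_virus_bacteria.obo
shared_ids() {
  ids_a=$(mktemp)
  ids_b=$(mktemp)
  # Anchor on "^id: " so only primary identifier lines are compared.
  grep '^id: ' "$1" | sort -u > "$ids_a"
  grep '^id: ' "$2" | sort -u > "$ids_b"
  # comm -12 keeps only lines common to both sorted inputs.
  comm -12 "$ids_a" "$ids_b"
  rm -f "$ids_a" "$ids_b"
}
```

Piping the result through wc -l gives the count of shared identifiers.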

Our source for sars-cov-2 is https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi. Should we just go in and pop those out, ask uniprot upstream to remove, or something else?

cmungall commented 2 years ago

We have a separate ticket on that one. I still think the hand-curated GPI that Marcin did is better for curators, but if the virus group is happy to switch, and has conventions to magically choose the right protein, I'm OK.

kltm commented 2 years ago

I believe you're referring to https://github.com/geneontology/go-site/issues/1431 ?

kltm commented 2 years ago

@cmungall Specifically from there https://github.com/geneontology/go-site/issues/1431#issuecomment-650388423 . Okay, as a simple workaround for the moment, should we edit the hand-curated kg-covid file to remove those 14 items and revisit this later on as part of https://github.com/geneontology/go-site/issues/1431 ?

kltm commented 2 years ago

@cmungall For example https://github.com/Knowledge-Graph-Hub/kg-covid-19/pull/440 (feel free to close). Basically, it's all the "normal" IDs in that file.

cmungall commented 2 years ago

no editing of the hand curated file, it's good, and it's used elsewhere

Just remove it from the load for now, we can revisit later, just let Patrick know when it's done

kltm commented 2 years ago

Following the thread with @cmungall and @pgaudet: switching to uniprot_reviewed_virus_bacteria over kg-covid.

pgaudet commented 2 years ago

For SARS-CoV-2 we'd like to keep the old file, as fixed by @cmungall.

Is that OK? Where do we specify this? Should we create a virus.yaml file for this (and other viruses that we might fix in the future)?

Thanks, Pascale

kltm commented 2 years ago

Okay, to be honest, I'm a little confused about the current state of desires here.

As of this moment, we are loading: sgd pombase mgi zfin rgd dictybase fb tair wb goa_human goa_human_complex goa_human_rna goa_human_isoform goa_pig xenbase pseudocap ecocyc

What file is to be loaded in addition to this? And has this file been fixed so that it no longer collides with what we're currently loading?

pgaudet commented 2 years ago

Currently not a problem anymore if we don't load the viruses and bacteria-reviewed file (#77)