Closed kltm closed 2 years ago
@kltm I thought you were filtering IBAs coming from upstreams?
Is this for Ecocyc to fix, or for PAINT?
@pgaudet Filtering IBAs is what happens for the "main" GO pipeline as part of applying the GO rules, not for this NEO pipeline, which essentially just takes a set of files, OBO-ifies them, and turns them into an ontology for the autocomplete to run off of--there are no rules or filters run on this input. NEO is all annotatable entities, so we likely do not want to filter things for violations in the same way that the "main" pipeline does.
Since this was introduced recently and seems to be an actual issue, I feel that this is something that we'd want the upstream to take care of (unsure if this would be in their processing or in PAINT). If necessary, we could start trying to filter things, but I'd be rather uneasy about that.
Alternatively, if ecocyc had a GPI available, we could switch over to that (essentially what we're doing by wringing out the GAF).
@pgaudet Just wanted to follow up on this in a little more detail. The actual issue here is identifier collision and what owltools is doing with the OBO, not anything in the GAF directly, so IBA or not doesn't really matter. The problematic stanza in the ecocyc OBO (not really, see below) is:
[Term]
id: UniProtKB:P17846
name: cysI ecocyc
synonym: "sulfite reductase hemoprotein subunit ecocyc" EXACT []
synonym: "cysI" BROAD [NCBITaxon:83333]
is_a: CHEBI:33695 ! information biomacromolecule
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct
relationship: in_taxon NCBITaxon:83333
What I think might actually be going on here is that there is a conflict with incompatible tags (name) appearing in neo-uniprot_reviewed_virus_bacteria.obo as well:
id: UniProtKB:P17846
name: cysI NCBITaxon:83333
synonym: "cysI" BROAD []
synonym: "cysI" RELATED []
synonym: "JW2733" RELATED []
synonym: "b2763" RELATED []
synonym: "P17846" RELATED []
is_a: CHEBI:36080 ! protein
relationship: in_taxon NCBITaxon:83333
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein
From this, I think the solution is to drop either file. Doing a little experimentation (below), I think that there may be thousands of other issues in using both of these files at the same time that we just haven't had the chance to run into yet.
Find identifier intersection:
grep "id: " neo-uniprot_reviewed_virus_bacteria.obo | sort > /tmp/ids_rev.txt
grep "id: " neo-ecocyc.obo | sort > /tmp/ids_eco.txt
comm -12 /tmp/ids_eco.txt /tmp/ids_rev.txt | wc -l
3895
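One caveat with the commands above: `grep "id: "` is unanchored, so it would also match tags such as `alt_id: ` if a file happens to contain them. A minimal sketch of an anchored variant follows; the function name `shared_ids` is hypothetical, and counts from it could differ from the numbers reported in this thread.

```shell
# Hypothetical helper: print the "id: ..." lines shared by two OBO files.
shared_ids() {
    _a=$(mktemp) || return 1
    _b=$(mktemp) || return 1
    # Anchor on "^id: " so tags such as "alt_id: " are not picked up;
    # sort -u so an ID repeated within one file counts once.
    grep '^id: ' "$1" | LC_ALL=C sort -u > "$_a"
    grep '^id: ' "$2" | LC_ALL=C sort -u > "$_b"
    # comm -12 keeps only the lines present in both sorted inputs.
    LC_ALL=C comm -12 "$_a" "$_b"
    rm -f "$_a" "$_b"
}
```

Usage would look like `shared_ids neo-ecocyc.obo neo-uniprot_reviewed_virus_bacteria.obo | wc -l`.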
Also tagging in @cmungall , as we are now getting back into looking at (removal|inclusion|filtering|merging) sources. Assuming that I'm reading this right, this may just be an extension of #77 .
let's just drop the ecocyc GPI from neo
Removed (noting that ecocyc was a GAF, not a GPI). Now testing.
@cmungall I think you can see where this is going...
grep "id: " neo-goa_sars-cov-2.obo | sort > /tmp/ids_cov.txt
comm -12 /tmp/ids_rev.txt /tmp/ids_cov.txt | wc -l
14
The shared IDs between neo-goa_sars-cov-2.obo and neo-uniprot_reviewed_virus_bacteria.obo are:
id: UniProtKB:P0DTC1
id: UniProtKB:P0DTC2
id: UniProtKB:P0DTC3
id: UniProtKB:P0DTC4
id: UniProtKB:P0DTC5
id: UniProtKB:P0DTC6
id: UniProtKB:P0DTC7
id: UniProtKB:P0DTC8
id: UniProtKB:P0DTC9
id: UniProtKB:P0DTD1
id: UniProtKB:P0DTD2
id: UniProtKB:P0DTD3
id: UniProtKB:P0DTD8
Our source for sars-cov-2 is https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi. Should we just go in and pop those out, ask uniprot upstream to remove, or something else?
We have a separate ticket on that one. I still think the hand curated GPI that Marcin did is better for curators but if the virus group is happy to switch, and has conventions to magically choose the right protein I'm OK.
I believe you're referring to https://github.com/geneontology/go-site/issues/1431 ?
@cmungall Specifically from there https://github.com/geneontology/go-site/issues/1431#issuecomment-650388423 . Okay, as a simple workaround for the moment, should we edit the hand-curated kg-covid file to remove those 14 items and revisit this later on as part of https://github.com/geneontology/go-site/issues/1431 ?
@cmungall For example https://github.com/Knowledge-Graph-Hub/kg-covid-19/pull/440 (feel free to close). Basically, it's all the "normal" IDs in that file.
no editing of the hand curated file, it's good, and it's used elsewhere
Just remove it from the load for now, we can revisit later, just let Patrick know when it's done
From thread with @cmungall and @pgaudet: switching to uniprot_reviewed_virus_bacteria over kg-covid.
For SARS-CoV-2 we'd like to keep the old file, as fixed by @cmungall.
Is that OK? Where do we specify this? Should we create a virus.yaml file for this (and other viruses that we might fix in the future)?
Thanks, Pascale
Okay, to be honest, I'm a little confused about the current state of desires here.
As of this moment, we are loading: sgd pombase mgi zfin rgd dictybase fb tair wb goa_human goa_human_complex goa_human_rna goa_human_isoform goa_pig xenbase pseudocap ecocyc
What file is to be loaded in addition to this? And this file has been fixed so that it no longer collides with what we're currently loading?
Currently not a problem anymore if we don't load the viruses and bacteria-reviewed file (#77)
The NEO pipeline is now failing with the following error:
"multiple name tags not allowed"
Originating error:
10:49:24 Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:P17846 id( UniProtKB:P17846)synonym( cysI RELATED)synonym( cysI BROAD)synonym( P17846 RELATED)synonym( b2763 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct)name( cysI NCBITaxon:83333)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)synonym( cysI BROAD[NCBITaxon:83333 ])name( cysI ecocyc)synonym( sulfite reductase hemoprotein subunit ecocyc EXACT)synonym( JW2733 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:83333)is_a( CHEBI:33695))
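The frame in the error above carries two name values (`cysI NCBITaxon:83333` and `cysI ecocyc`) under a single ID. A rough sketch for listing such IDs across a set of OBO files before they are handed to owltools is below; the function name `multi_name_ids` is hypothetical, and the sketch assumes each stanza puts its `id:` line before its `name:` line, as in the stanzas quoted in this thread.

```shell
# Hypothetical helper: list IDs that carry more than one distinct name
# across the given OBO files.
multi_name_ids() {
    awk '
        /^\[/   { id = "" }               # new stanza: forget the previous ID
        /^id: / { id = substr($0, 5) }    # text after "id: "
        /^name: / && id != "" {
            key = id SUBSEP substr($0, 7) # text after "name: "
            if (!(key in seen)) { seen[key] = 1; count[id]++ }
        }
        END { for (i in count) if (count[i] > 1) print i }
    ' "$@" | LC_ALL=C sort
}
```

Usage would look like `multi_name_ids neo-ecocyc.obo neo-uniprot_reviewed_virus_bacteria.obo`. IDs that appear in several files with the same name are not reported, since only distinct name values are counted.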
In gene_association.ecocyc.gz, the triggering line seems to be:
Tagging @pgaudet @vanaukenk @balhoff