Open cmungall opened 2 years ago
btw, kudos to @balhoff and his awesome ubergraph https://github.com/INCATools/ubergraph which made this report much easier!
this is the query I used - autogenerated so it's a bit ugly:
SELECT ?v0 ?v1 ?v2 ?v3 ?v4 ?v5 WHERE {
  # ?v2 is a conjugate base (one or more steps) of ?v3
  GRAPH <http://reasoner.renci.org/nonredundant> {
    ?v2 <http://purl.obolibrary.org/obo/chebi#is_conjugate_base_of>+ ?v3
  } .
  # ?v0 is an annotated value on ?v2 whose xref is "UniProt"
  ?v6 <http://www.w3.org/2002/07/owl#annotatedSource> ?v2 .
  ?v6 <http://www.w3.org/2002/07/owl#annotatedProperty> ?v7 .
  ?v6 <http://www.w3.org/2002/07/owl#annotatedTarget> ?v0 .
  ?v6 <http://www.geneontology.org/formats/oboInOwl#hasDbXref> ?v8 .
  ?v6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Axiom> .
  FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> != <http://www.w3.org/2002/07/owl#annotatedProperty>) .
  FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> != <http://www.w3.org/2002/07/owl#annotatedSource>) .
  FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> != <http://www.w3.org/2002/07/owl#annotatedTarget>) .
  FILTER NOT EXISTS {
    FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> = <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>) .
    FILTER (?v8 = <http://www.w3.org/2002/07/owl#Axiom>)
  } .
  FILTER (STR(?v8) = "UniProt") .
  # ?v3 is a subclass of ?v5
  ?v3 <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?v5 .
  # ?v4 is a conjugate base (one or more steps) of ?v5
  GRAPH <http://reasoner.renci.org/nonredundant> {
    ?v4 <http://purl.obolibrary.org/obo/chebi#is_conjugate_base_of>+ ?v5
  } .
  # ?v1 is an annotated value on ?v4 whose xref is "UniProt"
  ?v9 <http://www.w3.org/2002/07/owl#annotatedSource> ?v4 .
  ?v9 <http://www.w3.org/2002/07/owl#annotatedProperty> ?v10 .
  ?v9 <http://www.w3.org/2002/07/owl#annotatedTarget> ?v1 .
  ?v9 <http://www.geneontology.org/formats/oboInOwl#hasDbXref> ?v11 .
  ?v9 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Axiom> .
  FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> != <http://www.w3.org/2002/07/owl#annotatedProperty>) .
  FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> != <http://www.w3.org/2002/07/owl#annotatedSource>) .
  FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> != <http://www.w3.org/2002/07/owl#annotatedTarget>) .
  FILTER NOT EXISTS {
    FILTER (<http://www.geneontology.org/formats/oboInOwl#hasDbXref> = <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>) .
    FILTER (?v11 = <http://www.w3.org/2002/07/owl#Axiom>)
  } .
  FILTER (STR(?v11) = "UniProt") .
  # report the pair only when ?v2 is not already a subclass of ?v4
  FILTER NOT EXISTS {
    ?v2 <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?v4
  }
}
That is awesome!
Hi @cmungall! May I ask a naive question? Why don't you use the chebi_pH7_3_mapping.tsv file (https://ftp.expasy.org/databases/rhea/tsv/chebi_pH7_3_mapping.tsv) which does the job and is provided in the Rhea distribution?
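For readers who want to try the mapping file, here is a minimal Python sketch of loading it into a dictionary. The column names `CHEBI`, `CHEBI_PH7_3` and `ORIGIN`, the value `computation`, and the function name `load_ph73_mapping` are assumptions made for illustration; check the header of the downloaded file and pass the actual column names if they differ. The rows below are placeholders, not real ChEBI content.

```python
import csv
import io


def load_ph73_mapping(handle, source_col="CHEBI", target_col="CHEBI_PH7_3"):
    """Map each ID in source_col to the ID in target_col.

    Expects a tab-separated file with a header row. The default column
    names are assumptions, not confirmed by this thread.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[source_col]: row[target_col] for row in reader}


# Placeholder rows (hypothetical IDs) standing in for the downloaded file.
sample = io.StringIO(
    "CHEBI\tCHEBI_PH7_3\tORIGIN\n"
    "1001\t1002\tcomputation\n"
    "1002\t1002\tcomputation\n"
)
mapping = load_ph73_mapping(sample)
print(mapping["1001"])
```

With a real download, replace `sample` with an open file handle, for example `open("chebi_pH7_3_mapping.tsv", newline="")`.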
If you are starting from scratch, in addition to the is_conjugate_acid_of/is_conjugate_base_of relationships, you must also use the is_tautomer_of relationships. But in any case, none of these relationships can be used to determine microspecies at pH 7.3.
We compute the major microspecies at pH 7.3 using ChemAxon software. We are currently using a very old version of the plugin (chemaxon-marvinbeans-all-6.2.0) but we are in the process of moving forward to a new version. @muthuvenkat will be in charge of this development (it is a high priority task for us). The updated plugin provides different results/predictions. As a consequence, we will have to submit new ChEBI compounds and update the Rhea reactions accordingly.
We recently had a similar discussion with @deustp01 who asked the following question:
That leads to a specific question: what strategy do you use to decide on a charge state in cases like these – and also a more general one: is there a practical way to align practices among RHEA, ChEBI, Reactome and GO?
Our answer:
RHEA/UniProt, Reactome and GO biocurators could download and use the same version of MarvinSketch as the one used by the Swiss-Prot group in its UniProt/Rhea/ChEBI development code. By using MarvinSketch (https://chemaxon.com/products/marvin), which is free for individual, academic and non-commercial use, biocurators can compute the major microspecies at pH 7.3 (with defined parameters), then check whether the 2D structure already exists in ChEBI and, if not, submit it using the ChEBI submission tool (https://www.ebi.ac.uk/chebi/submission). Biocurators who are not familiar with these tools can also ask Rhea curators to do the job using the Rhea feedback form (https://www.rhea-db.org/feedback).
Hope it helps!
We recently had a similar discussion with @deustp01
... triggered by our realizations that we need an authoritative single source of truth for predominant forms of ionizable molecules at pH 7.3, and that some combination of Rhea and ChEBI (unlike us) would have the expertise to provide this truth. Even if the authority uses someone else's program, like MarvinSketch, there is still a good argument for having a single authority rather than relying on all of us to maintain the same version of the program, implement it in the same way so as to guarantee identical outcomes in all cases, and interpret edge cases correctly. (This is an a priori opinion about software packages - I don't know this program so it's possible that it is robust enough to avoid these concerns.)
Hi Anne,
This issue is not about selecting the 7.3 IDs. I am assuming that is a solved problem for the purposes here, and that the 7.3 form largely corresponds to the CHEBI IDs that have UniProt (biologist-friendly) synonyms.
The challenge this issue seeks to address is the fact that the 7.3 branch of CHEBI is incomplete with respect to is-a links. Unless I have made a mistake, there are 1650 missing is-a links. This means anyone using CHEBI for classification will get highly incomplete results.
Here is an example visualized. I think there should be an is-a link between the two pink nodes, based on the fact there is an is-a link (blank) between their two conjugate-base-of siblings:
(green CBO = conjugate base of)
See also: http://purl.obolibrary.org/obo/CHEBI_143089 which shows the chain length
and pubchem agrees: https://pubchem.ncbi.nlm.nih.gov/compound/10-Hydroxystearic-acid "It is a hydroxy fatty acid and a long-chain fatty acid."
(interestingly they attribute this to CHEBI so perhaps they are doing their own inference over the conjugate bases)
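The inference described in this comment can be sketched in Python over a toy graph. This is a simplified illustration, not the query used for the report: it assumes one conjugate acid per base, follows a single conjugate step, uses transitive is-a on both sides, and ignores the UniProt synonym filter. All node names and the function names `ancestors` and `missing_isa_links` are hypothetical.

```python
def ancestors(node, parents):
    """Return all transitive is-a ancestors of node."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def missing_isa_links(conjugate_base_of, parents):
    """Suggest base-to-base is-a links implied by the acids' hierarchy.

    conjugate_base_of maps each base to its conjugate acid (assumed 1:1).
    A pair (base_child, base_parent) is suggested when the acid of
    base_child is-a the acid of base_parent, but base_child has no
    is-a path to base_parent.
    """
    acid_to_base = {acid: base for base, acid in conjugate_base_of.items()}
    suggestions = set()
    for base_child, acid_child in conjugate_base_of.items():
        for acid_parent in ancestors(acid_child, parents):
            base_parent = acid_to_base.get(acid_parent)
            if base_parent is None or base_parent == base_child:
                continue
            if base_parent not in ancestors(base_child, parents):
                suggestions.add((base_child, base_parent))
    return sorted(suggestions)


# Toy data: the acids are linked by is-a, their conjugate bases are not.
conjugate_base_of = {"child_anion": "child_acid", "parent_anion": "parent_acid"}
parents = {"child_acid": ["parent_acid"]}
print(missing_isa_links(conjugate_base_of, parents))
```

Once `child_anion` is given `parent_anion` as a parent, the same call returns an empty list, which is the behaviour a QC check would want.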
Sorry for my misunderstanding ;-)
Yes, there are many missing relationships in ChEBI, at least 1,650 according to your analysis :-( So this question is for the ChEBI curators (@amalik01): how should we deal with these missing links and do a systematic analysis to make the classification more robust? For UniProt/Rhea, @parit indexes the parent/compound classes for all protonation states (thus bypassing some missing relationships).
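One way to picture indexing parent classes across all protonation states is the Python sketch below. It is only an assumption about what such post-processing could look like, not a description of the actual UniProt/Rhea implementation; the names `protonation_group` and `pooled_ancestors` and the toy nodes are hypothetical.

```python
def ancestors(node, parents):
    """Return all transitive is-a ancestors of node."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def protonation_group(node, conjugate_pairs):
    """Return every form reachable from node via conjugate acid/base links."""
    neighbours = {}
    for a, b in conjugate_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    group = {node}
    stack = [node]
    while stack:
        for other in neighbours.get(stack.pop(), ()):
            if other not in group:
                group.add(other)
                stack.append(other)
    return group


def pooled_ancestors(node, conjugate_pairs, parents):
    """Union of the is-a ancestors of all protonation states of node."""
    pooled = set()
    for form in protonation_group(node, conjugate_pairs):
        pooled |= ancestors(form, parents)
    return pooled


# Toy data: only the acid form is classified, the anion has no parents.
conjugate_pairs = [("anion", "acid")]
parents = {"acid": ["acid_class"]}
print(pooled_ancestors("anion", conjugate_pairs, parents))
```

Here the anion picks up `acid_class` through its conjugate acid even though no is-a link exists on the anion itself, which is how such an index can bypass some missing relationships.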
no problem!
so it sounds like you/@parit are essentially doing a post-processing step to make a more complete hierarchy; functionally this is the same as the GCI approach in GO. It would be great if we didn't all do our own post-processing but could feed this back into CHEBI!
If the terms are already present in ChEBI, then I can always add the missing links if you send me a list of the missing is_a terms for the ChEBI IDs. However, to completely fix this issue we actually need to create a lot of new terms for the ionized structures: a lot of the higher terms are missing from the ChEBI hierarchy, since they are usually submitted by our users. I am not sure whether there is an automated way to create these missing terms; fixing it manually would be an extremely time-consuming process.
Hi @amalik01 here is the link again, see above for the format, happy to give it in a different format, also happy to help get you set up to run the sparql query as part of your normal QC process:
https://github.com/chemkg/chemrof/blob/master/chebi-scratch/conjugate-ph7-3-incomplete.tsv
I can also help you with automated approaches to find both missing links and missing named protonated forms. My preferred approach would be to do this as part of a general moving of CHEBI towards leveraging OWL and OWL definitions over the next few years, as this will be most sustainable, and other members of the OBO community can help. But if that is not feasible with your resources then maybe some canned QC queries would be best.
Is there anything else you need for this @amalik01 ?
@cmungall With regard to adding the missing links: it seems that I have to go through the ChEBI hierarchy to see where the missing links are for a lot of the ChEBI identifiers. Is there a simple way for you to send me the ChEBI ID of the parent entry and tell me which direct relationship is missing from this particular ChEBI ID (e.g. is a, is tautomer of, is conjugate acid of, is conjugate base of)? It would save me a lot of time.
Let's take the 4th example you gave in the table: 2,3-dihydroxybenzoate (CHEBI:36654) is missing an is_a relationship to aromatic carboxylate (CHEBI:91007). To add this relationship I need to go up the ChEBI hierarchy and add it at the parent entry: benzoates (CHEBI:22718) is_a aromatic carboxylate. Benzoates is also missing a conjugate base relationship to benzoic acids.
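A report could be reduced to the highest-level links a curator needs to add, along the lines of the Python sketch below. It uses the three ChEBI IDs from the example above, but the graph is simplified: it assumes CHEBI:36654 has CHEBI:22718 as a direct parent and that both missing links appear in the input list, neither of which is confirmed by the report. The function names are hypothetical.

```python
def ancestors(node, parents):
    """Return all transitive is-a ancestors of node."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_general_links(missing, parents):
    """Keep only missing links that no other missing link would imply.

    A link (child, parent) is dropped when some existing ancestor of
    child also has a missing link to the same parent: adding the link
    at the ancestor makes the lower one follow by transitivity.
    """
    missing = set(missing)
    keep = []
    for child, parent in sorted(missing):
        implied = any((anc, parent) in missing for anc in ancestors(child, parents))
        if not implied:
            keep.append((child, parent))
    return keep


# Simplified graph (assumption): 2,3-dihydroxybenzoate is_a benzoates.
parents = {"CHEBI:36654": ["CHEBI:22718"]}
# Illustrative input: both links to aromatic carboxylate are missing.
missing = [
    ("CHEBI:36654", "CHEBI:91007"),
    ("CHEBI:22718", "CHEBI:91007"),
]
for child, parent in most_general_links(missing, parents):
    print(f"{child} is_a {parent}")
```

The output is one line per link to add, in a child, relationship, parent layout, so each row names the entry to edit and the direct relationship it lacks.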
any progress on this?
GO ontology #27059 looks relevant. Not sure that it's progress - more like an independent encounter with the same missing-links problem.
See also ChEBI issue 3823 "Document use of has_major_microspecies_at_pH_7_3 relationship type and how to traverse is-a in CHEBI"
Background: CHEBI has terms for different protonated forms of acids and bases, but does not have grouping classes that conflate over protonation state. Many groups like GO and RHEA use terms in the pH 7.3 branch, and use the "uniprot" names.
Unfortunately, the parallel protonation state branches are often inconsistent and incomplete. This means that GO is forced to go through complex gymnastics to get complete inferences as described in https://pubmed.ncbi.nlm.nih.gov/23895341/. This complexity makes it hard for us and our users.
We would like to switch to using the pH 7.3 branch and essentially ignoring the parallel branches. Currently this results in many missing links. I have included these in the report below.
The report categorizes two variants of missing is-a links between 7.3 terms; one where the 7.3 pair is the conjugate base of a pair of is-a path pairs, and the other where the 7.3 pair is the conjugate acid:
question marks denote a missing is-a path.
Full report: https://github.com/chemkg/chemrof/blob/master/chebi-scratch/conjugate-ph7-3-incomplete.tsv
Here is a sample of the report (too large to include directly as an issue).
Here are the first two columns (note these are the UniProt names, which do NOT correspond to the primary labels):