ebi-chebi / ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
https://www.ebi.ac.uk/chebi
Creative Commons Attribution 4.0 International

Create an official biologist-friendly version of CHEBI that conflates protonation groups to a representative member #4482

Open cmungall opened 7 months ago

cmungall commented 7 months ago

Currently CHEBI has individual classes for protonated forms of chemical entities, but oddly does not create a grouping class for them. This would be like an anatomy ontology including terms for "left hand" and "right hand", but not "hand".

Groups like GO and RHEA have developed workarounds for this, where the form that is stable at pH 7.3 is taken as the representative. This needs to be documented:

(Note that RHEA goes further and essentially relabels the chemical entity to a protonation-agnostic term. So for example, lysinium(1+) is relabeled to lysine and L-glutamate(1-) is relabeled to L-glutamate.)

Unfortunately, this strategy is not sufficient for making a subset. To see why, look at the is-a ancestry of the representatives for Lysine and L-Glutamate:

[image: is-a ancestry of the representatives for Lysine and L-Glutamate]

The common ancestor of these terms is "polyatomic ion". There is no common amino acid parent. This is because there is no protonation-agnostic term for amino acid!

See these slides for more examples

Really what CHEBI should be doing is making new IDs that represent protonation-agnostic forms. These are analogous to terms like "hand" and "foot" in an ontology like Uberon. This would massively simplify the ontology and eliminate long-standing issues of inconsistent classification such as #4207.

In the absence of this, the next best thing is a CHEBI subset that collapses all protonation groups into a single representative, and fills in correct is-as, as if this were the representative.

In the above example, there would be one term for lysine and its conjugates, one term for glutamate and its conjugates, one term for amino acid and its conjugates, and one term for alpha amino acid and its conjugates, with simple is-as between them making a simple tree:

(omitting L and D forms for simplicity here but this is a whole other issue)

The algorithm for going from CHEBI to CHEBI-simple would be

  1. create a mapping between every CHEBI term and a ConjSet consisting of that term and everything reachable by the 2 conjugate predicates (many terms form degenerate sets of size 1; this is fine)
  2. select a representative for each ConjSet
    • use the pH 7.3 form if available (e.g. lysinium(1+))
    • if not available for that term, use the RHEA mapping
    • if not available for that term, use the generic acid form (e.g. amino acid)
  3. relabel the representative using UNIPROT_SYNONYM if available
  4. For each pair of ConjSets S1 and S2, if there exist members m1 in S1 and m2 in S2 such that m1 is-a m2, then make an is-a axiom `repr(S1) is-a repr(S2)`
  5. Discard all non-representative members
  6. remove all charge annotations on the representative members
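
To make the steps above concrete, here is a minimal sketch in Python. It is not how CHEBI, RHEA, or GO implement anything: the function names, the input shapes (plain sets, tuples, and dicts keyed by term label), and the toy data are all assumptions for illustration. The two conjugate predicates are treated as a single undirected edge list, the "RHEA mapping" is simplified to a set of preferred terms, and step 6 is omitted because the toy data carries no charge annotations.

```python
from collections import defaultdict


def conjugate_sets(terms, conjugate_edges):
    """Step 1: group terms into ConjSets (connected components, via union-find)."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in conjugate_edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for t in terms:
        groups[find(t)].add(t)
    return list(groups.values())


def choose_representative(conj_set, ph73, rhea, generic_acid):
    """Step 2: prefer the pH 7.3 form, then the RHEA mapping, then the generic acid."""
    for preferred in (ph73, rhea, generic_acid):
        hits = sorted(conj_set & preferred)
        if hits:
            return hits[0]
    # Fallback for sets matching none of the three rules (an assumption,
    # the proposal does not say what to do here).
    return sorted(conj_set)[0]


def simplify(terms, is_a, conjugate_edges, ph73, rhea, generic_acid, synonyms):
    repr_of = {}
    for conj_set in conjugate_sets(terms, conjugate_edges):
        rep = choose_representative(conj_set, ph73, rhea, generic_acid)
        for member in conj_set:
            repr_of[member] = rep

    # Step 3: relabel each representative with its synonym if one exists.
    labels = {rep: synonyms.get(rep, rep) for rep in set(repr_of.values())}

    # Steps 4 and 5: lift every member-level is-a to the representatives,
    # dropping self-loops; non-representative members never appear in the output.
    edges = {
        (repr_of[child], repr_of[parent])
        for child, parent in is_a
        if repr_of[child] != repr_of[parent]
    }
    return labels, edges


# Toy input (hypothetical, not taken from the CHEBI release files).
terms = {"lysine", "lysinium(1+)", "amino acid", "amino acid cation"}
is_a = {("lysine", "amino acid"), ("lysinium(1+)", "amino acid cation")}
conjugate_edges = [("lysine", "lysinium(1+)"), ("amino acid", "amino acid cation")]

labels, edges = simplify(
    terms,
    is_a,
    conjugate_edges,
    ph73={"lysinium(1+)"},
    rhea=set(),
    generic_acid={"amino acid"},
    synonyms={"lysinium(1+)": "lysine"},
)
```

On this toy input the four terms collapse to two representatives, and both member-level is-a edges lift to the same single edge between them, which is the "simple tree" behaviour described above.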

This will create a biologist-friendly version of CHEBI that will just work without any of the complicated procedures in place in RHEA or GO. It will look much more similar to the ontologies in use by MetaCyc and KEGG.

Note that this subset will still be problematic. It's like making an ontology like Uberon that has laterality-neutral terms like "hand" from an ontology that has laterality-only terms by arbitrarily picking left or right forms. It will include is-a relationships that are inconsistent with the source CHEBI file, where differently charged molecules are in SubClassOf relationships. But CHEBI already has these and no one seems to care (#4393).

Just to emphasize: this should never have been necessary. CHEBI is meant to be "Chemical entities of biological interest". No biologist ever asked for the current protonation strategy in CHEBI. Protonation-neutral forms should be first-class CHEBI IDs. However, in the absence of this, the "representative conjugate set member" approach is the best that can be done.

bpeters42 commented 7 months ago

I couldn't agree more with Chris' suggestions. This would be a straightforward way to start fixing some of the more glaring problems for biologists trying to use ChEBI. I am not adding anything new, but wanted to emphasize how important this is beyond a thumbs-up.

cmungall commented 6 months ago

@amalik01 I was wondering if you or other members of the CHEBI team have had time to consider this proposal?

amalik01 commented 6 months ago

@cmungall

I agree with you – there are a lot of inconsistencies in the way the ChEBI hierarchy is built for the neutral and ionised molecules.

At the database level, this would be difficult to do as we also have the conjugate acid/base relationships that exist between the neutral and ionised molecules. A quick solution would be to fill in the missing conjugate acid/base relationships between the terms higher up in the ontology.

If you and others just require a biologist-friendly OWL/OBO file laid out using the set of rules described, then this can be done and made available. I will discuss this with the ChEBI team this Thursday and get back to you. We may need to arrange a Zoom call with you to discuss further.

cmungall commented 6 months ago

Thanks! A call would be great

matentzn commented 3 months ago

Any chance of this happening?

cmungall commented 3 months ago

Call happening today; my slides are here: https://docs.google.com/presentation/d/1R3NRzH70ERjwebqecgt2OYKC8sIuB1_Xs7ENyWAXgjc/edit#slide=id.p

StroemPhi commented 1 month ago

From the NFDI4Chem perspective, the fixes proposed by @cmungall from slide 46 onwards would help a lot in the reuse of ChEBI.