ebi-chebi / ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
https://www.ebi.ac.uk/chebi
Creative Commons Attribution 4.0 International

Document why some xrefs are on some protonation states and not others #4492

Open cmungall opened 3 months ago

cmungall commented 3 months ago

Let me know if I should bundle these into an epic ticket. This is related to #4482. I like to break things into more manageable chunks. You might also want to consider grouping tickets into GitHub projects or applying common labels.

For this issue I assume everyone reading is familiar with the handling of protonation in CHEBI, and that the "biologist-friendly" state can be identified by some combination of the RHEA pH 7.3 mapping file and/or UniProt synonyms.

I think there might be an assumption that xrefs are normally placed on the biologist-friendly state, so I did an analysis. I took all entries in the RHEA file (https://ftp.expasy.org/databases/rhea/tsv/chebi_pH7_3_mapping.tsv), excluded singletons to get a true "protonation-variable" set, and then looked at the number of xrefs on each term to see whether xrefs tend to be placed on the biologist-friendly state or on another state.
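A minimal sketch of that analysis, for anyone who wants to reproduce it. It assumes the mapping file has been reduced to (ChEBI ID, pH 7.3 ChEBI ID) pairs and that xref counts per term are already available from a ChEBI export; the function names, the `CHEBI:X` placeholder and the counts are illustrative, not taken from the actual files.

```python
from collections import defaultdict


def protonation_variable_groups(mapping_rows):
    """Group ChEBI IDs by their pH 7.3 form and drop singleton groups."""
    groups = defaultdict(set)
    for chebi_id, ph73_id in mapping_rows:
        groups[ph73_id].add(chebi_id)
        groups[ph73_id].add(ph73_id)  # the pH 7.3 form belongs to its own group
    # A group with one member has no alternative protonation state.
    return {k: v for k, v in groups.items() if len(v) > 1}


def mean_xrefs_by_state(groups, xref_counts):
    """Return (mean xrefs on pH 7.3 forms, mean xrefs on the other forms)."""
    stable, other = [], []
    for ph73_id, members in groups.items():
        for member in members:
            bucket = stable if member == ph73_id else other
            bucket.append(xref_counts.get(member, 0))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(stable), mean(other)


# Illustrative input: glycine maps to its zwitterion; CHEBI:X is a
# hypothetical term that maps only to itself (a singleton).
rows = [
    ("CHEBI:15428", "CHEBI:57305"),
    ("CHEBI:57305", "CHEBI:57305"),
    ("CHEBI:X", "CHEBI:X"),
]
# Hypothetical xref counts per term.
counts = {"CHEBI:57305": 1, "CHEBI:15428": 3}

groups = protonation_variable_groups(rows)
stable_mean, other_mean = mean_xrefs_by_state(groups, counts)
```

Restricting `xref_counts` to a single source (for example only KEGG xrefs) gives the per-source comparison.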

It turns out that the non-stable forms tend to carry slightly more xrefs on average:

image

The general pattern seems to be the same when we look at individual xref sources like KEGG.

This adds additional complication to trying to roll up information to a protonation-agnostic form.

For example, for the biological concept of "glycine" (nominally an amino acid), GO and RHEA now both use the concept "glycine zwitterion" (CHEBI:57305) rather than the acid glycine (CHEBI:15428), because this is in the RHEA pH 7.3 mapping file. RHEA will also substitute the name "glycine zwitterion" with "glycine", making the situation both simpler and harder ("glycine" is the normal biological name for the protonation-agnostic chemical, but "glycine" is the primary label for a different ID in CHEBI!)

Now on top of this, if we want to get biologist-useful mappings as well as biologist-friendly labels we have to look across protonation-specific concepts:

| CHEBI ID | CHEBI label | UniProt label | stable at 7.3 | MetaCyc xrefs | KEGG xrefs |
| --- | --- | --- | --- | --- | --- |
| CHEBI:57305 | glycine zwitterion | glycine | Y | GLY | - |
| CHEBI:15428 | glycine | - | N | GLY | C00037,D00011 |

(in this case it's the acid that carries the majority of the xrefs)

It's not clear if there is a rationale for why the MetaCyc xref is applied to both and the KEGG xrefs are on the non-canonical form. You could argue that the KEGG one is on the acid because BRITE classifies it as an acid, but I think this reasoning would be incorrect, as I think many of the conceptual resources CHEBI maps to have a protonation-agnostic conception.

My hypothesis is that the placement is somewhat arbitrary or at least doesn't make a useful distinction, and groups like GO would be justified in rolling the mappings across protonation states - is that correct, or would there be a danger in doing that?

amalik01 commented 3 months ago

In ChEBI, we form cross-references to other databases based on exact synonym and structure matching. The MetaCyc x-ref is added to both the glycine zwitterion and glycine entries because the MetaCyc entry contains a structure that matches the zwitterion entry but synonyms that match the glycine entry, hence we add the x-ref to both entries.

MetaCyc entry:

image

Many of the databases ChEBI maps to are focused on neutral compounds rather than their ionized forms, so this is the main reason why the neutral forms have more x-refs than their ionized forms. A small group of databases (e.g. KEGG and MetaCyc) do contain ionized structures but use synonyms for the neutral form.

For example, in ChEBI there is a clear distinction between pyruvic acid (CHEBI:32816) and its conjugate base pyruvate (CHEBI:15361). We will not add the synonym 'pyruvic acid' to the latter since it would be incorrect. However, KEGG (https://www.genome.jp/dbget-bin/www_bget?cpd:C00022) will call it both pyruvic acid and pyruvate.

However, I do agree with you that there would be no harm replicating the x-refs across all protonation states of a compound.
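A minimal sketch of such a roll-up, using the glycine xrefs quoted earlier in this thread. It collects the xrefs of every protonation state onto the pH 7.3 form; the `roll_up_xrefs` name, the prefixed xref strings and the dictionary layout are assumptions made for illustration, not an existing ChEBI or GO tool.

```python
def roll_up_xrefs(xrefs_by_id, to_ph73):
    """Merge the xrefs of each protonation state onto its pH 7.3 form."""
    rolled = {}
    for chebi_id, xrefs in xrefs_by_id.items():
        # Terms missing from the mapping are kept under their own ID.
        target = to_ph73.get(chebi_id, chebi_id)
        rolled.setdefault(target, set()).update(xrefs)
    return rolled


# Xrefs as listed in the glycine example; the source prefixes are an
# assumed notation for telling the databases apart.
xrefs_by_id = {
    "CHEBI:57305": {"MetaCyc:GLY"},
    "CHEBI:15428": {"MetaCyc:GLY", "KEGG:C00037", "KEGG:D00011"},
}
# Mapping to the pH 7.3 form, as in the RHEA mapping file.
to_ph73 = {
    "CHEBI:15428": "CHEBI:57305",
    "CHEBI:57305": "CHEBI:57305",
}

rolled = roll_up_xrefs(xrefs_by_id, to_ph73)
```

Copying the merged set back onto every member of the group would give the "replicate across all protonation states" variant instead.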

cmungall commented 3 months ago

Thanks @amalik01. So indeed the placement is not particularly meaningful.

Can you also confirm the rules for deciding which of the protonated forms to apply has-role X metabolite relationships to?

amalik01 commented 3 months ago

@cmungall. The roles in ChEBI are currently assigned based on evidence found in the primary literature (peer-reviewed articles). In the past, we may have added the has-role metabolite relationship based on data found in other databases such as HMDB and MetaboLights. After discussions with the MetaboLights team, this practice has now stopped, since there is no guarantee that a compound annotated from a metabolomics experiment is actually a metabolite of the species being studied.

Most of the roles are added to the neutral compounds rather than their ionized forms unless the name of the ionized form is specifically mentioned in the primary literature.