ebi-chebi / ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
https://www.ebi.ac.uk/chebi
Creative Commons Attribution 4.0 International

Problematic nodes/edges #4335

Closed sorenwacker closed 6 months ago

sorenwacker commented 1 year ago

Hi,

following up on my email conversation with @amalik01.

I created this tool to mine the ChEBI graph and identified some problematic nodes and edges, and I suggested removing some of them. Apparently the suggestions were already helpful. I am trying to find groups of ChEBI nodes that belong to the same structure (enantiomers, tautomers) and to select one representative structure for each group. To do so, I use the is_enantiomer, is_tautomer and is_a edges in both directions (incoming and outgoing).
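To make the grouping step concrete, here is a minimal sketch of how such groups could be computed as connected components when edge direction is ignored. It is not the code from my tool: the function name `connected_groups`, the triple format `(source, relation, target)` and the toy node names are assumptions for illustration, and the relation names are simply the ones used in this issue.

```python
from collections import defaultdict, deque

# Relation names as written in this issue; the real export may spell them differently.
GROUPING_RELATIONS = {"is_enantiomer", "is_tautomer", "is_a"}


def connected_groups(edges, relations=GROUPING_RELATIONS):
    """Group nodes linked by the chosen relations, ignoring edge direction.

    `edges` is an iterable of (source, relation, target) triples.
    Returns a list of node sets, largest group first.
    """
    neighbours = defaultdict(set)
    for source, relation, target in edges:
        if relation in relations:
            # Store both directions so incoming and outgoing edges count equally.
            neighbours[source].add(target)
            neighbours[target].add(source)

    seen = set()
    groups = []
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        group = set()
        while queue:
            node = queue.popleft()
            group.add(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        groups.append(group)
    return sorted(groups, key=len, reverse=True)


# Toy data with made-up node names, not real ChEBI entries.
toy_edges = [
    ("A", "is_enantiomer", "B"),
    ("C", "is_a", "A"),
    ("D", "is_tautomer", "E"),
    ("F", "has_role", "A"),  # ignored: not one of the grouping relations
]
groups = connected_groups(toy_edges)
```

With this approach a single wrong is_a edge is enough to merge two otherwise unrelated groups, which is why the large subgraphs listed further down are a useful signal.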


First, I remove nodes that represent patterns of compounds, indicated by SMILES strings that contain * or are missing altogether.

The problem I am facing is that some structures are connected by an is_a link even though they should not be linked that way. For example, I had to remove all the di- and other oligo-peptides because they would always point to the mono-peptide.
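As a rough illustration of this filter (a sketch only, assuming the SMILES strings are available as a plain dictionary keyed by node ID; the helper names and the toy IDs are hypothetical):

```python
def is_pattern_node(smiles):
    """Return True if the SMILES is missing/empty or contains a '*' wildcard."""
    return not smiles or "*" in smiles


def drop_pattern_nodes(smiles_by_id):
    """Keep only nodes that carry a fully specified SMILES string."""
    return {
        node_id: smiles
        for node_id, smiles in smiles_by_id.items()
        if not is_pattern_node(smiles)
    }


# Toy data with made-up IDs, not real ChEBI entries.
toy_smiles = {
    "X:1": "CCO",        # concrete structure, kept
    "X:2": "*C(=O)O",    # wildcard atom, treated as a pattern
    "X:3": None,         # no SMILES at all
    "X:4": "",           # empty string, treated like a missing SMILES
}
kept = drop_pattern_nodes(toy_smiles)
```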

Here are the largest subgraphs that I get.

CHEBI:175256 124
CHEBI:192579 79
CHEBI:61313 65
CHEBI:33313 30
CHEBI:64961 30
CHEBI:18019 28
CHEBI:184013 28
CHEBI:86071 25
CHEBI:188921 23
CHEBI:17719 19
CHEBI:29708 17
CHEBI:15971 17
CHEBI:49140 15
CHEBI:16375 15

To avoid overwhelming you, I only report the most severe cases.

Many nodes are linked to CHEBI:23053 catechin (graph CHEBI:175256).

However, CHEBI:23053 is given a specific SMILES string, yet CHEBI:183094 is linked as an instance of it.


In reality, CHEBI:183094 contains CHEBI:23053, but CHEBI:23053 has no R group. They are actually two different compounds.

The problem is that particular instances and patterns (groups) of compounds are not properly identifiable in the database, which comes from linguistic ambiguity. When chemists talk, they probably know from the context whether they mean a particular structure or a pattern. The database should carefully distinguish between the two: an entity should be an instance or a group, but not both. There should also be another kind of link, 'is_derivative' (maybe) or 'contains', to link peptides and di-peptides. That would be my suggestion to solve this.

The way I see it, Try-Ala is not a Try, but Try-Ala contains Try. This would make it easier to mine the graph. Right now, I have to use a lot of heuristics to do it properly.
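One heuristic that follows from the "instance or group, but not both" rule can be sketched as below: flag every is_a edge whose parent carries a fully specified SMILES string, since a concrete structure is then being used as a class. This is only an illustration under my own assumptions (the function names, the triple format and the toy IDs are hypothetical, and the SMILES are simple stand-ins rather than ChEBI data); flagged edges are candidates for manual review, not confirmed errors.

```python
def is_concrete(smiles):
    """True if a SMILES string is present and contains no '*' wildcard."""
    return bool(smiles) and "*" not in smiles


def suspicious_is_a_edges(edges, smiles_by_id):
    """Return (child, parent) pairs where both ends of an is_a edge are concrete.

    `edges` is an iterable of (child, relation, parent) triples.
    """
    flagged = []
    for child, relation, parent in edges:
        if relation != "is_a":
            continue
        if is_concrete(smiles_by_id.get(child)) and is_concrete(smiles_by_id.get(parent)):
            flagged.append((child, parent))
    return flagged


# Toy data with made-up IDs: a dipeptide wrongly linked to a single amino acid.
toy_smiles = {
    "toy:dipeptide": "NCC(=O)NCC(=O)O",
    "toy:amino_acid": "NCC(=O)O",
    "toy:amino_acid_class": "*C(N)C(=O)O",  # pattern with an R group
}
toy_edges = [
    ("toy:dipeptide", "is_a", "toy:amino_acid"),         # concrete is_a concrete
    ("toy:amino_acid", "is_a", "toy:amino_acid_class"),  # concrete is_a pattern
]
flagged = suspicious_is_a_edges(toy_edges, toy_smiles)
```

In the toy data only the dipeptide edge is flagged; with a 'contains' relation it would not need to be an is_a edge in the first place.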

I created the subgraphs of the above-mentioned entities in my repository:

https://github.com/sorenwacker/chebi-tools, in the analytics/Problematic-Nodes-Edges sub-folder.