ebi-chebi / ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
https://www.ebi.ac.uk/chebi
Creative Commons Attribution 4.0 International

SPARQL query found multiple potentially duplicated terms #4365

Open shawntanzk opened 1 year ago

shawntanzk commented 1 year ago

Hi ChEBI team,

Based on https://github.com/ebi-chebi/ChEBI/issues/4364, I decided to run some queries on Ubergraph to check for more potential duplicates. I've spot-checked a few terms from the queries below and visualised a few on graphs to make sure my queries do not miss anything, and they seem to be accurate (I might be wrong, as I'm not the greatest at SPARQL and I don't fully understand ChEBI). I'm also not a chemist, so I'll leave it to you to figure out which terms are actually duplicated. Happy to provide any details should you need them. I can provide the list of terms too, but I figured you can get it from the YASGUI endpoints linked below either way.

Query 1 - https://api.triplydb.com/s/ct_cqGEzF
Checks for all pairs of terms that have the same InChI, InChIKey and SMILES, have no relations between them, and are not alt IDs of each other.
Results: 84 pairs

SPARQL Query 1

```sparql
PREFIX owl:
PREFIX inchi:
PREFIX inchikey:
PREFIX dcterms:
PREFIX obo:
PREFIX rdfs:
PREFIX smiles:
PREFIX alt:
SELECT ?t1 ?t2 ?p WHERE {
  ?t1 rdfs:isDefinedBy .
  ?t2 rdfs:isDefinedBy .
  ?t1 inchi: ?inchi .
  ?t1 inchikey: ?inchikey .
  ?t1 smiles: ?smiles .
  ?t2 inchi: ?inchi .
  ?t2 inchikey: ?inchikey .
  ?t2 smiles: ?smiles .
  FILTER NOT EXISTS {
    { ?t1 ?p ?t2 . }
    UNION
    { ?t2 ?p ?t1 . }
  }
  FILTER NOT EXISTS {
    { ?t1 alt: ?t2 . }
    UNION
    { ?t2 alt: ?t1 . }
  }
  FILTER (?t1 != ?t2)
}
```

Query 2 - https://api.triplydb.com/s/kI7V1CK9p
Same as Query 1, but without the requirement that the SMILES match.
Results: 358 pairs

SPARQL Query 2

```sparql
PREFIX owl:
PREFIX inchi:
PREFIX inchikey:
PREFIX dcterms:
PREFIX obo:
PREFIX rdfs:
PREFIX tautomer:
PREFIX alt:
SELECT ?t1 ?t2 ?p WHERE {
  ?t1 rdfs:isDefinedBy .
  ?t2 rdfs:isDefinedBy .
  ?t1 inchi: ?inchi .
  ?t1 inchikey: ?inchikey .
  ?t2 inchi: ?inchi .
  ?t2 inchikey: ?inchikey .
  FILTER NOT EXISTS {
    { ?t1 ?p ?t2 . }
    UNION
    { ?t2 ?p ?t1 . }
  }
  FILTER NOT EXISTS {
    { ?t1 alt: ?t2 . }
    UNION
    { ?t2 alt: ?t1 . }
  }
  FILTER (?t1 != ?t2)
}
```
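For anyone who wants to sanity-check the pairing logic without the endpoint, here is a minimal Python sketch of what the two queries are meant to do. It is not the query that was run: the term IDs, structure strings and the `LINKED` set are made-up toy data, and `candidate_pairs` is a hypothetical helper name. It groups terms by the identifiers they share and drops pairs that are already linked by a relation or an alt ID.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: IDs and structure strings are invented for illustration.
TERMS = {
    "T:1": {"inchi": "InChI=1S/A", "inchikey": "KEY-A", "smiles": "C"},
    "T:2": {"inchi": "InChI=1S/A", "inchikey": "KEY-A", "smiles": "C"},
    "T:3": {"inchi": "InChI=1S/A", "inchikey": "KEY-A", "smiles": "[CH4]"},
    "T:4": {"inchi": "InChI=1S/B", "inchikey": "KEY-B", "smiles": "O"},
}

# Unordered pairs that already have a relation or alt-ID link (also invented).
LINKED = {frozenset({"T:1", "T:3"})}


def candidate_pairs(terms, linked, keys):
    """Return sorted (a, b) pairs that share every field in `keys` and are not linked."""
    groups = defaultdict(list)
    for term_id, props in terms.items():
        # Terms with identical values for all chosen keys land in the same group.
        groups[tuple(props[k] for k in keys)].append(term_id)

    pairs = []
    for members in groups.values():
        # combinations() yields each unordered pair exactly once.
        for a, b in combinations(sorted(members), 2):
            if frozenset({a, b}) not in linked:
                pairs.append((a, b))
    return sorted(pairs)


# Query 1 analogue: same InChI, InChIKey and SMILES.
query1 = candidate_pairs(TERMS, LINKED, ("inchi", "inchikey", "smiles"))
# Query 2 analogue: same InChI and InChIKey only.
query2 = candidate_pairs(TERMS, LINKED, ("inchi", "inchikey"))

print(query1)
print(query2)
```

One difference worth noting: the sketch reports each unordered pair once, whereas a filter of the form `?t1 != ?t2` matches both orderings of a pair, so row counts from the SPARQL results may need halving before comparing.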

Hope this helps and thanks :)

Tagging @tommycarstensen too

shawntanzk commented 1 year ago

I added labels to the query, and it seems some of the pairs might actually be different, while others might be duplicates: https://api.triplydb.com/s/iQW_wmvz9