TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

Two Cisplatins #127

Open cbizon opened 1 year ago

cbizon commented 1 year ago

We have 2 different "Cisplatin" entries in Babel/NN.

A main entry: https://nodenormalization-sri.renci.org/1.3/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A5460033&conflate=true

And then a second entry whose clique contains only a single CHEMBL identifier: https://nodenormalization-sri.renci.org/1.3/get_normalized_nodes?curie=CHEMBL.COMPOUND%3ACHEMBL2068237&conflate=true

You may at first think that CHEMBL is just not integrating, but the PUBCHEM one above actually contains a CHEMBL ID.
The actual problem is that CHEMBL contains two identifiers for CISPLATIN:

https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL2068237/
https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL11359/
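To check which CHEMBL identifiers end up in each clique, something like the following sketch could be used. The endpoint and query parameters are taken from the URLs in this issue; the response shape (the queried CURIE mapping to an object with an `equivalent_identifiers` list, or to null when unknown) is an assumption about the NodeNorm API, and the helper names `build_url`, `fetch`, and `chembl_ids` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the URLs quoted in this issue.
NODENORM_URL = "https://nodenormalization-sri.renci.org/1.3/get_normalized_nodes"


def build_url(curie, conflate=True):
    """Build the get_normalized_nodes URL for a single CURIE."""
    query = urlencode({"curie": curie, "conflate": "true" if conflate else "false"})
    return f"{NODENORM_URL}?{query}"


def fetch(curie):
    """Fetch and parse the NodeNorm response (requires network access)."""
    with urlopen(build_url(curie)) as response:
        return json.load(response)


def chembl_ids(response, curie):
    """Return the CHEMBL.COMPOUND identifiers in the clique for `curie`.

    Assumption: the response maps the queried CURIE to an object with an
    "equivalent_identifiers" list of {"identifier": ...} entries, and to
    null (None) when the CURIE is not known.
    """
    entry = response.get(curie) or {}
    return [
        e["identifier"]
        for e in entry.get("equivalent_identifiers", [])
        if e["identifier"].startswith("CHEMBL.COMPOUND:")
    ]
```

For example, `chembl_ids(fetch("PUBCHEM.COMPOUND:5460033"), "PUBCHEM.COMPOUND:5460033")` would list the CHEMBL identifiers merged into the main entry, and the same call with `CHEMBL.COMPOUND:CHEMBL2068237` would show whether that identifier is still on its own.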

From the CHEMBL pages it's not clear to me what the difference is. They link to PubChem SIDs that make it look like a chiral difference, but all of those SIDs link to the same CID (which is not the PubChem CID above, due to charge state), and none of these PubChem entries call themselves cisplatin.

Fundamentally, it seems as though CHEMBL does not handle metal-containing compounds well. From their FAQ:

Can you provide more details on the removal of Metal-Containing compounds?

The molfiles and images of a proportion of metal-containing compounds were removed from the ChEMBL interface and downloads set in ChEMBL_17. This was partially due to some of these compounds having coordinated metal bonds. As InChI limitations are such that these coordinate bonds could not generate a Standard InChI, our main compound indicator of uniqueness in ChEMBL, it was decided to exclude the structures altogether. The compound image on the interface was replaced with an icon that shows it is a metal-containing compound and the molfiles were removed from the download set on the FTP site. We will retain the molecular formula in both the download files and on the ChEMBL interface, so that the elemental make-up of the compound is visible. This change does not affect the storage or display of the associated bioactivity data for these compounds.

Without any molfile or other structure, I'm not sure how we're supposed to link CHEMBL2068237 to anything. I'll dig around to see if I can figure out how CHEMBL11359 is getting merged.

cbizon commented 1 year ago

For what it's worth, it also seems like Babel is combining cisplatin and transplatin, which are cis/trans (geometric) isomers of one another rather than the same compound.

cbizon commented 1 year ago

In terms of how the CHEMBL identifiers are coming in: CHEMBL11359 is being pulled in via a link from DrugCentral. CHEMBL2068237 is not being pulled in at all.

For most CHEMBL mappings we rely on UniChem, but there are no InChIs for these compounds because of the metal.
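As a rough illustration of why one identifier ends up in the main clique and the other stays alone, here is a toy union-find sketch of clique building from cross-reference links. This is not Babel's actual implementation. `DRUGCENTRAL:PLACEHOLDER` is a stand-in rather than a real identifier, and the DrugCentral-to-PubChem link is an assumption added only so the example has something to merge into.

```python
def build_cliques(identifiers, links):
    """Group identifiers into cliques using union-find over pairwise links."""
    parent = {i: i for i in identifiers}

    def find(x):
        # Walk up to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    cliques = {}
    for i in identifiers:
        cliques.setdefault(find(i), set()).add(i)
    return list(cliques.values())


ids = [
    "PUBCHEM.COMPOUND:5460033",
    "CHEMBL.COMPOUND:CHEMBL11359",
    "CHEMBL.COMPOUND:CHEMBL2068237",
    "DRUGCENTRAL:PLACEHOLDER",  # placeholder, not a real DrugCentral ID
]
links = [
    # Described in this thread: DrugCentral links to CHEMBL11359.
    ("DRUGCENTRAL:PLACEHOLDER", "CHEMBL.COMPOUND:CHEMBL11359"),
    # Assumed for illustration: DrugCentral also links to the PubChem CID.
    ("DRUGCENTRAL:PLACEHOLDER", "PUBCHEM.COMPOUND:5460033"),
]
cliques = build_cliques(ids, links)
```

With no InChI-based UniChem mapping and no other cross-reference, nothing in `links` mentions CHEMBL2068237, so it remains a clique of one.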