Open colleenXu opened 10 months ago
Example: https://biothings.ncats.io/bindingdb/query?q=object.pubchem_cid:4585

```
"object": {
  "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=35254",
  "chebi": "7735",
  "chembl": "CHEMBL715",
  "drugbank": "DB00334",
  "inchi": "InChI=1S/C17H20N4S/c1-12-11-13-16(21-9-7-20(2)8-10-21)18-14-5-3-4-6-15(14)19-17(13)22-12/h3-6,11,19H,7-10H2,1-2H3",
  "inchikey": "KVWDHTXUZHCGIO-UHFFFAOYSA-N",
  "iuphar_grac_id": "47",
  "kegg": "C07322",
  "monomer_id": 35254,
  "name": "2-methyl-4-(4-methylpiperazin-1-yl)-10H-thieno[2,3-b][1,5]benzodiazepine::CHEMBL715::OLANZAPINE::Olansek::US8802672, Olanzapine::Zyprexa::olanzapine",
  "pubchem_cid": 4585,
  "pubchem_sid": 85753333,
  "smiles": "CN1CCN(CC1)C1=Nc2ccccc2Nc2sc(C)cc12",
  "zinc": "ZINC52957434"
},
```

* pubchem compound ID (CID) 4585 doesn't exist: https://pubchem.ncbi.nlm.nih.gov/#query=4585
* pubchem substance ID (SID) 85753333 [does exist](https://pubchem.ncbi.nlm.nih.gov/substance/85753333). It links to the [pubchem cid 135398745 (Olanzapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398745)
* I checked the chebi, chembl, drugbank, inchikey, and kegg IDs; these are all mapped IDs on the [pubchem cid 135398745 (Olanzapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398745) page
* could 4585 be from the ID `NSC_4585`? I see that ID as a "synonym" on the [pubchem cid 135398745 (Olanzapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398745) page
Other examples with basically the same behavior (notes here):
* from https://biothings.ncats.io/bindingdb/query?q=object.inchikey:YQCLAYRIYWYIKH-UHFFFAOYSA-N, see code chunk below for the object
  * [the pubchem cid (9579327)'s page](https://pubchem.ncbi.nlm.nih.gov/compound/9579327) has a different inchikey: YQCLAYRIYWYIKH-WGPBWIAQSA-N
  * Translator's Node Norm [doesn't recognize BindingDB's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AYQCLAYRIYWYIKH-UHFFFAOYSA-N&conflate=true) but [does recognize the pubchem cid's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AYQCLAYRIYWYIKH-WGPBWIAQSA-N&conflate=true)
  * the issue could be that there's multiple isomers/chemicals involved? Notice how there's two chembl IDs in object.chembl and object.name fields...

```
"object": {
  "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=50115522",
  "chembl": "CHEMBL341945CHEMBL106813",
  "inchi": "InChI=1S/C28H35F3N4O4/c1-19-9-16-35(37)20(2)24(19)26(36)33-17-12-27(3,13-18-33)34-14-10-22(11-15-34)25(32-38-4)21-5-7-23(8-6-21)39-28(29,30)31/h5-9,16,22H,10-15,17-18H2,1-4H3",
  "inchikey": "YQCLAYRIYWYIKH-UHFFFAOYSA-N",
  "monomer_id": 50115522,
  "name": "(2,4-Dimethyl-1-oxy-pyridin-3-yl)-{4-[methoxyimino-(4-trifluoromethoxy-phenyl)-methyl]-4'-methyl-[1,4']bipiperidinyl-1'-yl}-methanone::CHEMBL106813::CHEMBL341945",
  "pubchem_cid": 9579327,
  "pubchem_sid": 104009437,
  "smiles": "CO[N-][C+](C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(C)cc[n+]([O-])c1C)c1ccc(OC(F)(F)F)cc1",
  "zinc": "ZINC26848826"
},
```
* https://biothings.ncats.io/bindingdb/query?q=object.inchikey:ZUXABONWMNSFBN-UHFFFAOYSA-N, see code chunk below for the object. This object ALSO has the "incorrect" object.pubchem_cid field issue (listed in the previous comment as pubchem_cid [2818](https://biothings.ncats.io/bindingdb/query?q=object.pubchem_cid:2818))
  * pubchem compound ID (CID) 2818 doesn't exist: https://pubchem.ncbi.nlm.nih.gov/#query=2818
  * pubchem substance ID (SID) 49846683 [does exist](https://pubchem.ncbi.nlm.nih.gov/substance/49846683). It links to the [pubchem cid 135398737 (Clozapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398737)
  * but [pubchem cid 135398737 (Clozapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398737)'s page has a different inchikey: QZUDBNBUXVUHMW-UHFFFAOYSA-N
  * Translator's Node Norm seems to recognize [BindingDB's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AZUXABONWMNSFBN-UHFFFAOYSA-N&conflate=true) and [pubchem cid's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AQZUDBNBUXVUHMW-UHFFFAOYSA-N&conflate=true) as two separate entities
  * interestingly, BindingDB's inchikey does show up in a source url on [pubchem cid 135398737 (Clozapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398737)'s page (search for "MassBank of North America (MoNA)" and look at the urls)

```
"object": {
  "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=22869",
  "inchi": "InChI=1S/C18H19ClN4/c1-22-8-10-23(11-9-22)18-14-4-2-3-5-15(14)20-16-7-6-13(19)12-17(16)21-18/h2-7,12,21H,8-11H2,1H3",
  "inchikey": "ZUXABONWMNSFBN-UHFFFAOYSA-N",
  "iuphar_grac_id": "38",
  "monomer_id": 22869,
  "name": "6-chloro-10-(4-methylpiperazin-1-yl)-2,9-diazatricyclo[9.4.0.0^{3,8}]pentadeca-1,3(8),4,6,10,12,14-heptaene::CLOZARIL::Clozapine::Leponex::US10259786, Clozapine",
  "pubchem_cid": 2818,
  "pubchem_sid": 49846683,
  "smiles": "CN1CCN(CC1)C1=c2ccccc2=Nc2ccc(Cl)cc2N1",
  "zinc": "ZINC19796155"
},
```
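The InChIKey mismatches in the examples above fall into two patterns, and a small helper can tell them apart. This is only a sketch. It relies on the standard InChIKey layout (a 14-character first block hashed from the connectivity, a 10-character second block covering stereochemistry and other layers, then one protonation character). `split_inchikey` and `compare_inchikeys` are hypothetical names, not part of BioThings, Node Norm, or PubChem.

```python
from typing import Tuple


def split_inchikey(inchikey: str) -> Tuple[str, str, str]:
    """Split a standard InChIKey into (connectivity, other layers, protonation)."""
    parts = inchikey.strip().upper().split("-")
    if (
        len(parts) != 3
        or len(parts[0]) != 14
        or len(parts[1]) != 10
        or len(parts[2]) != 1
    ):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return parts[0], parts[1], parts[2]


def compare_inchikeys(bindingdb_key: str, pubchem_key: str) -> str:
    """Classify how two InChIKeys differ (hypothetical helper for triage)."""
    a = split_inchikey(bindingdb_key)
    b = split_inchikey(pubchem_key)
    if a == b:
        return "identical"
    if a[0] == b[0]:
        # Same skeleton; only stereo/isotope/protonation information differs.
        return "same-connectivity"
    return "different-connectivity"


# Keys taken from the two examples above (BindingDB value first, PubChem value second).
print(compare_inchikeys("YQCLAYRIYWYIKH-UHFFFAOYSA-N", "YQCLAYRIYWYIKH-WGPBWIAQSA-N"))
print(compare_inchikeys("ZUXABONWMNSFBN-UHFFFAOYSA-N", "QZUDBNBUXVUHMW-UHFFFAOYSA-N"))
```

The first pair shares its first block, which fits the isomer hypothesis for that record. The Clozapine pair differs in the first block, so the two keys describe differently connected structures and a stereo-insensitive match would not reconcile them.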
Sometimes the values seem to be concatenated strings of multiple IDs:
Examples:
For examples, click on any of the BioThings BindingDB links above, and look at the object.name value.
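The concatenated values can be pulled apart with a couple of small helpers. This is a minimal sketch based only on the patterns visible in the records above: it assumes `object.chembl` values are `CHEMBL` followed by digits with no separator, and that `object.name` uses `::` between names. Both function names are hypothetical, and neither comes from the BindingDB parser.

```python
import re
from typing import List

# Assumption: a ChEMBL ID is "CHEMBL" followed by one or more digits.
CHEMBL_ID = re.compile(r"CHEMBL\d+")


def split_chembl_ids(value: str) -> List[str]:
    """Return every ChEMBL ID found in a possibly concatenated string."""
    return CHEMBL_ID.findall(value)


def split_names(value: str) -> List[str]:
    """Split an object.name value on the '::' separator seen in the examples."""
    return [part.strip() for part in value.split("::") if part.strip()]


print(split_chembl_ids("CHEMBL341945CHEMBL106813"))
print(split_names("CLOZARIL::Clozapine::Leponex::US10259786, Clozapine"))
```

Splitting only recovers the individual strings. It doesn't say which ChEMBL ID or which name is the right one for the compound.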
And this is what I did to address Andy Crouse's original problem of "Nodes with no names from BioThings BindingDB":
(pasted from https://github.com/biothings/pending.api/issues/99#issuecomment-1703787808)
For now, I've changed the x-bte annotation for this resource https://github.com/NCATS-Tangerine/translator-api-registry/commit/022e8765adfed8a67feae9b8e3810c687c0e0e40:
(1) Use `object.inchikey`:
- covers a little over 96% of the resource (1394153 / 1438909)
- my hunch is that the INCHIKEY IDs are not completely incorrect, VS the pubchem cids are sometimes incorrect and I don't really know why
- VS `object.pubchem_cid` (used before) covers a little over 98% of the resource (1413051 / 1438909)
- other fields cover much less of the resource or aren't supported by Node Norm:
  - `object.pubchem_sid`: 1413131 but Node Norm doesn't seem to support this ID namespace right now
  - `object.chembl`: 631745. Has its own issues, see last comment
  - `object.kegg` (KEGG.COMPOUND): 33680
  - `object.chebi`: 25903
  - `object.drugbank`: 24177

(2) Retrieve `subject.name` and `object.name` fields for input_name/output_name behavior, if Node Norm doesn't retrieve info for the ID. Every document has those fields, but the names provided have issues (will be covered in a later post).
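The coverage figures and the ID/name fallback described in (1) and (2) can be sketched in a few lines. This only illustrates the logic and is not how the x-bte annotation or BTE implements it. The counts are copied from the list above, the `INCHIKEY:` prefix follows the Node Norm URLs earlier in this thread, and `coverage`, `pick_curie`, and `display_name` are hypothetical names.

```python
from typing import Optional

TOTAL_DOCS = 1438909  # total documents, from the counts above

# Documents that have each field, as reported above.
FIELD_COUNTS = {
    "object.inchikey": 1394153,
    "object.pubchem_cid": 1413051,
    "object.pubchem_sid": 1413131,
    "object.chembl": 631745,
    "object.kegg": 33680,
    "object.chebi": 25903,
    "object.drugbank": 24177,
}


def coverage(field: str) -> float:
    """Percentage of documents that have the given field."""
    return 100.0 * FIELD_COUNTS[field] / TOTAL_DOCS


def pick_curie(obj: dict) -> Optional[str]:
    """(1) Use object.inchikey as the ID, if the document has one."""
    inchikey = obj.get("inchikey")
    return f"INCHIKEY:{inchikey}" if inchikey else None


def display_name(obj: dict, normalized_label: Optional[str]) -> Optional[str]:
    """(2) Prefer the normalizer's label; otherwise fall back to object.name."""
    return normalized_label or obj.get("name")


for field in FIELD_COUNTS:
    print(f"{field}: {coverage(field):.2f}%")
```

For a document whose InChIKey isn't recognized by the normalizer, `display_name(obj, None)` returns the raw `object.name` string, so the naming issues mentioned above carry straight through to the node label.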
@everaldorodrigo @newgene @andrewsu
I think this would be a useful issue for @everaldorodrigo to dig into and address as much as possible.
- I've been assuming that the object.pubchem_sid value is a "correct" ID (when the document has this field, which is >98% of documents or 1413131 / 1438909)
- I'm not sure how many documents are affected
- I'm not sure if the other chemical (object) fields have similar issues
@colleenXu, considering the latest data released to the CI environment (May 2024), a total of 41678 out of 1795611 items don't have the field `object.pubchem_sid`.

For the items missing the `object.pubchem_sid` field, it seems the data source (.tsv file) has no value in the column `PubChem SID`, which the parser uses to fill the field `object.pubchem_sid`.

The partial data below is extracted from the data source. There are two lines: the header and an example of an item missing the field `PubChem SID`. Look for `PubChem SID` in the table below:
BindingDB Reactant_set_id | Ligand SMILES | Ligand InChI | Ligand InChI Key | BindingDB MonomerID | BindingDB Ligand Name | Target Name | Target Source Organism According to Curator or DataSource | Ki (nM) | IC50 (nM) | Kd (nM) | EC50 (nM) | kon (M-1-s-1) | koff (s-1) | pH | Temp (C) | Curation/DataSource | Article DOI | BindingDB Entry DOI | PMID | PubChem AID | Patent Number | Authors | Institution | Link to Ligand in BindingDB | Link to Target in BindingDB | Link to Ligand-Target Pair in BindingDB | Ligand HET ID in PDB | PDB ID(s) for Ligand-Target Complex | PubChem CID | PubChem SID | ChEBI ID of Ligand | ChEMBL ID of Ligand | DrugBank ID of Ligand | IUPHAR_GRAC ID of Ligand | KEGG ID of Ligand | ZINC ID of Ligand | Number of Protein Chains in Target (>1 implies a multichain complex) | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target 
Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt 
(TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of 
Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt (TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain | BindingDB Target Chain Sequence | PDB ID(s) of Target Chain | UniProt (SwissProt) Recommended Name of Target Chain | UniProt (SwissProt) Entry Name of Target Chain | UniProt (SwissProt) Primary ID of Target Chain | UniProt (SwissProt) Secondary ID(s) of Target Chain | UniProt (SwissProt) Alternative ID(s) of Target Chain | UniProt 
(TrEMBL) Submitted Name of Target Chain | UniProt (TrEMBL) Entry Name of Target Chain | UniProt (TrEMBL) Primary ID of Target Chain | UniProt (TrEMBL) Secondary ID(s) of Target Chain | UniProt (TrEMBL) Alternative ID(s) of Target Chain |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1187404 | CS(=O)(=O)N1CCC@Hc(=O)n(C4CCCC4)c3n2)C@@HC1 | InChI=1S/C20H27FN6O4S/c1-32(30,31)26-7-6-16(15(21)11-26)24-20-23-10-13-8-12(9-17(22)28)19(29)27(18(13)25-20)14-4-2-3-5-14/h8,10,14-16H,2-7,9,11H2,1H3,(H2,22,28)(H,23,24,25)/t15-,16-/m0/s1 | LGDFLYMYRBSZOR-HOTGVXAUSA-N | 370273 | BDBM467168::US10233188, Example 160 | Cyclin-dependent kinase 6/G1/S-specific cyclin-D1 [L188C] | Homo sapiens | 3.55 | US Patent | 10.7270/Q2KK9G2V | aid1806677 | US11396512 | Behenna, DC; Chen, P; Freeman-Cook, KD; Hoffman, RL; Jalaie, M; Nagata, A; Nair, SK; Ninkovic, S; Ornelas, MA; Palmer, CL; Rui, EY | Pfizer Inc. | http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=370273 | http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=99&target=Cyclin-dependent+kinase+6%2FG1%2FS-specific+cyclin-D1+%5BL188C%5D&column=ki&startPg=0&Increment=50&submit=Search | http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=370273&enzyme=Cyclin-dependent+kinase+6%2FG1%2FS-specific+cyclin-D1+%5BL188C%5D&column=ki&startPg=0&Increment=50&submit=Search | 2 | mehqllccevetirraypdanllndrvlramlkaeetcapsvsyfkcvqkevlpsmrkivatwmlevceeqkceeevfplamnyldrflslepvkksrlqllgatcmfvaskmketipltaeklciytdgsirpeellqmelllvnklkwnlaamtphdfiehflskmpeaeenkqiirkhaqtfvascatdvkfisnppsmvaagsvvaavqglnlrspnnflsyyrltrflsrvikcdpdclracqeqieallesslrqaqqnmdpkaaeeeeeeeeevdlactptdvrdvdi | G1/S-specific cyclin-D1 | CCND1_HUMAN | P24385 | Q6LEF0 | MEKDGLCRADQQYECVAEIGEGAYGKVFKARDLKNGGRFVALKRVRVQTGEEGMPLSTIREVAVLRHLETFEHPNVVRLFDVCTVSRTDRETKLTLVFEHVDQDLTTYLDKVPEPGVPTETIKDMMFQLLRGLDFLHSHRVVHRDLKPQNILVTSSGQIKLADFGLARIYSFQMALTSVVVTLWYRAPEVLLQSSYATPVDLWSVGCIFAEMFRRKPLFRGSSDVDQLGKILDVIGLPGEEDWPRDVALPRQAFHSKSAQPIEKFVTDIDELGKDLLLKCLTFNPAKRISAYSALSHPYFQDLERCKENLDSHLPPSQNTSELNTA | 1G3N,1JOW,1XO2,2EUF,2F2C,3NUP,3NUX,4AUA,4EZ5,4TTH,5L2I,5L2S,5L2T,6OQL,6OQO | Cyclin-dependent kinase 6 | CDK6_HUMAN | Q00534 | A4D1G0 |
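Counting rows like this one in the source file could be done along the following lines. This is a rough sketch, not the parser's code. It assumes a tab-separated file with a single header line containing a `PubChem SID` column, and `count_missing` is a hypothetical helper name. The column is looked up by position because the header repeats several other column names.

```python
import csv
import io
from typing import Tuple


def count_missing(tsv_text: str, column: str = "PubChem SID") -> Tuple[int, int]:
    """Return (rows with an empty `column`, total data rows) for TSV text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    idx = header.index(column)  # first (and, for this column, only) occurrence
    missing = total = 0
    for row in reader:
        total += 1
        # A short row or a blank cell both count as "no value".
        if idx >= len(row) or not row[idx].strip():
            missing += 1
    return missing, total


# Tiny made-up sample with the same column names; not real BindingDB data.
sample = (
    "BindingDB Reactant_set_id\tPubChem CID\tPubChem SID\n"
    "1\t10\t20\n"
    "2\t\t\n"
    "3\t30\n"
)
print(count_missing(sample))
```

For the real file, the text would be streamed from disk rather than held in a string, but the per-row check stays the same.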
Do you think we should use another field for the operations instead of `object.pubchem_sid`?
Sorry for the late response.
I think it's okay that some rows don't have a pubchem SID. I've been using the INCHIKEY instead (see the earlier post where I explain that I think it covered most of the resource and was somewhat reliable but still had a problem).
(CC @newgene @erikyao for pending BioThings, @rjawesome as the original person who worked on the parser https://github.com/biothings/pending.api/issues/70)
Andy Crouse from Translator's UI team pointed out that BTE was returning result chemical Nodes that didn't have names, with edges from BioThings BindingDB (Translator Slack link).
When investigating this, I discovered cases where object fields (the chemical), specifically ID and name ones, seem incorrect or problematic (see the comments). Some notes:
- I've been assuming that the `object.pubchem_sid` value is a "correct" ID (when the document has this field, which is >98% of documents or 1413131 / 1438909)