biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://api.bte.ncats.io
Apache License 2.0

Issues with BioThings BindingDB object fields #717

Open colleenXu opened 10 months ago

colleenXu commented 10 months ago

(CC @newgene @erikyao for pending BioThings, @rjawesome as the original person who worked on the parser https://github.com/biothings/pending.api/issues/70)

Andy Crouse from Translator's UI team pointed out that BTE was returning result chemical Nodes that didn't have names, with edges from BioThings BindingDB (Translator Slack link).

When investigating this, I discovered cases where object fields (the chemical), specifically ID and name ones, seem incorrect or problematic (see the comments). Some notes:

colleenXu commented 10 months ago

object.pubchem_cid


Example: https://biothings.ncats.io/bindingdb/query?q=object.pubchem_cid:4585

```
"object": {
  "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=35254",
  "chebi": "7735",
  "chembl": "CHEMBL715",
  "drugbank": "DB00334",
  "inchi": "InChI=1S/C17H20N4S/c1-12-11-13-16(21-9-7-20(2)8-10-21)18-14-5-3-4-6-15(14)19-17(13)22-12/h3-6,11,19H,7-10H2,1-2H3",
  "inchikey": "KVWDHTXUZHCGIO-UHFFFAOYSA-N",
  "iuphar_grac_id": "47",
  "kegg": "C07322",
  "monomer_id": 35254,
  "name": "2-methyl-4-(4-methylpiperazin-1-yl)-10H-thieno[2,3-b][1,5]benzodiazepine::CHEMBL715::OLANZAPINE::Olansek::US8802672, Olanzapine::Zyprexa::olanzapine",
  "pubchem_cid": 4585,
  "pubchem_sid": 85753333,
  "smiles": "CN1CCN(CC1)C1=Nc2ccccc2Nc2sc(C)cc12",
  "zinc": "ZINC52957434"
},
```

* pubchem compound ID (CID) 4585 doesn't exist: https://pubchem.ncbi.nlm.nih.gov/#query=4585
* pubchem substance ID (SID) 85753333 [does exist](https://pubchem.ncbi.nlm.nih.gov/substance/85753333). It links to the [pubchem cid 135398745 (Olanzapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398745)
* I checked the chebi, chembl, drugbank, inchikey, and kegg IDs; these are all mapped IDs on the [pubchem cid 135398745 (Olanzapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398745) page
* could 4585 be from the ID `NSC_4585`? I see that ID as a "synonym" on the [pubchem cid 135398745 (Olanzapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398745) page

Other examples with basically the same behavior (notes here):

colleenXu commented 10 months ago

object.inchikey

example 1

* from https://biothings.ncats.io/bindingdb/query?q=object.inchikey:YQCLAYRIYWYIKH-UHFFFAOYSA-N, see code chunk below for the object
* [the pubchem cid (9579327)'s page](https://pubchem.ncbi.nlm.nih.gov/compound/9579327) has a different inchikey: YQCLAYRIYWYIKH-WGPBWIAQSA-N
* Translator's Node Norm [doesn't recognize BindingDB's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AYQCLAYRIYWYIKH-UHFFFAOYSA-N&conflate=true) but [does recognize the pubchem cid's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AYQCLAYRIYWYIKH-WGPBWIAQSA-N&conflate=true)
* the issue could be that there's multiple isomers/chemicals involved? Notice how there's two chembl IDs in object.chembl and object.name fields...

```
"object": {
  "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=50115522",
  "chembl": "CHEMBL341945CHEMBL106813",
  "inchi": "InChI=1S/C28H35F3N4O4/c1-19-9-16-35(37)20(2)24(19)26(36)33-17-12-27(3,13-18-33)34-14-10-22(11-15-34)25(32-38-4)21-5-7-23(8-6-21)39-28(29,30)31/h5-9,16,22H,10-15,17-18H2,1-4H3",
  "inchikey": "YQCLAYRIYWYIKH-UHFFFAOYSA-N",
  "monomer_id": 50115522,
  "name": "(2,4-Dimethyl-1-oxy-pyridin-3-yl)-{4-[methoxyimino-(4-trifluoromethoxy-phenyl)-methyl]-4'-methyl-[1,4']bipiperidinyl-1'-yl}-methanone::CHEMBL106813::CHEMBL341945",
  "pubchem_cid": 9579327,
  "pubchem_sid": 104009437,
  "smiles": "CO[N-][C+](C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(C)cc[n+]([O-])c1C)c1ccc(OC(F)(F)F)cc1",
  "zinc": "ZINC26848826"
},
```

example 2

* https://biothings.ncats.io/bindingdb/query?q=object.inchikey:ZUXABONWMNSFBN-UHFFFAOYSA-N, see code chunk below for the object. This object ALSO has the "incorrect" object.pubchem_cid field issue (listed in the previous comment as pubchem_cid [2818](https://biothings.ncats.io/bindingdb/query?q=object.pubchem_cid:2818))
* pubchem compound ID (CID) 2818 doesn't exist: https://pubchem.ncbi.nlm.nih.gov/#query=2818
* pubchem substance ID (SID) 49846683 [does exist](https://pubchem.ncbi.nlm.nih.gov/substance/49846683). It links to the [pubchem cid 135398737 (Clozapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398737)
* but [pubchem cid 135398737 (Clozapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398737)'s page has a different inchikey: QZUDBNBUXVUHMW-UHFFFAOYSA-N
* Translator's Node Norm seems to recognize [BindingDB's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AZUXABONWMNSFBN-UHFFFAOYSA-N&conflate=true) and [pubchem cid's inchikey](https://nodenorm.test.transltr.io/1.3/get_normalized_nodes?curie=INCHIKEY%3AQZUDBNBUXVUHMW-UHFFFAOYSA-N&conflate=true) as two separate entities
* interestingly, BindingDB's inchikey does show up in a source url on [pubchem cid 135398737 (Clozapine)](https://pubchem.ncbi.nlm.nih.gov/compound/135398737)'s page (search for "MassBank of North America (MoNA)" and look at the urls)

```
"object": {
  "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=22869",
  "inchi": "InChI=1S/C18H19ClN4/c1-22-8-10-23(11-9-22)18-14-4-2-3-5-15(14)20-16-7-6-13(19)12-17(16)21-18/h2-7,12,21H,8-11H2,1H3",
  "inchikey": "ZUXABONWMNSFBN-UHFFFAOYSA-N",
  "iuphar_grac_id": "38",
  "monomer_id": 22869,
  "name": "6-chloro-10-(4-methylpiperazin-1-yl)-2,9-diazatricyclo[9.4.0.0^{3,8}]pentadeca-1,3(8),4,6,10,12,14-heptaene::CLOZARIL::Clozapine::Leponex::US10259786, Clozapine",
  "pubchem_cid": 2818,
  "pubchem_sid": 49846683,
  "smiles": "CN1CCN(CC1)C1=c2ccccc2=Nc2ccc(Cl)cc2N1",
  "zinc": "ZINC19796155"
},
```
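The two InChIKey mismatches above can be told apart from the keys alone. An InChIKey has three hyphen-separated blocks: a 14-character hash of the connectivity, a 10-character block (a hash of the remaining layers such as stereochemistry and isotopes, followed by a standard/non-standard flag and a version character), and one protonation character. Below is a minimal sketch of a block-by-block comparison; the function names are illustrative and are not part of BTE or the BindingDB parser.

```python
def split_inchikey(key: str) -> tuple[str, str, str]:
    """Split an InChIKey into its three hyphen-separated blocks."""
    parts = key.strip().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return parts[0], parts[1], parts[2]


def compare_inchikeys(a: str, b: str) -> str:
    """Report the first block in which two InChIKeys differ."""
    a_conn, a_rest, a_prot = split_inchikey(a)
    b_conn, b_rest, b_prot = split_inchikey(b)
    if a_conn != b_conn:
        return "different connectivity block"
    if a_rest != b_rest:
        return "same connectivity block, different second block"
    if a_prot != b_prot:
        return "same first two blocks, different protonation"
    return "identical"


# Example 1: BindingDB's key vs the key on the pubchem cid 9579327 page
print(compare_inchikeys("YQCLAYRIYWYIKH-UHFFFAOYSA-N", "YQCLAYRIYWYIKH-WGPBWIAQSA-N"))
# Example 2: BindingDB's key vs the key on the pubchem cid 135398737 page
print(compare_inchikeys("ZUXABONWMNSFBN-UHFFFAOYSA-N", "QZUDBNBUXVUHMW-UHFFFAOYSA-N"))
```

For example 1 the connectivity blocks match and only the second block differs, which is consistent with the "multiple isomers" guess. For example 2 the connectivity blocks themselves differ, so the two keys describe structures that InChI treats as different skeletons.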

colleenXu commented 10 months ago

object.chembl

Sometimes the values seem to be concatenated strings of multiple IDs:

Examples:
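One case already shown in this thread is the object.chembl value `CHEMBL341945CHEMBL106813` from example 1 of the object.inchikey comment. A minimal sketch of how such a value could be split back into individual IDs, assuming every ID has the form `CHEMBL` followed by digits (the helper name is hypothetical, not an existing parser function):

```python
import re

# Assumption: each ChEMBL ID is the literal prefix "CHEMBL" followed by digits.
CHEMBL_ID = re.compile(r"CHEMBL\d+")


def split_chembl_ids(value: str) -> list[str]:
    """Return every ChEMBL ID found in a possibly concatenated string."""
    return CHEMBL_ID.findall(value)


print(split_chembl_ids("CHEMBL341945CHEMBL106813"))
# ['CHEMBL341945', 'CHEMBL106813']
print(split_chembl_ids("CHEMBL715"))
# ['CHEMBL715']
```

Returning a list in both cases would let downstream code treat single and multiple IDs the same way.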

colleenXu commented 10 months ago

object.name

For examples, click on any of the BioThings BindingDB links above, and look at the object.name value.
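In the objects pasted above, object.name looks like several names joined with `::`, mixed with database IDs such as `CHEMBL715`. A rough sketch of splitting such a value into candidate names follows. The `::` separator and the ID patterns are inferred from the examples in this thread, not from BindingDB documentation, and the helper name is made up for illustration.

```python
import re

# Assumption: entries that are only a ChEMBL or BindingDB monomer ID are not names.
ID_LIKE = re.compile(r"^(CHEMBL\d+|BDBM\d+)$")


def split_names(value: str) -> list[str]:
    """Split a '::'-joined name string and drop entries that look like bare IDs."""
    parts = [p.strip() for p in value.split("::")]
    return [p for p in parts if p and not ID_LIKE.match(p)]


name = (
    "2-methyl-4-(4-methylpiperazin-1-yl)-10H-thieno[2,3-b][1,5]benzodiazepine"
    "::CHEMBL715::OLANZAPINE::Olansek::US8802672, Olanzapine::Zyprexa::olanzapine"
)
for candidate in split_names(name):
    print(candidate)
```

This only separates the candidates. Picking one of them as a display name (for example, preferring a short trade or generic name over the systematic name) would still need its own rule.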

colleenXu commented 10 months ago

And this is what I did to address Andy Crouse's original problem of "Nodes with no names from BioThings BindingDB":

(pasted from https://github.com/biothings/pending.api/issues/99#issuecomment-1703787808)

For now, I've changed the x-bte annotation for this resource https://github.com/NCATS-Tangerine/translator-api-registry/commit/022e8765adfed8a67feae9b8e3810c687c0e0e40:

(1) Use object.inchikey:

  • covers a little over 96% of the resource (1394153 / 1438909)
  • my hunch is that the InChIKey IDs are not completely incorrect, whereas the pubchem cids are sometimes incorrect and I don't really know why
  • by comparison, object.pubchem_cid (used before) covers a little over 98% of the resource (1413051 / 1438909)
  • other fields cover much less of the resource or aren't supported by Node Norm:
    • object.pubchem_sid: 1413131 but Node Norm doesn't seem to support this ID namespace right now
    • object.chembl: 631745. Has its own issues, see last comment
    • object.kegg (KEGG.COMPOUND): 33680
    • object.chebi: 25903
    • object.drugbank: 24177

(2) Retrieve subject.name and object.name fields for input_name/output_name behavior, if Node Norm doesn't retrieve info for the ID. Every document has those fields, but the names provided have issues (will be covered in a later post).
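The coverage figures in (1) are plain ratios of the per-field document counts to the total document count. A small sketch that recomputes them from the numbers quoted above (the variable and function names are illustrative only):

```python
TOTAL_DOCS = 1438909  # total documents in the resource, as quoted above

# Number of documents that have each field, as quoted above.
FIELD_COUNTS = {
    "object.inchikey": 1394153,
    "object.pubchem_cid": 1413051,
    "object.pubchem_sid": 1413131,
    "object.chembl": 631745,
    "object.kegg": 33680,
    "object.chebi": 25903,
    "object.drugbank": 24177,
}


def coverage_percent(count: int, total: int = TOTAL_DOCS) -> float:
    """Percentage of documents that carry a given field."""
    return 100.0 * count / total


for field, count in sorted(FIELD_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{field:20s} {count:>9d} {coverage_percent(count):6.2f}%")
```

Coverage says nothing about correctness: object.pubchem_cid covers more documents than object.inchikey, but the comments above show cases where its values do not resolve.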

colleenXu commented 5 months ago

@everaldorodrigo @newgene @andrewsu

I think this would be a useful issue for @everaldorodrigo to dig into and address as much as possible.

everaldorodrigo commented 1 month ago
  • I've been assuming that the object.pubchem_sid value is a "correct" ID (when the document has this field, which is >98% of documents or 1413131 / 1438909)
  • I'm not sure how many documents are affected
  • I'm not sure if the other chemical (object) fields have similar issues

@colleenXu, in the latest data release (May 2024) deployed to the CI environment, 41678 of 1795611 items don't have the object.pubchem_sid field.

For the items missing object.pubchem_sid, it seems the data source (.tsv file) has no value in the PubChem SID column, which the parser uses to fill object.pubchem_sid.
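A check of this kind can be run directly against the TSV. The sketch below counts rows whose `PubChem SID` column is empty. It assumes a tab-delimited file with a single header line and is demonstrated on a tiny in-memory sample with mostly made-up values rather than the real BindingDB download; `count_missing` is a hypothetical helper, not part of the parser.

```python
import csv
import io


def count_missing(lines, column="PubChem SID"):
    """Return (missing, total) data rows for one column of a tab-delimited file."""
    # QUOTE_NONE treats quote characters inside chemical names as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    idx = header.index(column)  # raises ValueError if the column is absent
    total = missing = 0
    for row in reader:
        total += 1
        # Short rows and whitespace-only cells both count as missing.
        if idx >= len(row) or not row[idx].strip():
            missing += 1
    return missing, total


# Toy sample: three columns only, two data rows, the second without CID/SID.
sample = io.StringIO(
    "BindingDB Reactant_set_id\tPubChem CID\tPubChem SID\n"
    "1\t4585\t85753333\n"
    "1187404\t\t\n"
)
print(count_missing(sample))  # (1, 2)
```

With a real file, the same function could be called on an open file handle, for example `count_missing(open(path, newline="", encoding="utf-8"))`, where `path` points at the downloaded TSV.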

The partial data below is extracted from the data source: the header and one example row that is missing the PubChem SID value. It is shown one column per line, with blank cells where the row has no value. Look for PubChem SID in the table below:

| Column | Value |
| --- | --- |
| BindingDB Reactant_set_id | 1187404 |
| Ligand SMILES | `CS(=O)(=O)N1CCC@Hc(=O)n(C4CCCC4)c3n2)C@@HC1` |
| Ligand InChI | `InChI=1S/C20H27FN6O4S/c1-32(30,31)26-7-6-16(15(21)11-26)24-20-23-10-13-8-12(9-17(22)28)19(29)27(18(13)25-20)14-4-2-3-5-14/h8,10,14-16H,2-7,9,11H2,1H3,(H2,22,28)(H,23,24,25)/t15-,16-/m0/s1` |
| Ligand InChI Key | LGDFLYMYRBSZOR-HOTGVXAUSA-N |
| BindingDB MonomerID | 370273 |
| BindingDB Ligand Name | BDBM467168::US10233188, Example 160 |
| Target Name | Cyclin-dependent kinase 6/G1/S-specific cyclin-D1 [L188C] |
| Target Source Organism According to Curator or DataSource | Homo sapiens |
| Ki (nM) | 3.55 |
| IC50 (nM) | |
| Kd (nM) | |
| EC50 (nM) | |
| kon (M-1-s-1) | |
| koff (s-1) | |
| pH | |
| Temp (C) | |
| Curation/DataSource | US Patent |
| Article DOI | |
| BindingDB Entry DOI | 10.7270/Q2KK9G2V |
| PMID | |
| PubChem AID | aid1806677 |
| Patent Number | US11396512 |
| Authors | Behenna, DC; Chen, P; Freeman-Cook, KD; Hoffman, RL; Jalaie, M; Nagata, A; Nair, SK; Ninkovic, S; Ornelas, MA; Palmer, CL; Rui, EY |
| Institution | Pfizer Inc. |
| Link to Ligand in BindingDB | http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=370273 |
| Link to Target in BindingDB | http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=99&target=Cyclin-dependent+kinase+6%2FG1%2FS-specific+cyclin-D1+%5BL188C%5D&column=ki&startPg=0&Increment=50&submit=Search |
| Link to Ligand-Target Pair in BindingDB | http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=370273&enzyme=Cyclin-dependent+kinase+6%2FG1%2FS-specific+cyclin-D1+%5BL188C%5D&column=ki&startPg=0&Increment=50&submit=Search |
| Ligand HET ID in PDB | |
| PDB ID(s) for Ligand-Target Complex | |
| PubChem CID | |
| PubChem SID | |
| ChEBI ID of Ligand | |
| ChEMBL ID of Ligand | |
| DrugBank ID of Ligand | |
| IUPHAR_GRAC ID of Ligand | |
| KEGG ID of Ligand | |
| ZINC ID of Ligand | |
| Number of Protein Chains in Target (>1 implies a multichain complex) | 2 |

The header then repeats the same group of 12 columns once per target chain. Only the first two groups have values in this row:

| Column | First target chain | Second target chain |
| --- | --- | --- |
| BindingDB Target Chain Sequence | `mehqllccevetirraypdanllndrvlramlkaeetcapsvsyfkcvqkevlpsmrkivatwmlevceeqkceeevfplamnyldrflslepvkksrlqllgatcmfvaskmketipltaeklciytdgsirpeellqmelllvnklkwnlaamtphdfiehflskmpeaeenkqiirkhaqtfvascatdvkfisnppsmvaagsvvaavqglnlrspnnflsyyrltrflsrvikcdpdclracqeqieallesslrqaqqnmdpkaaeeeeeeeeevdlactptdvrdvdi` | `MEKDGLCRADQQYECVAEIGEGAYGKVFKARDLKNGGRFVALKRVRVQTGEEGMPLSTIREVAVLRHLETFEHPNVVRLFDVCTVSRTDRETKLTLVFEHVDQDLTTYLDKVPEPGVPTETIKDMMFQLLRGLDFLHSHRVVHRDLKPQNILVTSSGQIKLADFGLARIYSFQMALTSVVVTLWYRAPEVLLQSSYATPVDLWSVGCIFAEMFRRKPLFRGSSDVDQLGKILDVIGLPGEEDWPRDVALPRQAFHSKSAQPIEKFVTDIDELGKDLLLKCLTFNPAKRISAYSALSHPYFQDLERCKENLDSHLPPSQNTSELNTA` |
| PDB ID(s) of Target Chain | | 1G3N,1JOW,1XO2,2EUF,2F2C,3NUP,3NUX,4AUA,4EZ5,4TTH,5L2I,5L2S,5L2T,6OQL,6OQO |
| UniProt (SwissProt) Recommended Name of Target Chain | G1/S-specific cyclin-D1 | Cyclin-dependent kinase 6 |
| UniProt (SwissProt) Entry Name of Target Chain | CCND1_HUMAN | CDK6_HUMAN |
| UniProt (SwissProt) Primary ID of Target Chain | P24385 | Q00534 |
| UniProt (SwissProt) Secondary ID(s) of Target Chain | Q6LEF0 | A4D1G0 |
| UniProt (SwissProt) Alternative ID(s) of Target Chain | | |
| UniProt (TrEMBL) Submitted Name of Target Chain | | |
| UniProt (TrEMBL) Entry Name of Target Chain | | |
| UniProt (TrEMBL) Primary ID of Target Chain | | |
| UniProt (TrEMBL) Secondary ID(s) of Target Chain | | |
| UniProt (TrEMBL) Alternative ID(s) of Target Chain | | |

The remaining per-chain column groups in the header are all empty in this row.

Do you think we should use another field for the operations instead of object.pubchem_sid?

colleenXu commented 3 weeks ago

Sorry for the late response.

I think it's okay that some rows don't have a pubchem SID. I've been using the INCHIKEY instead (see the earlier post where I explain that I think it covered most of the resource and was somewhat reliable but still had a problem).