Problems with BioThings BindingDB

colleenXu commented 1 month ago

@newgene @andrewsu @everaldorodrigo @rjawesome

It looks like there are a few problems with the current BioThings BindingDB API. It would be helpful to fix these and maybe update the data.

1. Andy Crouse (Translator UI) has found that some relation.bindingdb_link urls now don't work. I wonder if some urls were updated...and maybe using a recent data release would help.
   - his example (Translator Slack link) comes from this record in the API with this relation.bindingdb_link url. I think this relationship still exists - see the bottom row here.
   - but other urls are still working: this record in the API has this working relation.bindingdb_link
2. Problems with incorrect, outdated, or problematic object fields. Perhaps using a recent data release would help, PLUS adjusting the parser. I see that Rohan started some work on adjusting the parser...
3. Not broken, but a nice-to-have-if-possible: adjusting the parser to assign more specific relationships

Note:

- Previous issue for creating the BindingDB API: https://github.com/biothings/pending.api/issues/70
- BindingDB may update fairly frequently, maybe monthly? See https://www.bindingdb.org/rwd/bind/chemsearch/marvin/Download.jsp

everaldorodrigo commented 4 weeks ago

@colleenXu, the latest data was released to the CI environment.

colleenXu commented 2 weeks ago

I'm looking at the current CI responses now...

I think there's a parsing issue with subject.uniprot.secondary_accession. In this document, it looks like the 1-string-element should have been split for each value "B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065". Compare it to the same document in ncats.io.
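
A minimal sketch of the split described above. The helper name `split_accessions` is hypothetical and is not taken from the actual BindingDB parser; it assumes the raw value is either a whitespace-delimited string or a list containing such strings.

```python
def split_accessions(raw):
    """Split whitespace-delimited accession strings into a flat list.

    Hypothetical helper for illustration, not the actual parser code.
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        raw = [raw]
    values = []
    for item in raw:
        for token in item.split():
            # Keep the first occurrence only, preserving order.
            if token not in values:
                values.append(token)
    return values


print(split_accessions(["B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065"]))
```

Dropping repeated tokens is a design choice made here for the sketch; the real parser may want to keep duplicates if the source file has them.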

colleenXu commented 2 weeks ago

Regarding problem 1 (relation.bindingdb_link urls not reaching the actual webpages)...

This seems to be addressed in CI! It looks like enzyme names were updated, which meant the webpage urls also needed to be updated.

- In my opening post, I pointed out this record with this problematic bindingdb_link. And the current CI's corresponding record has a different bindingdb_link that works! The only diff I see in the urls is the enzyme name.
- I found another example of a record with a problematic bindingdb_link. And the current CI's corresponding record has a different bindingdb_link that works!

colleenXu commented 2 weeks ago

Regarding problem 2 (object field values are incorrect/problematic/outdated)...

Some problems were addressed in CI!

- object.chembl: multiple IDs now seem to be correctly split. I checked all previous examples. Note that I still haven't checked how reliable/accurate these IDs are.
- object.name: multiple values now seem to be correctly split

One idea is to double-check how reliable the chembl IDs are and, if they're good, to switch the BTE/x-bte annotation to use them rather than inchikey (current) / pubchem_cid (previous).

However, this would decrease our coverage of this resource to <50% (old breakdown's proportions are still roughly correct).

Some problems still exist. We may have to dig deeper into the data/parser to figure these out...

- object.pubchem_cid: all "incorrect" values are still there (more details in another post)
- object.inchikey: all "incorrect" values are still there.

More investigation into the inchikey examples:

Example 1: [CI has the object.inchikey](https://biothings.ci.transltr.io/bindingdb/query?q=object.inchikey:YQCLAYRIYWYIKH-UHFFFAOYSA-N) `YQCLAYRIYWYIKH-UHFFFAOYSA-N`. But [Translator's NodeNorm doesn't recognize this ID](https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=INCHIKEY:YQCLAYRIYWYIKH-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false) and it maps the object chembl IDs to slightly different inchikeys:

* [CHEMBL341945](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=CHEMBL.COMPOUND:CHEMBL341945&conflate=true&drug_chemical_conflate=true&description=false) to `YQCLAYRIYWYIKH-MKCFTUBBSA-N`
* [CHEMBL106813](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=CHEMBL.COMPOUND:CHEMBL106813&conflate=true&drug_chemical_conflate=true&description=false) to `YQCLAYRIYWYIKH-WGPBWIAQSA-N`

Example 2: [CI has the object.inchikey](https://biothings.ci.transltr.io/bindingdb/query?q=object.inchikey:ZUXABONWMNSFBN-UHFFFAOYSA-N) `ZUXABONWMNSFBN-UHFFFAOYSA-N` for clozapine. But [Translator's NodeNorm treats this inchikey as a different entity `3-chloro-6-(4-methyl-1-piperazinyl)-5H-benzo[b][1,4]benzodiazepine`](https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=INCHIKEY:ZUXABONWMNSFBN-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false). Instead, [NodeNorm](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=INCHIKEY:QZUDBNBUXVUHMW-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false) uses a different inchikey for clozapine: `QZUDBNBUXVUHMW-UHFFFAOYSA-N`
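
To make "slightly different" concrete, here is a rough sketch that compares two InChIKeys block by block. It relies on the standard InChIKey layout: the first 14-character block hashes the connectivity, the second block covers the remaining layers (stereochemistry, isotopes) plus flag/version characters, and the last character encodes protonation. The function name `compare_inchikeys` is made up for this illustration.

```python
def compare_inchikeys(a, b):
    """Report which hyphen-separated InChIKey blocks match.

    Hypothetical helper for illustration only.
    """
    a_conn, a_rest, a_prot = a.split("-")
    b_conn, b_rest, b_prot = b.split("-")
    return {
        "same_connectivity": a_conn == b_conn,   # first 14 characters
        "same_second_block": a_rest == b_rest,   # stereo/isotope layers + flags
        "same_protonation": a_prot == b_prot,    # final character
    }


# Example 1: BindingDB key vs. the key NodeNorm gives for CHEMBL341945.
print(compare_inchikeys("YQCLAYRIYWYIKH-UHFFFAOYSA-N", "YQCLAYRIYWYIKH-MKCFTUBBSA-N"))

# Example 2: BindingDB key vs. the key NodeNorm uses for clozapine.
print(compare_inchikeys("ZUXABONWMNSFBN-UHFFFAOYSA-N", "QZUDBNBUXVUHMW-UHFFFAOYSA-N"))
```

In Example 1 only the second block differs, which would be consistent with the BindingDB key missing stereo information (`UHFFFAOYSA` is what standard keys typically show when no such layers are present). In Example 2 the first block differs, so the two keys describe different connectivity rather than a stereo variant.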

And a note: problem 3 (optional, more specific relationships) hasn't been addressed yet.

everaldorodrigo commented 3 days ago

> I'm looking at the current CI responses now...
>
> I think there's a parsing issue with subject.uniprot.secondary_accession. In this document, it looks like the 1-string-element should have been split for each value "B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065". Compare it to the same document in ncats.io.

Hi @colleenXu,

The field subject.uniprot.secondary_accession now has its values split into separate elements.

It's deployed to the CI environment. Let me know if it is as expected.

colleenXu commented 1 day ago

@everaldorodrigo

subject.uniprot.secondary_accession now looks wrong in a different way.

Sometimes the array's last value is itself an array (maybe a duplication is happening somewhere?). Examples:

newgene commented 1 day ago

good catch @colleenXu !

I also want to mention that this kind of parsing issue can be caught early if we run the inspect step after the data upload. It should warn about any field whose values have mixed data types. @everaldorodrigo
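
For illustration, the kind of mixed-type check described above could be sketched as follows. This is not the BioThings inspect step itself; `get_path` and `mixed_type_fields` are hypothetical helpers, and the sample records are made up to mimic the reported shape (a list nested as the last element).

```python
from collections import Counter


def get_path(doc, dotted):
    """Follow a dotted path such as 'subject.uniprot.secondary_accession'."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def mixed_type_fields(docs, fields):
    """Return the fields whose values were seen with more than one element type."""
    seen = {field: Counter() for field in fields}
    for doc in docs:
        for field in fields:
            value = get_path(doc, field)
            if value is None:
                continue
            # Treat a scalar like a one-element list so both shapes compare equally.
            items = value if isinstance(value, list) else [value]
            seen[field].update(type(item).__name__ for item in items)
    return {f: dict(c) for f, c in seen.items() if len(c) > 1}


# Made-up records: the second one has a list as its last element.
docs = [
    {"subject": {"uniprot": {"secondary_accession": ["B4DYS6", "D3DVV8"]}}},
    {"subject": {"uniprot": {"secondary_accession": ["P19138", ["P19138"]]}}},
]
print(mixed_type_fields(docs, ["subject.uniprot.secondary_accession"]))
```

Any field that shows up in the result has elements of more than one type and would be worth a look before release.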

biothings / pending.api

Problems with BioThings BindingDB #201