biothings / pending.api

Set of standalone APIs built with the BioThings SDK for the Translator Project
https://biothings.ncats.io
Apache License 2.0

Problems with BioThings BindingDB #201

Open colleenXu opened 1 month ago

colleenXu commented 1 month ago

@newgene @andrewsu @everaldorodrigo @rjawesome

It looks like there are a few problems with the current BioThings BindingDB API. It would be helpful to fix these and maybe update the data.

  1. Andy Crouse (Translator UI) has found that some relation.bindingdb_link URLs no longer work. I wonder if some URLs were updated...and maybe using a recent data release would help.
  2. Problems with incorrect, outdated, or problematic object fields. Perhaps using a recent data release would help, PLUS adjusting the parser. I see that Rohan started some work on adjusting the parser...
  3. Not broken, but a nice-to-have-if-possible: adjusting the parser to assign more specific relationships

Note:

everaldorodrigo commented 4 weeks ago

@colleenXu, the latest data was released to the CI environment.

colleenXu commented 2 weeks ago

I'm looking at the current CI responses now...

I think there's a parsing issue with subject.uniprot.secondary_accession. In this document, it looks like the 1-string-element should have been split for each value "B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065". Compare it to the same document in ncats.io.
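As a rough illustration of the split described above, here is a minimal sketch. It assumes the raw value arrives either as one whitespace-delimited string or as a list of such strings; `split_accessions` is a hypothetical helper name, not a function from the actual BindingDB parser.

```python
def split_accessions(value):
    """Return a flat list of accession strings.

    Accepts None, a single whitespace-delimited string, or a list of
    such strings (assumed input shapes, not confirmed from the parser).
    """
    if value is None:
        return []
    if isinstance(value, str):
        value = [value]
    result = []
    for item in value:
        # str.split() with no arguments splits on any run of whitespace
        result.extend(item.split())
    return result


# The one-element list quoted in this comment
raw = ["B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065"]
print(split_accessions(raw))
# ['B4DYS6', 'D3DVV8', 'P19138', 'P20426', 'Q14013', 'Q5U065']
```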

colleenXu commented 2 weeks ago

Regarding problem 1 (relation.bindingdb_link urls not reaching the actual webpages)...

This seems to be addressed in CI! It looks like enzyme names were updated, which meant the webpage urls also needed to be updated.

colleenXu commented 2 weeks ago

Regarding problem 2 (object field values are incorrect/problematic/outdated)...

Some problems were addressed in CI!

One idea is to double-check how reliable the chembl IDs are and, if they're good, to switch the BTE/x-bte annotation to using them rather than inchikey (current) or pubchem_cid (previous).

However, this would decrease our coverage of this resource to <50% (old breakdown's proportions are still roughly correct).


Some problems still exist. We may have to dig deeper into the data/parser to figure these out...

More investigation into the inchikey examples:

Example 1: [CI has the object.inchikey](https://biothings.ci.transltr.io/bindingdb/query?q=object.inchikey:YQCLAYRIYWYIKH-UHFFFAOYSA-N) `YQCLAYRIYWYIKH-UHFFFAOYSA-N`. But [Translator's NodeNorm doesn't recognize this ID](https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=INCHIKEY:YQCLAYRIYWYIKH-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false) and it maps the object chembl IDs to slightly different inchikeys:

* [CHEMBL341945](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=CHEMBL.COMPOUND:CHEMBL341945&conflate=true&drug_chemical_conflate=true&description=false) to `YQCLAYRIYWYIKH-MKCFTUBBSA-N`
* [CHEMBL106813](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=CHEMBL.COMPOUND:CHEMBL106813&conflate=true&drug_chemical_conflate=true&description=false) to `YQCLAYRIYWYIKH-WGPBWIAQSA-N`

Example 2: [CI has the object.inchikey](https://biothings.ci.transltr.io/bindingdb/query?q=object.inchikey:ZUXABONWMNSFBN-UHFFFAOYSA-N) `ZUXABONWMNSFBN-UHFFFAOYSA-N` for clozapine. But [Translator's NodeNorm treats this inchikey as a different entity `3-chloro-6-(4-methyl-1-piperazinyl)-5H-benzo[b][1,4]benzodiazepine`](https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=INCHIKEY:ZUXABONWMNSFBN-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false). Instead, [NodeNorm](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=INCHIKEY:QZUDBNBUXVUHMW-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false) uses a different inchikey for clozapine: `QZUDBNBUXVUHMW-UHFFFAOYSA-N`
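One way to triage mismatches like these is to compare the inchikeys block by block. The sketch below relies on the general InChIKey layout (a first block derived from the molecular skeleton, a second block covering the remaining layers such as stereochemistry, and a final protonation character). `compare_inchikeys` is a hypothetical helper, and its labels are a heuristic reading of the keys, not a diagnosis of what the parser or NodeNorm is doing.

```python
def compare_inchikeys(a, b):
    """Classify how two InChIKeys differ by comparing their three blocks."""
    a_parts, b_parts = a.split("-"), b.split("-")
    if len(a_parts) != 3 or len(b_parts) != 3:
        raise ValueError("expected an InChIKey with three hyphen-separated blocks")
    if a_parts[0] != b_parts[0]:
        # First block differs: the keys describe different skeletons
        return "different connectivity"
    if a_parts[1] != b_parts[1]:
        # Same skeleton, but stereo/isotope (or other) layers differ
        return "same connectivity, different second block"
    if a_parts[2] != b_parts[2]:
        return "same structure, different protonation flag"
    return "identical"


# Example 1: the CI inchikey vs. the NodeNorm inchikey for CHEMBL341945
print(compare_inchikeys("YQCLAYRIYWYIKH-UHFFFAOYSA-N", "YQCLAYRIYWYIKH-MKCFTUBBSA-N"))
# same connectivity, different second block

# Example 2: the CI inchikey vs. the NodeNorm inchikey for clozapine
print(compare_inchikeys("ZUXABONWMNSFBN-UHFFFAOYSA-N", "QZUDBNBUXVUHMW-UHFFFAOYSA-N"))
# different connectivity
```

Read this way, the two examples look like different kinds of mismatch: Example 1 shares the first block across all three keys, while Example 2 does not.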


And a note: problem 3 (optional, more specific relationships) hasn't been addressed yet.

everaldorodrigo commented 3 days ago

> I'm looking at the current CI responses now...
>
> I think there's a parsing issue with subject.uniprot.secondary_accession. In this document, it looks like the 1-string-element should have been split for each value "B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065". Compare it to the same document in ncats.io.

Hi @colleenXu,

The field subject.uniprot.secondary_accession now has its values split into separate array elements.

It's deployed to the CI environment. Let me know if it is as expected.

colleenXu commented 1 day ago

@everaldorodrigo

subject.uniprot.secondary_accession now looks wrong in a different way.

Sometimes the array's last value is itself an array (maybe a duplication happening somewhere?). Examples:
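A minimal sketch of detecting and repairing that shape, separate from the real examples. The sample value is made up to show a flat list whose last element is a nested list, and both helper names are hypothetical. Dropping duplicates is an assumption based on the "duplication" guess above.

```python
def has_nested_list(values):
    """True if any element of the list is itself a list."""
    return any(isinstance(v, list) for v in values)


def flatten_unique(values):
    """Flatten nested lists into one list, keeping the first occurrence of each value."""
    flat = []

    def walk(item):
        if isinstance(item, list):
            for sub in item:
                walk(sub)
        elif item not in flat:
            flat.append(item)

    walk(values)
    return flat


# Made-up value for illustration only, not taken from a CI document
broken = ["B4DYS6", "D3DVV8", ["B4DYS6", "D3DVV8"]]
print(has_nested_list(broken))  # True
print(flatten_unique(broken))   # ['B4DYS6', 'D3DVV8']
```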

newgene commented 1 day ago

good catch @colleenXu !

I also want to mention that this kind of parsing issue can be caught early if we run the inspect step after the data upload. It should flag any field whose values have mixed data types. @everaldorodrigo
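To show the kind of check meant here, below is a small standalone sketch that walks parsed documents and reports fields whose values have more than one type. It is not the BioThings SDK's inspect implementation. `collect_types` and `mixed_type_fields` are hypothetical names, and the sample documents are made up.

```python
from collections import defaultdict


def collect_types(value, path, seen):
    """Record the type name of every leaf value under its dotted field path."""
    if isinstance(value, dict):
        for key, sub in value.items():
            collect_types(sub, f"{path}.{key}" if path else key, seen)
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, dict):
                collect_types(item, path, seen)
            else:
                # A list nested inside a list is recorded as type "list"
                seen[path].add(type(item).__name__)
    else:
        seen[path].add(type(value).__name__)


def mixed_type_fields(docs):
    """Return {field_path: sorted type names} for fields with more than one type."""
    seen = defaultdict(set)
    for doc in docs:
        collect_types(doc, "", seen)
    return {path: sorted(types) for path, types in seen.items() if len(types) > 1}


# Made-up documents: the first has a nested list as its last array value
docs = [
    {"subject": {"uniprot": {"secondary_accession": ["B4DYS6", ["B4DYS6"]]}}},
    {"subject": {"uniprot": {"secondary_accession": ["P19138"]}}},
]
print(mixed_type_fields(docs))
# {'subject.uniprot.secondary_accession': ['list', 'str']}
```

In this sketch a field that is a single string in one document and a list of strings in another is not flagged, since only the element types are compared.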