Open colleenXu opened 1 month ago
@colleenXu, the latest data was released to the CI environment.
I'm looking at the current CI responses now...
I think there's a parsing issue with `subject.uniprot.secondary_accession`. In this document, it looks like the 1-string-element should have been split for each value: `"B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065"`. Compare it to the same document in ncats.io.
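For illustration, here is a minimal sketch of the split I'd expect. The helper name `split_accessions` is hypothetical and is not taken from the actual BindingDB parser; it assumes the raw value is either one whitespace-delimited string or a list of such strings.

```python
def split_accessions(raw):
    """Split whitespace-delimited accession strings into a flat list.

    Hypothetical helper for illustration, not the actual parser code.
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        raw = [raw]
    result = []
    for item in raw:
        for token in item.split():
            # Keep the first occurrence only, preserving order.
            if token not in result:
                result.append(token)
    return result


# The one-element array seen in the CI document.
ci_value = ["B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065"]
print(split_accessions(ci_value))
```

The output should be a six-element array of strings, one per accession.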
Regarding problem 1 (`relation.bindingdb_link` urls not reaching the actual webpages)...
This seems to be addressed in CI! It looks like enzyme names were updated, which meant the webpage urls also needed to be updated.
Regarding problem 2 (`object` field values are incorrect/problematic/outdated)...
Some problems were addressed in CI!
* `object.chembl`: multiple IDs now seem to be correctly split. I checked all previous examples. Note that I still haven't checked how reliable/accurate these IDs are.
* `object.name`: multiple values now seem to be correctly split.

One idea is to double-check how reliable the chembl IDs are and, if they're good, to switch the BTE/x-bte annotation to using them rather than inchikey (current)/pubchem_cid (previous).
However, this would decrease our coverage of this resource to <50% (old breakdown's proportions are still roughly correct).
Some problems still exist. We may have to dig deeper into the data/parser to figure these out...
* `object.pubchem_cid`: all "incorrect" values are still there (more details in another post)
* `object.inchikey`: all "incorrect" values are still there.

Example 1: [CI has the object.inchikey](https://biothings.ci.transltr.io/bindingdb/query?q=object.inchikey:YQCLAYRIYWYIKH-UHFFFAOYSA-N) `YQCLAYRIYWYIKH-UHFFFAOYSA-N`. But [Translator's NodeNorm doesn't recognize this ID](https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=INCHIKEY:YQCLAYRIYWYIKH-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false) and it maps the object chembl IDs to slightly different inchikeys:

* [CHEMBL341945](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=CHEMBL.COMPOUND:CHEMBL341945&conflate=true&drug_chemical_conflate=true&description=false) to `YQCLAYRIYWYIKH-MKCFTUBBSA-N`
* [CHEMBL106813](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=CHEMBL.COMPOUND:CHEMBL106813&conflate=true&drug_chemical_conflate=true&description=false) to `YQCLAYRIYWYIKH-WGPBWIAQSA-N`

Example 2: [CI has the object.inchikey](https://biothings.ci.transltr.io/bindingdb/query?q=object.inchikey:ZUXABONWMNSFBN-UHFFFAOYSA-N) `ZUXABONWMNSFBN-UHFFFAOYSA-N` for clozapine. But [Translator's NodeNorm treats this inchikey as a different entity `3-chloro-6-(4-methyl-1-piperazinyl)-5H-benzo[b][1,4]benzodiazepine`](https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=INCHIKEY:ZUXABONWMNSFBN-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false). Instead, [NodeNorm](https://nodenorm.ci.transltr.io/get_normalized_nodes?curie=INCHIKEY:QZUDBNBUXVUHMW-UHFFFAOYSA-N&conflate=true&drug_chemical_conflate=true&description=false) uses a different inchikey for clozapine: `QZUDBNBUXVUHMW-UHFFFAOYSA-N`
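To make the two kinds of mismatch easier to tell apart, here is a rough sketch that compares InChIKeys block by block. It relies on the standard InChIKey layout as I understand it: a 14-character first block hashed from the connectivity (skeleton), a 10-character second block covering the remaining layers such as stereochemistry, and a one-character protonation flag. The function names are hypothetical, and this only compares strings; it does not query NodeNorm.

```python
def inchikey_blocks(inchikey):
    """Split an InChIKey into (first block, second block, protonation flag)."""
    parts = inchikey.strip().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return tuple(parts)


def compare_inchikeys(a, b):
    """Classify how two InChIKeys differ (hypothetical helper)."""
    first_a, _, _ = inchikey_blocks(a)
    first_b, _, _ = inchikey_blocks(b)
    if a.strip() == b.strip():
        return "identical"
    if first_a == first_b:
        # Same connectivity hash; the keys differ in other layers
        # (for example stereochemistry).
        return "same skeleton"
    return "different skeleton"


# Example 1: CI value vs. the key NodeNorm returns for CHEMBL341945.
print(compare_inchikeys("YQCLAYRIYWYIKH-UHFFFAOYSA-N", "YQCLAYRIYWYIKH-MKCFTUBBSA-N"))
# Example 2: CI value vs. the key NodeNorm uses for clozapine.
print(compare_inchikeys("ZUXABONWMNSFBN-UHFFFAOYSA-N", "QZUDBNBUXVUHMW-UHFFFAOYSA-N"))
```

By this check, Example 1 is a "same skeleton" mismatch and Example 2 is a "different skeleton" mismatch, so the two cases may have different causes in the data or parser.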
And a note: problem 3 (optional, more specific relationships) hasn't been addressed yet.
> I'm looking at the current CI responses now...
>
> I think there's a parsing issue with `subject.uniprot.secondary_accession`. In this document, it looks like the 1-string-element should have been split for each value: `"B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065"`. Compare it to the same document in ncats.io.
Hi @colleenXu,
The field `subject.uniprot.secondary_accession` now has its values split into separate elements.
It's deployed to the CI environment. Let me know if it is as expected.
@everaldorodrigo
`subject.uniprot.secondary_accession` now looks wrong in a different way.
Sometimes the array's last value is an array (a duplication happening somewhere)? Examples:
good catch @colleenXu !
Also want to mention that this kind of parsing issue can be identified at an early stage if we run the `inspect` step after the data upload. It should warn about a field if its values have mixed data types. @everaldorodrigo
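As a rough illustration of the kind of check meant here, the sketch below walks a batch of documents and reports dotted field paths whose values mix Python types. It is a standalone approximation and not the BioThings `inspect` implementation; the function names are hypothetical, and the sample documents are made up to mimic the "last value is an array" shape described above.

```python
from collections import defaultdict


def collect_types(value, path="", seen=None):
    """Record the type names observed at each dotted field path."""
    if seen is None:
        seen = defaultdict(set)
    if isinstance(value, dict):
        for key, child in value.items():
            collect_types(child, f"{path}.{key}" if path else key, seen)
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, dict):
                collect_types(item, path, seen)
            else:
                # A nested list is recorded as "list", so a mix of
                # strings and arrays inside one array gets flagged.
                seen[path].add(type(item).__name__)
    else:
        seen[path].add(type(value).__name__)
    return seen


def mixed_type_fields(docs):
    """Return {field path: sorted type names} for fields with mixed types."""
    seen = defaultdict(set)
    for doc in docs:
        collect_types(doc, seen=seen)
    return {path: sorted(types) for path, types in seen.items() if len(types) > 1}


# Made-up documents for illustration only.
docs = [
    {"subject": {"uniprot": {"secondary_accession": ["B4DYS6", "D3DVV8", ["B4DYS6", "D3DVV8"]]}}},
    {"subject": {"uniprot": {"secondary_accession": "P19138"}}},
]
print(mixed_type_fields(docs))
```

A field that holds a single string in one document and an array of strings in another is not flagged, since both contribute only `str`.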
@newgene @andrewsu @everaldorodrigo @rjawesome
It looks like there are a few problems with the current BioThings Binding DB API, and it would be helpful to fix these and maybe update the data.
`relation.bindingdb_link` urls now don't work. I wonder if some urls were updated...and maybe using a recent data release would help.
`relation.bindingdb_link` url. I think this relationship still exists - see the bottom row here.
Note: