biothings / pending.api

Set of standalone APIs built with the BioThings SDK for the Translator Project
https://biothings.ncats.io
Apache License 2.0

BioThings repoDB parser changes #169

Closed colleenXu closed 1 month ago

colleenXu commented 5 months ago

PRIORITY: medium. It'd be useful to have for the upcoming biolink-model refactor ("treats"). Higher in priority than #170

While writing the SmartAPI yaml w/ x-bte annotation for BioThings repoDB, I noticed some issues.

After discussion with Andrew yesterday, we agreed that these changes should be made:

  1. Changing the parser to create association-centric data (unique combos of drug-disease-status) rather than the current drug-centric data would be helpful, particularly for the upcoming "treats" refactor. I wrote more about the problems with the current data structure in the linked issue https://github.com/biothings/pending.api/issues/77#issuecomment-1867336885

Mockup of what association-centric data may look like:

[Right now, there's 1 record for the drug Rituximab](https://biothings.ncats.io/repodb/query?q=repodb.drugbank:DB00073). It'd be transformed into multiple records, 1 for each combo of rituximab + unique disease + unique status. So for rituximab + "Lymphoma, Non-Hodgkin" `C0024305`, there'd be 3 records (3 diff statuses). I didn't include all the info for the "Terminated" record since there's currently 18 objects/clinical-trials in the data.

```
[
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Approved"
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Terminated",
    "clinical_trial_info": [
      {
        "NCT": "NCT00057343",
        "phase": "Phase 3"
      },
      {
        "NCT": "NCT00057447",
        "detailed_status": "administrative reasons",
        "phase": "Phase 1/Phase 2"
      },
      ....
    ]
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Withdrawn",
    "clinical_trial_info": [
      {
        "NCT": "NCT02408042",
        "phase": "Phase 1/Phase 2"
      }
    ]
  }
]
```
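For illustration only, the regrouping described in this mockup could be sketched roughly as below. This is not the actual parser code. The output field names follow the mockup; the input layout is an assumption (a `repodb` object with `drugbank`, a hypothetical `name`, and an `indications` list whose entries carry hypothetical `umls`, `name`, and `status` keys next to `NCT`, `phase`, and `detailed_status`). The sketch also assumes "NA" values are dropped, as the mockup omits them.

```python
def to_association_records(drug_doc):
    """Regroup one drug-centric document into drug-disease-status records.

    Input field names other than drugbank / NCT / phase / detailed_status
    are assumptions made for this sketch, not the real repoDB schema.
    """
    repodb = drug_doc["repodb"]
    grouped = {}  # (indication UMLS, status) -> association record
    for ind in repodb.get("indications", []):
        key = (ind["umls"], ind["status"])
        record = grouped.setdefault(key, {
            "drug_drugbank_id": repodb["drugbank"],
            "drug_name": repodb["name"],
            "indication_umls": ind["umls"],
            "indication_name": ind["name"],
            "status": ind["status"],
        })
        # Keep only trial fields that carry a real value (assumes "NA" = absent).
        trial = {
            field: ind[field]
            for field in ("NCT", "detailed_status", "phase")
            if ind.get(field) not in (None, "NA")
        }
        if trial:
            record.setdefault("clinical_trial_info", []).append(trial)
    # Dicts preserve insertion order, so records come out in first-seen order.
    return list(grouped.values())
```

Under these assumptions, a drug with several trials for the same disease and status ends up as one record whose `clinical_trial_info` list holds one entry per trial, and an entry with no trial data (such as "Approved") gets no `clinical_trial_info` key at all.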

  1. Figure out what the field value "NA" means. If it basically means "not available/applicable", I'd find it helpful if the parser removed the fields with "NA" values. That way BTE would be able to use this field without post-processing to remove "NA".
"NA" is a common value for these fields

* [repodb.indications.NCT](https://biothings.ncats.io/repodb/query?q=repodb.indications.NCT:NA), but the non-"NA" info could be useful publication ref info for BTE
* [repodb.indications.phase](https://biothings.ncats.io/repodb/query?q=repodb.indications.phase:NA): BTE may need to use this info in the future as part of the treats-refactor
* [repodb.indications.detailed_status](https://biothings.ncats.io/repodb/query?q=repodb.indications.detailed_status:NA)
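If "NA" does turn out to mean "not available/applicable", removing it at parse time could look something like the sketch below. The helper name `drop_na` is hypothetical and this is not the deployed parser code; it only shows the idea of dropping "NA"-valued fields before documents are indexed.

```python
def drop_na(value, na="NA"):
    """Return a copy of value with every "NA" field or list item removed.

    Assumes "NA" is a pure placeholder; that meaning is still the open
    question raised in this issue.
    """
    if isinstance(value, dict):
        # Skip keys whose value is the placeholder, recurse into the rest.
        return {k: drop_na(v, na) for k, v in value.items() if v != na}
    if isinstance(value, list):
        return [drop_na(v, na) for v in value if v != na]
    return value


# Example: one indication entry with a real phase but no NCT or detailed status.
print(drop_na({"NCT": "NA", "phase": "Phase 3", "detailed_status": "NA"}))
```

With fields removed this way, a query for documents where a field exists would only return entries with real values, so BTE would not need its own "NA" post-processing.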

  1. Double-check whether this API is using the latest data from repoDB (v2.1, 2023-06-15, listed in the version history section of the repoDB website). Based on the metadata endpoint, it might be using the latest data, but the original development and deployment happened in 2022, before that data release.

colleenXu commented 1 month ago

I think this issue has been addressed, so I'm closing it. I noted that all instances of the APIs were updated here. There were also detailed discussions in the lab Slack (one thread here that ended with all changes agreed on and deployed to CI).

@everaldorodrigo I suggest adding links to the PRs/code changes related to this issue.