Closed andrewsu closed 3 years ago
The first step would be to create a data parser, following the instructions here: https://docs.biothings.io/en/latest/doc/studio.html#part-1-single-datasource. In this case, the primary key for the document should be a concept identifier for a dietary ingredient, and the document should contain all the relationships stated in the database.
For example, here is a set of lines from MRREL.RRF, which contains these relationships:
DR1378627|DC0477352|is_effective_for|DC0478476|NMCD
DR1378629|DC0477352|has_adverse_effect_on|DC0478449|NMCD
DR1378613|DC0477352|is_effective_for|DC0478466|NMCD
This snippet declares that DC0477352 has three relationships: it is_effective_for DC0478476, it has_adverse_effect_on DC0478449, and it is_effective_for DC0478466. All of those facts were originally sourced from the "NMCD" database. Those concept identifiers can be resolved in the MRCONSO.RRF file using either UMLS or MEDDRA identifiers:
$ egrep 'DC0477352|DC0478476|DC0478466|DC0478449' MRCONSO.RRF | egrep 'UMLS|MEDDRA'
DC0477352|DA0137592|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y
DC0477352|DA0137591|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y
DC0478449|DA0170196|Nervous system disorders|SY|MEDDRA|10029205|Y
DC0478466|DA0170257|Obesity, NOS|SY|UMLS|C0028754|Y
DC0478476|DA0170285|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y
DC0478476|DA0170283|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y
So the dietary ingredient 7-oxodehydroepiandrosterone (UMLS: C0525091) is related to RAYNAUD PHENOMENON (UMLS: C0034735) because 7-oxodehydroepiandrosterone is_effective_for (treating) RAYNAUD PHENOMENON. Similarly, that ingredient has_adverse_effect_on Nervous system disorders (MEDDRA: 10029205), and it is_effective_for Obesity, NOS (UMLS: C0028754).
So from this set of records, the parser should create this document:
{
"UMLS:C0525091": {
"name": "7-oxodehydroepiandrosterone",
"umls": "UMLS:C0525091",
"is_effective_for": [
{
"name": "RAYNAUD PHENOMENON",
"umls": "UMLS:C0034735",
"source": [
{
"name": "iDISK",
"record": "DR1378627"
},
{
"name": "NMCD"
}
]
},
{
"name": "Obesity, NOS",
"umls": "UMLS:C0028754",
"source": [
{
"name": "iDISK",
"record": "DR1378613"
},
{
"name": "NMCD"
}
]
}
],
"has_adverse_effect_on": [
{
"name": "Nervous system disorders",
"meddra": "MEDDRA:10029205",
"source": [
{
"name": "iDISK",
"record": "DR1378629"
},
{
"name": "NMCD"
}
]
}
]
}
}
The details may change a bit later, but that basic structure should hold. Once the parser is done, we should be able to turn it over to the BioThings SDK to complete the creation of a BioThings API.
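A minimal sketch of the grouping step described above, using only the sample lines quoted in this comment. The column positions (record ID, subject, predicate, object, source for MRREL.RRF; concept ID, atom ID, name, term type, vocabulary, code for MRCONSO.RRF) are inferred from that snippet and are an assumption about the full files. The function names `load_concepts` and `build_documents` are hypothetical, and a real BioThings parser would read the files from a data folder rather than from in-memory lists.

```python
# Sample lines copied from the issue text; column meanings are inferred from them.
MRREL_LINES = [
    "DR1378627|DC0477352|is_effective_for|DC0478476|NMCD",
    "DR1378629|DC0477352|has_adverse_effect_on|DC0478449|NMCD",
    "DR1378613|DC0477352|is_effective_for|DC0478466|NMCD",
]
MRCONSO_LINES = [
    "DC0477352|DA0137592|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y",
    "DC0477352|DA0137591|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y",
    "DC0478449|DA0170196|Nervous system disorders|SY|MEDDRA|10029205|Y",
    "DC0478466|DA0170257|Obesity, NOS|SY|UMLS|C0028754|Y",
    "DC0478476|DA0170285|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y",
    "DC0478476|DA0170283|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y",
]


def load_concepts(conso_lines):
    """Map each iDISK concept ID to its first UMLS or MEDDRA name and code."""
    concepts = {}
    for line in conso_lines:
        cui, _atom, name, _term_type, vocab, code = line.split("|")[:6]
        if vocab in ("UMLS", "MEDDRA") and cui not in concepts:
            concepts[cui] = {"name": name, vocab.lower(): f"{vocab}:{code}"}
    return concepts


def build_documents(rel_lines, conso_lines):
    """Group relationships by subject, keyed by the subject's external ID."""
    concepts = load_concepts(conso_lines)
    docs = {}
    for line in rel_lines:
        record, cui1, predicate, cui2, source = line.split("|")[:5]
        if cui1 not in concepts or cui2 not in concepts:
            continue  # skip concepts with no UMLS/MEDDRA mapping
        subject = concepts[cui1]
        key = subject.get("umls") or subject.get("meddra")
        doc = docs.setdefault(key, dict(subject))
        obj = dict(concepts[cui2])
        obj["source"] = [{"name": "iDISK", "record": record}, {"name": source}]
        doc.setdefault(predicate, []).append(obj)
    return docs


docs = build_documents(MRREL_LINES, MRCONSO_LINES)
```

With the six MRCONSO lines and three MRREL lines above, `docs` holds a single entry keyed by `UMLS:C0525091`, with two objects under `is_effective_for` and one under `has_adverse_effect_on`.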
I can find some records that are related to chemical compounds. @andrewsu, below are some questions that need your help. Thanks.
While looking into this, I went down a bit of a rabbit hole looking for downloadable files for the non-MeSH, non-UMLS databases. The only one I have found so far is this one: https://www.canada.ca/en/health-canada/services/drugs-health-products/natural-non-prescription/applications-submissions/product-licensing/licensed-natural-health-product-database-data-extract.html
Please see the example below. If one record has many InChIKeys, we will have many copies of it, each with a different InChIKey (subject level). At the object level, I will put the keys in a list (or a string), the same as for the PubChem CIDs. For 3, I can include the records that do not have an InChIKey at the object level. In addition, I will try different approaches to include more records.
{
"_id": "HCHKCACWOHOZIP-UHFFFAOYSA-N",
"name": "Zinc",
"inchikey": "HCHKCACWOHOZIP-UHFFFAOYSA-N",
"pubchem_cid": 23994,
"has_ingredient": [
{
"name": "Zinc Pyrithione",
"inchikey": [
"OTPSWLRZXRHDNX-UHFFFAOYSA-L",
"VOHCMATXIKWIKC-UHFFFAOYSA-N",
"OTPSWLRZXRHDNX-UHFFFAOYSA-L",
"PDVKVYBCOWRIGI-UHFFFAOYSA-M"
],
"nmcd": "NMCD:982",
"source": [
{
"name": "iDisk",
"record": "DR0671969"
},
{
"name": "NHPID"
}
],
"pubchem_cid": [
26041,
56836326,
415267,
129627951
]
},
{
"name": "Zinc Chloride",
"inchikey": "JIAARYAFYJHUJI-UHFFFAOYSA-L",
"nmcd": "NMCD:982",
"source": [
{
"name": "iDisk",
"record": "DR0671969"
},
{
"name": "NHPID"
}
],
"pubchem_cid": 5727
}
]
}
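The subject-level duplication described above could be sketched as follows. This is only an illustration: `expand_by_inchikey` is a hypothetical helper, the keys in the usage example are placeholders, and the sketch leaves `pubchem_cid` untouched because the thread does not say how CIDs line up with individual InChIKeys.

```python
import copy


def expand_by_inchikey(record):
    """Yield one document per subject InChIKey, each with its own _id."""
    keys = record["inchikey"]
    if isinstance(keys, str):
        keys = [keys]  # a single key is stored as a plain string
    for key in dict.fromkeys(keys):  # drop repeated keys, keep first-seen order
        doc = copy.deepcopy(record)  # copies must not share nested lists
        doc["_id"] = key
        doc["inchikey"] = key
        yield doc


# Placeholder keys, not real InChIKeys.
record = {
    "name": "example subject",
    "inchikey": ["KEY-A", "KEY-B", "KEY-A"],
    "has_ingredient": [{"name": "example object", "inchikey": ["KEY-C", "KEY-D"]}],
}
docs = list(expand_by_inchikey(record))
```

Object-level keys stay as a list inside each copy, so only the subject is split across documents.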
@andrewsu The predicate summary for the relation file (MRREL):
a1. has_ingredient 689,297
a2. has_adverse_effect_on 3,120
a3. interacts_with 3,057
a4. has_therapeutic_class 5,443
a5. is_effective_for 5,154
a6. has_adverse_reaction 2,093
I believe we already addressed a2-a6 in our original parser. For a1, 98% of the CUI1s (subjects) are brand names or human names, so there is no good approach to processing the text of these records (Reference).
However, if you want to include all records of a1, we can use an inverse approach:
CUI1 - has_ingredient - CUI2 -> CUI2 - ingredient_of - CUI1
This can build records that link all CUI1s, since all CUI2s have the ID types we want (Reference).
Otherwise, I can only process 2% of the CUI1s in a1 using the original approach.
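A rough sketch of the inverse approach, assuming has_ingredient rows use the same five-column layout as the other MRREL.RRF lines quoted earlier in this thread. The function name `invert_has_ingredient` and the sample rows are hypothetical and are not taken from the iDISK files.

```python
from collections import defaultdict


def invert_has_ingredient(rel_lines):
    """Regroup 'CUI1 has_ingredient CUI2' rows as CUI2 -> products it is an ingredient_of."""
    ingredient_of = defaultdict(list)
    for line in rel_lines:
        record, cui1, predicate, cui2, source = line.split("|")[:5]
        if predicate != "has_ingredient":
            continue  # other predicates are handled by the original parser
        ingredient_of[cui2].append({"cui": cui1, "record": record, "source": source})
    return dict(ingredient_of)


# Hypothetical rows for illustration only.
rows = [
    "DR0000001|DC0000010|has_ingredient|DC0000020|NHPID",
    "DR0000002|DC0000011|has_ingredient|DC0000020|NHPID",
    "DR0000003|DC0000010|is_effective_for|DC0000030|NMCD",
]
inverted = invert_has_ingredient(rows)
```

Keying on CUI2 means every document starts from a concept that can be resolved to an external ID, while the unresolvable CUI1 product names only appear as objects.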
To summarize a Slack discussion:
- has_ingredient edges are not a priority right now
- CUI1s, and those edges are indexed in the 919 documents available in https://biothings.ncats.io/idisk
So let's consider the data loading part of this issue done. Reassigning to @colleenXu to create a SmartAPI mapping file and registry record so it can be queried by BTE.
@r76941156 Could the API be adjusted to remove the ID prefix from fields like umls, has_adverse_effect_on.meddra?
Having the prefix causes some issues with BTE batch-querying and processing results, and we already know the ID's prefix/namespace because it's the field name...
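For illustration, the requested change could look roughly like the sketch below. It is not the code from the plugin; `strip_id_prefixes` and the `ID_FIELDS` set are hypothetical, and the sketch only handles ID fields whose values are plain strings.

```python
ID_FIELDS = {"umls", "meddra"}  # assumption: the fields that carry a namespace prefix


def strip_id_prefixes(node):
    """Return a copy of node with 'PREFIX:' removed from string-valued ID fields."""
    if isinstance(node, list):
        return [strip_id_prefixes(item) for item in node]
    if isinstance(node, dict):
        cleaned = {}
        for field, value in node.items():
            if field in ID_FIELDS and isinstance(value, str):
                # keep everything after the first colon; unprefixed values pass through
                cleaned[field] = value.split(":", 1)[-1]
            else:
                cleaned[field] = strip_id_prefixes(value)
        return cleaned
    return node


doc = {
    "name": "7-oxodehydroepiandrosterone",
    "umls": "UMLS:C0525091",
    "has_adverse_effect_on": [{"name": "Nervous system disorders", "meddra": "MEDDRA:10029205"}],
}
cleaned = strip_id_prefixes(doc)
```

Restricting the rewrite to named ID fields keeps free-text values such as `name` intact even when they contain a colon.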
@colleenXu I am fine with that as long as @andrewsu approves. Thanks.
Yes please, thanks @r76941156!
@andrewsu @colleenXu I already finished my part. @newgene @erikyao may deploy this to PRD next week since the pending hub is doing a db migration.
Thanks! Commit for reference: https://github.com/r76941156/iDisk/commit/bd31518193bb991b8d58900ba98ff0f71b97fb99
Hi guys, I ran into a problem in our new biothings.api when releasing the new iDisk plugin. I have already reported it to @newgene and will keep you updated.
sorry, must have accidentally clicked the wrong button -- reopening pending deployment of the latest changes and confirmation that it can be queried via BTE...
Our recent upgrade to ElasticSearch v7.0 and Biothings API v0.10 has caused some minor technical issues for us. E.g.
@namespacestd0 and I will keep you updated on the progress of the fix.
Updated! Please check https://pending.biothings.io/idisk
looks good to me! @colleenXu, please confirm the modified structure works with your smartAPI annotations...
Given the emphasis on rare diseases in Translator, and given that rare diseases often have a metabolic origin, and given that the bar for trying an off-label treatment for a pharmaceutical compound in a human patient is relatively high, it would be useful to have an API resource that links dietary supplements to other biomedical entities. The Integrated Dietary Supplement Knowledge Base (iDISK) appears to be an excellent resource, and it would be very useful for Translator to create a new API for this.
data: https://conservancy.umn.edu/handle/11299/204783 (in a UMLS-like format, or in a neo4j dump)
publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7075538/