biothings / pending.api

Set of standalone APIs built with the BioThings SDK for the Translator Project
https://biothings.ncats.io
Apache License 2.0

New API for Integrated Dietary Supplement Knowledge Base (iDISK) #22

Closed andrewsu closed 3 years ago

andrewsu commented 3 years ago

Given the emphasis on rare diseases in Translator, and given that rare diseases often have a metabolic origin, and given that the bar for trying an off-label treatment for a pharmaceutical compound in a human patient is relatively high, it would be useful to have an API resource that links dietary supplements to other biomedical entities. The Integrated Dietary Supplement Knowledge Base (iDISK) appears to be an excellent resource, and it would be very useful for Translator to create a new API for this.

iDISK encompasses a terminology of 4208 DS ingredient concepts, which are linked via 6 relationship types to 495 drugs, 776 diseases, 985 symptoms, 605 therapeutic classes, 17 system organ classes, and 137,568 DS products. iDISK also contains 7 concept attribute types and 3 relationship attribute types. Evaluation of the data extraction and integration process showed average errors of 0.3%, 2.6%, and 0.4% for concepts, relationships and attributes, respectively.

data: https://conservancy.umn.edu/handle/11299/204783 (in a UMLS-like format, or in a neo4j dump)
publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7075538/

andrewsu commented 3 years ago

The first step would be to create a data parser, following the instructions here: https://docs.biothings.io/en/latest/doc/studio.html#part-1-single-datasource. In this case, the primary key for the document should be a concept identifier for a dietary ingredient, and the document should contain all the relationships stated in the database.

For example, here is a set of lines from MRREL.RRF, which contains these relationships:

DR1378627|DC0477352|is_effective_for|DC0478476|NMCD
DR1378629|DC0477352|has_adverse_effect_on|DC0478449|NMCD
DR1378613|DC0477352|is_effective_for|DC0478466|NMCD

This snippet declares that DC0477352 has three relationships: it is_effective_for DC0478476, it has_adverse_effect_on DC0478449, and it is_effective_for DC0478466. All of those facts were originally sourced from the "NMCD" database. Those concept identifiers can be resolved in the MRCONSO.RRF file using either UMLS or MEDDRA identifiers:

$ egrep 'DC0477352|DC0478476|DC0478466|DC0478449' MRCONSO.RRF | egrep 'UMLS|MEDDRA'
DC0477352|DA0137592|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y
DC0477352|DA0137591|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y
DC0478449|DA0170196|Nervous system disorders|SY|MEDDRA|10029205|Y
DC0478466|DA0170257|Obesity, NOS|SY|UMLS|C0028754|Y
DC0478476|DA0170285|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y
DC0478476|DA0170283|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y

So the dietary ingredient 7-oxodehydroepiandrosterone (UMLS: C0525091) is related to RAYNAUD PHENOMENON (UMLS: C0034735) because 7-oxodehydroepiandrosterone is_effective_for (treating) RAYNAUD PHENOMENON. Similarly, that ingredient has_adverse_effect_on Nervous system disorders (MEDDRA: 10029205), and it is_effective_for Obesity, NOS (UMLS: C0028754).

So from this set of records, the parser should create this document:

{
  "UMLS:C0525091": {
    "name": "7-oxodehydroepiandrosterone",
    "umls": "UMLS:C0525091",
    "is_effective_for": [
      {
        "name": "RAYNAUD PHENOMENON",
        "umls": "UMLS:C0034735",
        "source": [
          {
            "name": "iDISK",
            "record": "DR1378627"
          },
          {
            "name": "NMCD"
          }
        ]
      },
      {
        "name": "Obesity, NOS",
        "umls": "UMLS:C0028754",
        "source": [
          {
            "name": "iDISK",
            "record": "DR1378613"
          },
          {
            "name": "NMCD"
          }
        ]
      }
    ],
    "has_adverse_effect_on": [
      {
        "name": "Nervous system disorders",
        "meddra": "MEDDRA:10029205",
        "source": [
          {
            "name": "iDISK",
            "record": "DR1378629"
          },
          {
            "name": "NMCD"
          }
        ]
      }
    ]
  }
}

The details may change a bit later, but that basic structure should hold. Once the parser is done, we should be able to turn it over to the BioThings SDK to complete the creation of a BioThings API.
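
As a rough illustration of the parsing step described above, here is a minimal Python sketch. It assumes the pipe-delimited layouts seen in the excerpts (five columns in MRREL.RRF, seven in MRCONSO.RRF); the real files may carry additional columns. The function names `load_concepts` and `build_documents` are hypothetical and are not part of the BioThings SDK, and this is not necessarily how the final parser works.

```python
# Sample rows copied from the MRREL.RRF and MRCONSO.RRF excerpts above.
MRREL = """\
DR1378627|DC0477352|is_effective_for|DC0478476|NMCD
DR1378629|DC0477352|has_adverse_effect_on|DC0478449|NMCD
DR1378613|DC0477352|is_effective_for|DC0478466|NMCD
"""

MRCONSO = """\
DC0477352|DA0137592|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y
DC0477352|DA0137591|7-oxodehydroepiandrosterone|SY|UMLS|C0525091|Y
DC0478449|DA0170196|Nervous system disorders|SY|MEDDRA|10029205|Y
DC0478466|DA0170257|Obesity, NOS|SY|UMLS|C0028754|Y
DC0478476|DA0170285|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y
DC0478476|DA0170283|RAYNAUD PHENOMENON|SY|UMLS|C0034735|Y
"""


def load_concepts(mrconso_text):
    """Map each iDISK concept ID to its name and first UMLS/MEDDRA identifier."""
    concepts = {}
    for line in mrconso_text.splitlines():
        # Assumed column order, taken from the excerpt only.
        cui, _atom, name, _term_type, source, code, _flag = line.split("|")
        if source in ("UMLS", "MEDDRA"):
            # Keep the first matching row per concept; later duplicates are ignored.
            concepts.setdefault(
                cui, {"name": name, source.lower(): f"{source}:{code}"}
            )
    return concepts


def build_documents(mrrel_text, mrconso_text):
    """Group relationships by subject, keyed by the subject's resolved identifier."""
    concepts = load_concepts(mrconso_text)
    docs = {}
    for line in mrrel_text.splitlines():
        record, subj, predicate, obj, origin = line.split("|")
        if subj not in concepts or obj not in concepts:
            continue  # skip rows whose concepts cannot be resolved
        subject = concepts[subj]
        key = subject.get("umls") or subject.get("meddra")
        doc = docs.setdefault(key, dict(subject))
        edge = dict(concepts[obj])
        edge["source"] = [{"name": "iDISK", "record": record}, {"name": origin}]
        doc.setdefault(predicate, []).append(edge)
    return docs


documents = build_documents(MRREL, MRCONSO)
```

Here `documents` has a single key, `"UMLS:C0525091"`, whose value carries the `is_effective_for` and `has_adverse_effect_on` lists in the same shape as the example document.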

r76941156 commented 3 years ago

I can find some records that are related to chemical compounds. @andrewsu Below are some questions that I need your help with. Thanks.

  1. Output format
  2. Some records have many InChIKeys (e.g., Vitamin B12). Do you want to create multiple copies if you use the InChIKey as _id?
  3. The object in some records may not have an InChIKey (none was found in PubChem or MyChem). Do you still want to keep them? Thanks.

andrewsu commented 3 years ago
  1. What do you mean by output format? something different than the example document in my second comment above?
  2. I think one record per inchi key sounds right.
  3. Given that this is a database of dietary supplements (including many extracts, natural products, etc.) it's not surprising that many will not have an inchikey (or even a defined chemical structure). Am I remembering correctly that the majority of idisk records fall into this camp? If yes, we will need some way to refer to them.

While looking into this, I went into a little bit of a rabbit hole looking for downloadable files for the non-MeSH, non-UMLS databases. Only one I found so far was this one https://www.canada.ca/en/health-canada/services/drugs-health-products/natural-non-prescription/applications-submissions/product-licensing/licensed-natural-health-product-database-data-extract.html

r76941156 commented 3 years ago

Please see the example below. If one record has many InChIKeys, we will create one copy per InChIKey (subject level). At the object level, I will put the keys in a list (or a string if there is only one), the same as for PubChem CIDs. For question 3, I can include the records that do not have an InChIKey at the object level. In addition, I will try different approaches to include more records.

  {
    "_id": "HCHKCACWOHOZIP-UHFFFAOYSA-N",
    "name": "Zinc",
    "inchikey": "HCHKCACWOHOZIP-UHFFFAOYSA-N",
    "pubchem_cid": 23994,
    "has_ingredient": [
        {
            "name": "Zinc Pyrithione",
            "inchikey": [
                "OTPSWLRZXRHDNX-UHFFFAOYSA-L",
                "VOHCMATXIKWIKC-UHFFFAOYSA-N",
                "OTPSWLRZXRHDNX-UHFFFAOYSA-L",
                "PDVKVYBCOWRIGI-UHFFFAOYSA-M"
            ],
            "nmcd": "NMCD:982",
            "source": [
                {
                    "name": "iDisk",
                    "record": "DR0671969"
                },
                {
                    "name": "NHPID"
                }
            ],
            "pubchem_cid": [
                26041,
                56836326,
                415267,
                129627951
            ]
        },
        {
            "name": "Zinc Chloride",
            "inchikey": "JIAARYAFYJHUJI-UHFFFAOYSA-L",
            "nmcd": "NMCD:982",
            "source": [
                {
                    "name": "iDisk",
                    "record": "DR0671969"
                },
                {
                    "name": "NHPID"
                }
            ],
            "pubchem_cid": 5727
        }
    ]
  }
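
The "one copy per InChIKey" rule at the subject level could be sketched roughly as follows. This is only an illustration: `expand_by_inchikey` is a hypothetical helper name, and the keys in the usage line are placeholders, not real InChIKeys.

```python
import copy


def expand_by_inchikey(record):
    """Return one document per subject-level InChIKey, using that key as _id."""
    keys = record.get("inchikey", [])
    if isinstance(keys, str):
        keys = [keys]  # a single key may be stored as a plain string
    docs = []
    # dict.fromkeys drops repeated keys while preserving their order.
    for key in dict.fromkeys(keys):
        doc = copy.deepcopy(record)  # copies must not share nested lists
        doc["_id"] = key
        doc["inchikey"] = key
        docs.append(doc)
    return docs


# Placeholder keys for illustration only.
copies = expand_by_inchikey({"name": "example", "inchikey": ["KEY-A", "KEY-B", "KEY-A"]})
```

Records without any subject-level InChIKey yield an empty list here, so they would still need some other identifier, as discussed above.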

r76941156 commented 3 years ago

@andrewsu Here is a summary of the predicate counts in the relation file (MRREL):

a1. has_ingredient: 689,297
a2. has_adverse_effect_on: 3,120
a3. interacts_with: 3,057
a4. has_therapeutic_class: 5,443
a5. is_effective_for: 5,154
a6. has_adverse_reaction: 2,093

I believe we already addressed a2-a6 in our original parser. For a1, 98% of the CUI1s (subjects) are brand names or human names, so there is no good approach to processing the text of these records (Reference).

However, if you want to include all records of a1, we can take an inverse approach:
CUI1 - has_ingredient - CUI2 -> CUI2 - ingredient_of - CUI1

This can build records that link all the CUI1s, since all CUI2s have the ID types we want (Reference).

Otherwise, I can only process 2% of the CUI1s in a1 using the original approach.
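
For illustration, the inverse approach could look something like the sketch below. It assumes each MRREL row has already been split into a five-element tuple; `invert_has_ingredient` is a hypothetical name, and the identifiers in the sample rows are placeholders rather than real iDISK IDs.

```python
from collections import defaultdict


def invert_has_ingredient(rows):
    """Turn CUI1 - has_ingredient - CUI2 rows into CUI2 -> ingredient_of entries."""
    inverted = defaultdict(list)
    for record, cui1, predicate, cui2, origin in rows:
        if predicate != "has_ingredient":
            continue  # other predicates are handled by the original approach
        inverted[cui2].append(
            {"ingredient_of": cui1, "record": record, "source": origin}
        )
    return dict(inverted)


# Placeholder rows for illustration only.
rows = [
    ("R1", "PRODUCT_A", "has_ingredient", "INGREDIENT_X", "SRC"),
    ("R2", "PRODUCT_B", "has_ingredient", "INGREDIENT_X", "SRC"),
    ("R3", "INGREDIENT_X", "is_effective_for", "DISEASE_Y", "SRC"),
]
by_ingredient = invert_has_ingredient(rows)
```

Keying on CUI2 means the document ID comes from the ingredient, which has a usable ID type, while the brand-name CUI1s only appear as linked values.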

andrewsu commented 3 years ago

To summarize a slack discussion:

So let's consider the data loading part of this issue done. Reassigning to @colleenXu to create a SmartAPI mapping file and registry record so it can be queried by BTE.

colleenXu commented 3 years ago

@r76941156 Could the API be adjusted to remove the ID prefix from fields like umls, has_adverse_effect_on.meddra?

Having the prefix causes some issues with BTE batch-querying and processing results, and we already know the ID's prefix/namespace because it's the field name...
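
A rough sketch of what the requested change amounts to, assuming the ID fields are named `umls` and `meddra` as in the example document earlier in this thread (other prefixed fields would need to be added to the tuple). `strip_id_prefixes` is a hypothetical helper, not an existing BioThings function.

```python
def strip_prefix(value):
    """Drop a leading 'PREFIX:' from a string, or from each item of a list."""
    if isinstance(value, list):
        return [strip_prefix(item) for item in value]
    if isinstance(value, str):
        return value.split(":", 1)[-1]
    return value


def strip_id_prefixes(doc, id_fields=("umls", "meddra")):
    """Recursively strip prefixes from the named ID fields, leaving the rest as-is."""
    if isinstance(doc, list):
        return [strip_id_prefixes(item, id_fields) for item in doc]
    if isinstance(doc, dict):
        return {
            key: strip_prefix(value) if key in id_fields
            else strip_id_prefixes(value, id_fields)
            for key, value in doc.items()
        }
    return doc


example = {
    "name": "7-oxodehydroepiandrosterone",
    "umls": "UMLS:C0525091",
    "has_adverse_effect_on": [
        {"name": "Nervous system disorders", "meddra": "MEDDRA:10029205"}
    ],
}
cleaned = strip_id_prefixes(example)
```

After this, `cleaned["umls"]` is `"C0525091"` and the nested `meddra` value is `"10029205"`, while name fields are left untouched.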

r76941156 commented 3 years ago

@colleenXu I am fine with that as long as @andrewsu approves this. Thanks.

andrewsu commented 3 years ago

Yes please, thanks @r76941156!

r76941156 commented 3 years ago

@andrewsu @colleenXu I already finished my part. @newgene @erikyao may deploy this to PRD next week since the pending hub is doing a db migration.

andrewsu commented 3 years ago

Thanks! Commit for reference: https://github.com/r76941156/iDisk/commit/bd31518193bb991b8d58900ba98ff0f71b97fb99

erikyao commented 3 years ago

Hi guys, I ran into a problem in our new biothings.api when releasing the new iDisk plugin. Already reported to @newgene. Will keep you updated.

andrewsu commented 3 years ago

sorry, must have accidentally clicked the wrong button -- reopening pending deployment of the latest changes and confirmation that it can be queried via BTE...

erikyao commented 3 years ago

Our recent upgrade to ElasticSearch v7.0 and Biothings API v0.10 has caused some minor technical issues for us. E.g.

@namespacestd0 and I will keep you updated on the progress of the fix.

erikyao commented 3 years ago

Updated! Please check https://pending.biothings.io/idisk

andrewsu commented 3 years ago

looks good to me! @colleenXu, please confirm the modified structure works with your smartAPI annotations...