biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

BioThings suppKG: parser, x-bte, adding to BTE #706

Closed colleenXu closed 8 months ago

colleenXu commented 1 year ago

Opening an issue here to better track the status of this effort.

Previous discussion is in https://github.com/NCATS-Tangerine/translator-api-registry/pull/122, with the currently-relevant comments starting at https://github.com/NCATS-Tangerine/translator-api-registry/pull/122#issuecomment-1679823539 and https://github.com/biothings/pending.api/issues/55#issuecomment-1135403174

There are currently some concerns related to the data/parser...

colleenXu commented 1 year ago

Thanks to @mnarayan1, we have a SmartAPI yaml https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/suppkg/suppkg.yaml that covers supplement treatments for disease. We were able to use templated requestBody to generate a BioThings query structure that we haven't tried before: setting a field to multiple possible values using OR.
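As a rough illustration of the OR pattern, here is a minimal Python sketch that builds a query string matching one field against several possible values. The field name `subject.umls` comes from the API's document structure; the helper name `build_or_query` and the exact string the SmartAPI template produces are assumptions, not copied from suppkg.yaml.

```python
def build_or_query(field: str, values: list[str]) -> str:
    """Build a Lucene-style query string of the form field:("v1" OR "v2")."""
    if not values:
        raise ValueError("need at least one value")
    # Quote each value so that IDs are matched as exact terms.
    quoted = " OR ".join(f'"{v}"' for v in values)
    return f"{field}:({quoted})"


# The two CUIs are the ones used in the example queries in this thread.
query = build_or_query("subject.umls", ["C1268859", "C0009324"])
print(query)
```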

I've registered the SmartAPI yaml https://smart-api.info/registry?q=b48c34df08d16311e3bca06b135b828d

So it's now accessible through any BTE instance using the api-specific endpoints - but it's not used by the team-specific / ara-specific endpoints yet.

colleenXu commented 1 year ago

Here's a TRAPI query for "zinc supplement" -> disease

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["UMLS:C1268859"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1"
        }
      }
    }
  }
}
```
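For anyone who wants to send this from a script, here is a minimal Python sketch using only the standard library. The helper names are hypothetical, and the api-specific URL shape in the comment is an assumption that should be checked against the BTE instance you use; only the SmartAPI ID is taken from the registry link above.

```python
import json
import urllib.request


def one_hop_query(subject_id, subject_category, object_category):
    """Build a one-hop TRAPI query graph shaped like the example above."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [subject_id], "categories": [subject_category]},
                    "n1": {"categories": [object_category]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }


def post_trapi(url, query, timeout=300):
    """POST a TRAPI query as JSON and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


query = one_hop_query("UMLS:C1268859", "biolink:SmallMolecule", "biolink:Disease")

# Assumed URL shape for an api-specific endpoint (not confirmed in this thread):
# url = "https://bte.transltr.io/v1/smartapi/b48c34df08d16311e3bca06b135b828d/query"
# response = post_trapi(url, query)
```

The network call is left commented out so the snippet can be run without hitting a live service.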

Response: suppKG1.txt

An Edge in the response looks like this in the ARAX UI: [image: Screen Shot 2023-08-24 at 11 09 30 PM]

colleenXu commented 1 year ago

But....I still want to discuss the "UMLS:DC" IDs with @andrewsu (previous posts here and here), before moving forward.

I'm using an "ulcerative colitis" -> supplement response as my reference: suppkg2.txt

TRAPI query

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["UMLS:C0009324"],
          "categories": ["biolink:Disease"]
        },
        "n1": {
          "categories": ["biolink:SmallMolecule"]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1"
        }
      }
    }
  }
}
```

Analysis

The IDs may be real UMLS IDs, if you remove the "D".

| UMLS:DC ID | suppKG Name | Real UMLS ID | UMLS Name | notes |
|--------|--------|--------|--------|--------|
| UMLS:DC0023791 | (r)-dithiolane-3-pentanoic acid | [UMLS:C0023791](https://uts.nlm.nih.gov/uts/umls/concept/C0023791) | thioctic acid | mapped ID names ((R)-1,2-Dithiolane-3-pentanoic acid) are very similar to suppKG's name. This is more commonly known as alpha-lipoic acid |
| UMLS:DC0026370 | black strap molasses | [UMLS:C0026370](https://uts.nlm.nih.gov/uts/umls/concept/C0026370) | molasses | Hmm blackstrap molasses is a narrower concept |
| UMLS:DC0016157 | 1,200 mg | [UMLS:C0016157](https://uts.nlm.nih.gov/uts/umls/concept/C0016157) | fish oils | |
| UMLS:DC0014839 | aesculin | [UMLS:C0014839](https://uts.nlm.nih.gov/uts/umls/concept/C0014839) | esculin | suppKG's name is in the mapped ID names |
| UMLS:DC0349374 | arerra | [UMLS:C0349374](https://uts.nlm.nih.gov/uts/umls/concept/C0349374) | Cow's milk | ["arerra" is a synonym for fermented milk](https://www.webmd.com/vitamins/ai/ingredientmono-1481/fermented-milk) |
| UMLS:DC1141640 | beesnest plant | [UMLS:C1141640](https://uts.nlm.nih.gov/uts/umls/concept/C1141640) | Carrots - dietary | hmm...[Bee's nest-plant is also called wild carrot or Queen Anne's lace](https://plants.ces.ncsu.edu/plants/daucus-carota/) |
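The "remove the D" idea can be sketched in a few lines of Python. This is only a string transformation that proposes a candidate CUI to look up by hand; it does not check that the candidate exists in UMLS or that it means the same thing as the suppKG term. The function name is hypothetical.

```python
import re

# Matches curies such as "UMLS:DC0023791" and captures the numeric part.
DC_CURIE = re.compile(r"^UMLS:DC(\d+)$")


def candidate_umls_curie(curie):
    """Return the curie with the 'D' removed, or None if it is not a DC curie."""
    match = DC_CURIE.match(curie)
    if match is None:
        return None
    return f"UMLS:C{match.group(1)}"


for dc in ["UMLS:DC0023791", "UMLS:DC0016157", "UMLS:C0009324"]:
    print(dc, "->", candidate_umls_curie(dc))
```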

The UMLS ID names may match suppKG's associations

The edge for "1,200 mg" (UMLS:DC0016157) actually is about fish oils (UMLS:C0016157), and doesn't mention "1,200 mg"

```
"5a6f30fdb2b0d8703c8d4bc8ff58ef96": {
  "predicate": "biolink:treated_by",
  "subject": "MONDO:0005101",
  "object": "UMLS:DC0016157",
  "attributes": [
    {
      "attribute_type_id": "biolink:publications",
      "value": [
        "PMID:30489199"
      ],
      "value_type_id": "linkml:Uriorcurie"
    },
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": [
        "Therefore, realizing the need for safer and well tolerable alterative treatment approaches, currently, we evaluated the efficacy of n-3 fatty acids rich fish oil (FO) in the resolution of UC."
      ]
    }
  ],
```

The edge for "fibersol-2" (UMLS:DC0032594) actually is about polysaccharides (UMLS:C0032594), and not fibersol-2

[fibersol-2 is a brand supplement with fiber and maltodextrin, derived from corn](https://www.nowfoods.com/products/supplements/prebiotic-fiber-fibersol-2-powder)

But the edge is actually about two different kinds of polysaccharides:

* [modified apple polysaccharides](https://pubmed.ncbi.nlm.nih.gov/30572047/)
* [RTP aka Rheum Tanguticum Polysaccharide](https://pubmed.ncbi.nlm.nih.gov/23674951/)

```
"3b58d54615751c2a11c4f28660371a6a": {
  "predicate": "biolink:treated_by",
  "subject": "MONDO:0005101",
  "object": "UMLS:DC0032594",
  "attributes": [
    {
      "attribute_type_id": "biolink:publications",
      "value": [
        "PMID:30572047",
        "PMID:23674951"
      ],
      "value_type_id": "linkml:Uriorcurie"
    },
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": [
        "Efficacy of co-administration of modified apple polysaccharide and probiotics in guar gum-Eudragit S100 based mesalamine mini tablets: A novel approach in treating ulcerative colitis.",
        "Our results showed that RTP had significant therapeutic effects on both UC and CD."
      ]
    }
  ],
```
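When checking many edges like these by hand, it can help to pull the publications and supporting text out of each edge's attribute list. A minimal Python sketch follows; the function name is hypothetical, and the sample edge is a trimmed copy of the fibersol-2 edge (one supporting sentence omitted for brevity).

```python
def edge_evidence(edge):
    """Return (publications, supporting_text) collected from a TRAPI edge."""
    publications, supporting_text = [], []
    for attr in edge.get("attributes", []):
        type_id = attr.get("attribute_type_id")
        if type_id == "biolink:publications":
            publications.extend(attr.get("value", []))
        elif type_id == "biolink:supporting_text":
            supporting_text.extend(attr.get("value", []))
    return publications, supporting_text


# Trimmed copy of the edge whose object is UMLS:DC0032594.
sample_edge = {
    "predicate": "biolink:treated_by",
    "subject": "MONDO:0005101",
    "object": "UMLS:DC0032594",
    "attributes": [
        {
            "attribute_type_id": "biolink:publications",
            "value": ["PMID:30572047", "PMID:23674951"],
            "value_type_id": "linkml:Uriorcurie",
        },
        {
            "attribute_type_id": "biolink:supporting_text",
            "value": [
                "Our results showed that RTP had significant therapeutic effects on both UC and CD."
            ],
        },
    ],
}

pubs, sentences = edge_evidence(sample_edge)
print(sample_edge["object"], pubs, sentences)
```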

Other analysis: seems okay to use UMLS ID/name but other things are going on

The edge for "arerra" (UMLS:DC0349374) actually mentions cow milk (UMLS:C0349374), but it turns out "arerra" is an obscure name for the supplement.

["arerra" is a synonym for fermented milk](https://www.webmd.com/vitamins/ai/ingredientmono-1481/fermented-milk)

```
"7d60ab8033b02610a8209dfd5926be57": {
  "predicate": "biolink:treated_by",
  "subject": "MONDO:0005101",
  "object": "UMLS:DC0349374",
  "attributes": [
    {
      "attribute_type_id": "biolink:publications",
      "value": [
        "PMID:21525768"
      ],
      "value_type_id": "linkml:Uriorcurie"
    },
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": [
        "Here, we examined the effects of a live Bifidobacterium breve strain Yakult, a probiotic contained in bifidobacteria-fermented milk, and galacto-oligosaccharide (GOS) as synbiotics in UC patients."
      ]
    }
  ],
```

suppKG name + real UMLS name both don't match the paper: entity-resolution issue?

The Edge for beesnest plant (UMLS:DC1141640) isn't about [bee's nest-plant/wild carrot/Queen Anne's lace](https://plants.ces.ncsu.edu/plants/daucus-carota/). It also isn't about the food carrots (Carrots - dietary; UMLS:C1141640). The [paper](https://pubmed.ncbi.nlm.nih.gov/28824631/) is about [Morinda officinalis aka Indian mulberry](https://en.wikipedia.org/wiki/Morinda_officinalis).

```
"3dc0b3a041254bb526b3d75907063109": {
  "predicate": "biolink:treated_by",
  "subject": "MONDO:0005101",
  "object": "UMLS:DC1141640",
  "attributes": [
    {
      "attribute_type_id": "biolink:publications",
      "value": [
        "PMID:28824631"
      ],
      "value_type_id": "linkml:Uriorcurie"
    },
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": [
        "The results demonstrated that the effects of MORE and MOHRE for the treatment of UC are similar, although there are a few difference on their chemical composition, indicating the hairy root cultured from M."
      ]
    }
  ],
```

colleenXu commented 1 year ago

Note that "moving forward" steps would be:

  • getting infores entries w/ wiki pages for infores:suppkg (primary) and infores:biothings-suppkg (aggregator)
  • making a PR to add to BTE's config file, deploying to dev and CI (not frozen)

andrewsu commented 1 year ago

Per @erikyao 's comment here:

Hi @colleenXu , from SemRep_DS/docs/SemRep_full_fielded_output.txt:

*_CUI: The CUI of the subject/object entity. If a CUI starts with 'DC' instead of just 'C' it is an iDISK CUI and is not present in the UMLS.

It seems like the authors' intent is clear that "DC" IDs are meant to represent concepts for which they find no synonymous UMLS ID. @colleenXu, you've found many examples where it appears that there is a very tight connection between the "DC" ID and the corresponding UMLS ID. However, I don't think we have the time or expertise to be able to evaluate that linking exhaustively. Since the consequence of moving forward as-is is underlinking (rather than inclusion of false assertions, at least beyond the expected rate from a text-mined resource), I think we should go forward with that plan. So please proceed with the next steps you outlined in the preceding comment. Thanks!

colleenXu commented 1 year ago

After discussion with Andrew (8/29?), we agreed to go forward with the DC IDs.

I followed my earlier post of "next steps to deployment":

colleenXu commented 1 year ago

@andrewsu @erikyao

I have another thought on the "DC" terms, but I don't know if @erikyao already investigated this...

Based on Yao's url https://github.com/zhang-informatics/SemRep_DS/blob/main/docs/SemRep_full_fielded_output.txt:

  • SuppKG maybe didn't originally have a prefix for these IDs.
  • The text says "If a CUI starts with 'DC' instead of just 'C' it is an iDISK CUI and is not present in the UMLS."

So I wonder if we'd want these "DC" terms in different fields of the BioThings SuppKG API. Right now, they're in subject.umls and object.umls, which is why x-bte annotation sets BTE up to add the UMLS prefix to these "DC" terms, when they're not UMLS CUIs...
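To make the "different fields" idea concrete, here is a minimal Python sketch of how a parser could route a raw CUI to a namespace-specific key. The key name `idisk` and the function name are hypothetical; the current API keeps both kinds of ID under `umls`.

```python
def split_cui(cui):
    """Place a raw SuppKG CUI under a key that reflects its namespace."""
    if cui.startswith("DC"):
        # iDISK CUI, not present in UMLS. "idisk" is a hypothetical field name.
        return {"idisk": cui}
    if cui.startswith("C"):
        return {"umls": cui}
    raise ValueError(f"unrecognized CUI format: {cui!r}")


print(split_cui("DC0016157"))
print(split_cui("C0016157"))
```

With a split like this, the x-bte annotation would only add the UMLS prefix to values under `umls`, and the "DC" terms would need their own prefix once one is decided.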


And I was wondering if we know more about the "DC" terms, which may help us decide if they are a different namespace (and if so, what the prefix and other namespace info would be).

  • In a quick look in the suppKG paper, I see "Additionally, because of how DCUIs were assigned in iDISK, it is possible to map DS concepts with DCUIs to UMLS concepts with CUIs." This makes me wonder if (and how many) mappings exist between "DCUIs" and "UMLS CUIs"...and whether this could be added to the BioThings SuppKG API...
  • To understand these "DC" IDs more... this may involve digging thru the SuppKG paper and maybe the iDISK paper referenced

andrewsu commented 1 year ago

After reviewing this again, I think we should move forward with the "quickest path" solution -- keeping the DC IDs under subject.umls and object.umls. Yes, it results in invalid UMLS curies, but I think that's fine for the sake of expediency.

Also just noting for future reference that in the source file, there are 53707 IDs that start with C, and 2928 that start with D.
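A tally like this can be reproduced with a small Python sketch. How the IDs are read from the source file is left out because the file layout isn't described here; the sample list below is illustrative and built from IDs mentioned in this thread.

```python
from collections import Counter


def count_by_first_letter(ids):
    """Count IDs by their first character, skipping empty strings."""
    return Counter(i[0] for i in ids if i)


# Illustrative sample only, not the real source file.
sample_ids = ["C0016157", "DC0016157", "C0009324", "DC0349374", "C1268859"]
counts = count_by_first_letter(sample_ids)
print(counts["C"], counts["D"])
```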

colleenXu commented 1 year ago

Now being addressed by a different commit https://github.com/biothings/bte-server/commit/58177d37ddb66c52ae3a732aecb0ddfa79257cd4. This is now deployed on dev/CI instances.

See Jackson's post here

colleenXu commented 8 months ago

Closing this issue since the changes have been deployed to Prod with the Feb 2024 release.

I've confirmed that I can query BioThings suppKG through BTE prod (https://bte.transltr.io/v1/team/Service%20Provider/query) with the example in https://github.com/biothings/biothings_explorer/issues/706#issuecomment-1692818689 and get the expected response.