biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

refactor semmeddb SmartAPI annotation to better represent text snippets #833

Open andrewsu opened 1 month ago

andrewsu commented 1 month ago

TMKP represents their text snippets in a way that the UI is able to display them. In contrast, for SemMedDB, the UI only displays the first sentence of the abstract. More analysis on what BTE is doing is in https://github.com/NCATSTranslator/Feedback/issues/625#issuecomment-2226250508, and the TMKP solution is described in https://github.com/NCATSTranslator/Feedback/issues/625#issuecomment-2226378023.

rjawesome commented 1 month ago

I believe this can be done with a JQ wrap template applied on SemMedDB.

andrewsu commented 1 month ago

Great idea, @rjawesome. Though hold off on working on this for a moment while @colleenXu, who I chatted with earlier today, gets some further clarifications on that structure and how the UI consumes it. But the jq templates do seem like a good option when it comes down to implementation!

colleenXu commented 3 weeks ago

@rjawesome @tokebe

For BioThings SEMMEDDB, we want to post-process some of the sub-query response data into a special TRAPI format (sentence/publication info).

Example SEMMEDDB data

https://biothings.ci.transltr.io/semmeddb/association/C0043481-STIMULATES-4780 (where [this](https://github.com/NCATSTranslator/Feedback/issues/625#issuecomment-2226250508) comes from)

We want to post-process each element in the `predication` array, keeping the `sentence`, `pmid`, and `predication_id` for each element together.

Note:

* This association has 9 sentences (`predication_count`) from **6 publications** (`pmid_count`)
* Two of the sentences are duplicates (`predication.predication_id` 171149564 and 171149565 for pmid 23868099)
* Additionally, there are two publications with multiple sentences:
  * 25994789
  * 23536959

```
{
  "_id": "C0043481-STIMULATES-4780",
  ...
  "pmid_count": 6,
  ...
  "predication": [
    {
      "object_score": 720,
      "object_text": "Nrf2",
      "pmid": 24597671,
      "predication_id": 73680403,
      "sentence": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function.",
      "sentence_id": 21797489,
      "subject_score": 1000,
      "subject_text": "Zn"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 25994789,
      "predication_id": 142544436,
      "sentence": "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation.",
      "sentence_id": 263309274,
      "subject_score": 775,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 25994789,
      "predication_id": 142545021,
      "sentence": "The increase of intracellular free zinc may be one mechanism for Nrf2 activation.",
      "sentence_id": 263310520,
      "subject_score": 802,
      "subject_text": "zinc"
    },
    {
      "object_score": 794,
      "object_text": "Nrf2",
      "pmid": 16723490,
      "predication_id": 166335624,
      "sentence": "CONCLUSIONS: Induction of the ARE-Nrf2 pathway by zinc provides powerful and prolonged antioxidation and detoxification that may explain the beneficial effects of zinc observed in the treatment of age-related macular degeneration (AMD).",
      "sentence_id": 309601386,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 23536959,
      "predication_id": 168073659,
      "sentence": "There was gender difference for the protective effect of zinc against diabetes-induced pathogenic changes and the up-regulated levels of Nrf2 and MT in the aorta.",
      "sentence_id": 312795976,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 861,
      "object_text": "Nrf2",
      "pmid": 23536959,
      "predication_id": 168073663,
      "sentence": "The aortic protection by zinc against diabetes-induced pathogenic changes is associated with the up-regulation of both MT and Nrf2 expression.",
      "sentence_id": 312795978,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 901,
      "object_text": "Nrf2",
      "pmid": 23868099,
      "predication_id": 171149564,
      "sentence": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs.",
      "sentence_id": 318601901,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 1000,
      "object_text": "Nrf2",
      "pmid": 23868099,
      "predication_id": 171149565,
      "sentence": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs.",
      "sentence_id": 318601901,
      "subject_score": 1000,
      "subject_text": "zinc"
    },
    {
      "object_score": 618,
      "object_text": "Nrf2",
      "pmid": 33198336,
      "predication_id": 190294446,
      "sentence": "In addition, NAC inhibited the Zn-induced Nrf2 activation and limited the concomitant upregulation of cellular GSH concentrations.",
      "sentence_id": 359812408,
      "subject_score": 618,
      "subject_text": "Zn"
    }
  ],
  "predication_count": 9,
  ...
```

For testing, this TRAPI query should only return the example data as 1 TRAPI edge

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "creativeQuerySubject": {
          "ids": ["CHEBI:27363"],
          "categories": ["biolink:ChemicalEntity"],
          "name": "zinc"
        },
        "creativeQueryObject": {
          "ids": ["NCBIGene:4780"],
          "categories": ["biolink:Gene", "biolink:Protein"],
          "name": "NFE2L2"
        }
      },
      "edges": {
        "eA": {
          "subject": "creativeQuerySubject",
          "object": "creativeQueryObject",
          "predicates": ["biolink:affects"],
          "qualifier_constraints": [
            {
              "qualifier_set": [
                {
                  "qualifier_type_id": "biolink:object_direction_qualifier",
                  "qualifier_value": "increased"
                },
                {
                  "qualifier_type_id": "biolink:object_aspect_qualifier",
                  "qualifier_value": "activity_or_abundance"
                }
              ]
            }
          ]
        }
      }
    }
  }
}
```

Then we'll want modified x-bte annotation

I have modifications stored on this [branch](https://github.com/NCATS-Tangerine/translator-api-registry/tree/semmeddb_publication_refactor/semmeddb).

* each operation's `parameter.fields`: changed to grab the whole `predication` contents and `pmid_count`. Find-replace `predication.pmid,predication.sentence` ➡️ `predication,pmid_count`
* response-mapping: adjust to use `predication` and `pmid_count`. Will use special key `semmeddb_publication_info` for `predication` field to signal special post-processing.
* `pmid_count` can be handled by existing code. Keep value as int!

Example:

```
umls-obj:
  UMLS: object.umls   ## no prefix
  semmeddb_publication_info: predication   ## no prefixes on pmids
  "biolink:evidence_count": predication.pmid_count
  input_name: subject.name
  output_name: object.name
```

Then we want to format the SEMMEDDB predication data into TRAPI edge-attributes

Requirements:

Example: First element in `predication` array -> 1 TRAPI edge-attribute

The SEMMEDDB data:

```
{
  "object_score": 720,
  "object_text": "Nrf2",
  "pmid": 24597671,
  "predication_id": 73680403,
  "sentence": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function.",
  "sentence_id": 21797489,
  "subject_score": 1000,
  "subject_text": "Zn"
},
```

**The TRAPI edge-attribute:**

* `predication_id` ➡️ top-level value. Turn it into a string, since it's an ID!
* `sentence` ➡️ sub-attribute `biolink:supporting_text` value
* `pmid` ➡️ sub-attribute `biolink:publications` value. Add prefix!

```
{
  "attribute_type_id": "biolink:has_supporting_study_result",
  "value": "73680403",
  "attributes": [
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function."
    },
    {
      "attribute_type_id": "biolink:publications",
      "value": "PMID:24597671"
    }
  ]
},
```
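The three mapping rules above can be sketched as a small function. This is only an illustration in Python, not BTE's actual implementation; the function name `to_edge_attribute` is hypothetical, and the sketch assumes every element carries `predication_id`, `sentence`, and `pmid`.

```python
def to_edge_attribute(element: dict) -> dict:
    """Map one SEMMEDDB `predication` element to one TRAPI edge-attribute.

    Hypothetical helper for illustration only.
    """
    return {
        "attribute_type_id": "biolink:has_supporting_study_result",
        # predication_id is an ID, so it becomes a string
        "value": str(element["predication_id"]),
        "attributes": [
            {
                "attribute_type_id": "biolink:supporting_text",
                "value": element["sentence"],
            },
            {
                "attribute_type_id": "biolink:publications",
                # the raw pmid has no prefix, so add one
                "value": f"PMID:{element['pmid']}",
            },
        ],
    }


# The first element of the example `predication` array
example = {
    "object_score": 720,
    "object_text": "Nrf2",
    "pmid": 24597671,
    "predication_id": 73680403,
    "sentence": "Therefore, Zn up-regulates Nrf2 function via activating "
                "Akt-mediated inhibition of Fyn function.",
    "sentence_id": 21797489,
    "subject_score": 1000,
    "subject_text": "Zn",
}

attribute = to_edge_attribute(example)
```

Fields that are not part of the mapping (`object_score`, `subject_text`, and so on) are simply ignored here.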

5 more unique publications in `predication` array -> 5 more TRAPI edge-attributes

Note that I picked the first element/sentence for the 3 cases where there are multiple sentences (PMIDs 23868099, 25994789, 23536959).

```
{
  "attribute_type_id": "biolink:has_supporting_study_result",
  "value": "142544436",
  "attributes": [
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation."
    },
    {
      "attribute_type_id": "biolink:publications",
      "value": "PMID:25994789"
    }
  ]
},
{
  "attribute_type_id": "biolink:has_supporting_study_result",
  "value": "166335624",
  "attributes": [
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": "CONCLUSIONS: Induction of the ARE-Nrf2 pathway by zinc provides powerful and prolonged antioxidation and detoxification that may explain the beneficial effects of zinc observed in the treatment of age-related macular degeneration (AMD)."
    },
    {
      "attribute_type_id": "biolink:publications",
      "value": "PMID:16723490"
    }
  ]
},
{
  "attribute_type_id": "biolink:has_supporting_study_result",
  "value": "168073659",
  "attributes": [
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": "There was gender difference for the protective effect of zinc against diabetes-induced pathogenic changes and the up-regulated levels of Nrf2 and MT in the aorta."
    },
    {
      "attribute_type_id": "biolink:publications",
      "value": "PMID:23536959"
    }
  ]
},
{
  "attribute_type_id": "biolink:has_supporting_study_result",
  "value": "171149564",
  "attributes": [
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs."
    },
    {
      "attribute_type_id": "biolink:publications",
      "value": "PMID:23868099"
    }
  ]
},
{
  "attribute_type_id": "biolink:has_supporting_study_result",
  "value": "190294446",
  "attributes": [
    {
      "attribute_type_id": "biolink:supporting_text",
      "value": "In addition, NAC inhibited the Zn-induced Nrf2 activation and limited the concomitant upregulation of cellular GSH concentrations."
    },
    {
      "attribute_type_id": "biolink:publications",
      "value": "PMID:33198336"
    }
  ]
}
```
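The "first element per publication" selection described above could look roughly like the following Python sketch. It is an assumption about how the selection might be done, not BTE's code; `first_per_pmid` is a hypothetical name, and the input is a trimmed copy of the example `predication` array with only `pmid` and `predication_id` kept.

```python
def first_per_pmid(predication: list) -> list:
    """Keep only the first element seen for each pmid, preserving order.

    Hypothetical helper for illustration only.
    """
    seen = set()
    selected = []
    for element in predication:
        if element["pmid"] not in seen:
            seen.add(element["pmid"])
            selected.append(element)
    return selected


# Trimmed copy of the example data: 9 predications from 6 publications
predication = [
    {"pmid": 24597671, "predication_id": 73680403},
    {"pmid": 25994789, "predication_id": 142544436},
    {"pmid": 25994789, "predication_id": 142545021},
    {"pmid": 16723490, "predication_id": 166335624},
    {"pmid": 23536959, "predication_id": 168073659},
    {"pmid": 23536959, "predication_id": 168073663},
    {"pmid": 23868099, "predication_id": 171149564},
    {"pmid": 23868099, "predication_id": 171149565},
    {"pmid": 33198336, "predication_id": 190294446},
]

selected = first_per_pmid(predication)
selected_ids = [str(element["predication_id"]) for element in selected]
```

With this input, `selected` has 6 elements, which matches `pmid_count`, and `selected_ids` lines up with the top-level `value` of the six edge-attributes listed in this comment.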


Notes: