biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0
10 stars 11 forks source link

TRAPI 1.5: set_interpretation/MCQ #800

Open colleenXu opened 5 months ago

colleenXu commented 5 months ago

See notes from 2024-03-27 group meeting.

I have a lot of concerns / questions, and we're going to ask the Translator group for clarification. We'll start with asking about the basic requirements / deadlines.

colleenXu commented 5 months ago
This is all of my initial understanding. There seems to be a LOT more going on, which we'll track in the opening post of this issue.

**[EDIT: PAUSE FOR DISCUSSION. NEEDS MORE INVESTIGATION ON WHAT THE SPEC IS, based on [Eric's reply in Translator Slack](https://ncatstranslator.slack.com/archives/C0156TX0VB2/p1711424472142189)]** Previously, we implemented QNode `is_set` behavior ([original issue](https://github.com/biothings/biothings_explorer/issues/341), [behavior with ID/node-expansion](https://github.com/biothings/biothings_explorer/issues/555)). * default behavior (property missing or `is_set: false`): each result has 1 KG node bound to each QNode * but if a QNode has `is_set: true`, a result can have >= 1 KG node bound to that QNode. AKA there's a merging/consolidation. ### Feature In TRAPI 1.5, `is_set` is replaced with `set_interpretation`, which has more explicit rules for results-assembly ([PR](https://github.com/NCATSTranslator/ReasonerAPI/pull/475/files), lines [881](https://github.com/NCATSTranslator/ReasonerAPI/blob/2b7a0e0a18b1ee97b0ab0c5d6283ab86b83ce12b/TranslatorReasonerAPI.yaml#L881)-896). It's an optional property with string values (enum). * default behavior (property missing or null) == "BATCH": same as before, each result has 1 KG node bound to each QNode * "MANY": same as previous `is_set: true` behavior. * Note: This new specification only seems to cover when QNodes have multiple starting IDs. But I'd like to keep our current use of `is_set:true` / `set_interpretation: MANY` on QNodes with no starting IDs to merge/consolidate results. * "ALL": **new behavior**. This should only be set on QNodes that have multiple starting IDs/entities. Similar to the "MANY" behavior, but only keep results that contain all starting IDs/entities. * AKA if only some of the starting IDs/entities are in the consolidated result, it should be thrown out (and any KG nodes/edges unique to it should be pruned). ### Examples All use the same basic query, just setting `set_interpretation` to different values. I used HP IDs with no descendants (because ID/node-expansion triggers an automatic use of `is_set: true`, see https://github.com/biothings/biothings_explorer/issues/555#issuecomment-1412517755)

set to BATCH (default)

Query: ``` { "message": { "query_graph": { "nodes": { "n0": { "categories":["biolink:PhenotypicFeature"], "ids":["HP:0500041", "HP:0007750"], "set_interpretation": "BATCH" }, "n1": { "categories":["biolink:Disease"] } }, "edges": { "eA": { "subject": "n0", "object": "n1", "predicates": ["biolink:phenotype_of"] } } } } } ``` Response should have **39 results**: [current_default.json](https://github.com/biothings/biothings_explorer/files/14751403/current_default.json). This was generated with current default (not setting `is_set`)

set to MANY

Query: ``` { "message": { "query_graph": { "nodes": { "n0": { "categories":["biolink:PhenotypicFeature"], "ids":["HP:0500041", "HP:0007750"], "set_interpretation": "MANY" }, "n1": { "categories":["biolink:Disease"] } }, "edges": { "eA": { "subject": "n0", "object": "n1", "predicates": ["biolink:phenotype_of"] } } } } } ``` Response should have **37 results** (rather than 39): [current_is_set.json](https://github.com/biothings/biothings_explorer/files/14751449/current_is_set.json). This was generated with current `is_set: true`.

set to ALL

Query: ``` { "message": { "query_graph": { "nodes": { "n0": { "categories":["biolink:PhenotypicFeature"], "ids":["HP:0500041", "HP:0007750"], "set_interpretation": "ALL" }, "n1": { "categories":["biolink:Disease"] } }, "edges": { "eA": { "subject": "n0", "object": "n1", "predicates": ["biolink:phenotype_of"] } } } } } ``` Response should have **2 results** (rather than 39 or 37). Only 2 disease entities are connected to both starting entities. See the first two results of [current_is_set.json](https://github.com/biothings/biothings_explorer/files/14751449/current_is_set.json). This was generated with current `is_set: true`. * MONDO:0009003 (achromatopsia 2) * MONDO:0013560 (Hermansky-Pudlak syndrome 8)

## Complications that need discussion ### (1) ID/node expansion Currently, if we find descendants of a starting ID (ID/node expansion), we set that starting ID's QNode to `is_set: true` https://github.com/biothings/biothings_explorer/issues/555#issuecomment-1412517755. Can we remove this behavior? * the current behavior has unintended consequences, like **being completely unable to do `is_set: false` / `set_interpretation: BATCH` for some queries** * but I'm not sure if we depend on this behavior (we want to keep the behavior of "subclass_of edges using different descendant IDs are kept in the same result") * Context: we implemented this with an old version of representing subclass info ([comment](https://github.com/biothings/biothings_explorer/issues/497#issuecomment-1304368502)). Now we use "constructed edges" + aux-graphs ([issue](https://github.com/biothings/biothings_explorer/issues/603))
Example

I expect 11 results for following query (not setting `is_set`), but end up with 10 results which is the same as setting `is_set: true`. * response to the query below, not setting `is_set`: [10_not_setting_is_set.json](https://github.com/biothings/biothings_explorer/files/14751649/10_not_setting_is_set.json) * VS response setting `is_set` [10_is_set_true.json](https://github.com/biothings/biothings_explorer/files/14751645/10_is_set_true.json) ``` { "message": { "query_graph": { "nodes": { "n0": { "categories":["biolink:PhenotypicFeature"], "ids":["HP:0007800", "HP:0025586"] }, "n1": { "categories":["biolink:Disease"] } }, "edges": { "eA": { "subject": "n0", "object": "n1", "predicates": ["biolink:phenotype_of"] } } } } } ``` This happens because ID/node-expansion finds a descendant for 1 of the starting IDs. Console logs: ``` bte:biothings-explorer-trapi:main Expanded ids for node n0: (2 ids -> 3 ids) +0ms bte:biothings-explorer-trapi:main Added is_set:true to node n0 +1ms ```

Note to self with another example

This query has 130 results whether `is_set: true` or not, when it should have >= 134 results when not. It also has some subclass_of edges/aux-graphs, but I'm not sure if it's a good test for seeing if ID/node-expansion becomes wonky after the changes. ``` { "message": { "query_graph": { "nodes": { "n0": { "categories":["biolink:PhenotypicFeature"], "ids":["HP:0003259", "HP:0000110"] }, "n1": { "categories":["biolink:Disease"] } }, "edges": { "eA": { "subject": "n0", "object": "n1", "predicates": ["biolink:phenotype_of"] } } } } } ```

### (2) Unclear what the KG Node `is_set` property is for [Asked in Translator Slack](https://ncatstranslator.slack.com/archives/C0156TX0VB2/p1711422162104869): The [PR](https://github.com/NCATSTranslator/ReasonerAPI/pull/475/files) for `set_interpretation` also adds an `is_set` property to KG Nodes (lines [1011](https://github.com/NCATSTranslator/ReasonerAPI/blob/2b7a0e0a18b1ee97b0ab0c5d6283ab86b83ce12b/TranslatorReasonerAPI.yaml#L1011)-1017). I'm not sure if this is meant to be used, and how (merging KG Nodes??). [Eric's reply in Translator Slack](https://ncatstranslator.slack.com/archives/C0156TX0VB2/p1711424472142189) - needs more investigation. ### (3) Clarifying an edge case [Asked in Translator Slack](https://ncatstranslator.slack.com/archives/C0156TX0VB2/p1711422339954539): If `set_interpretation` is set on a QNode with multiple starting IDs, but these IDs all map to the same entity (using NodeNorm), then there isn't any set behavior to do. Is that fine? Does there need to be any log noting this?