Open colleenXu opened 7 months ago
**[EDIT: PAUSE FOR DISCUSSION. NEEDS MORE INVESTIGATION ON WHAT THE SPEC IS, based on [Eric's reply in Translator Slack](https://ncatstranslator.slack.com/archives/C0156TX0VB2/p1711424472142189)]**
Previously, we implemented QNode `is_set` behavior ([original issue](https://github.com/biothings/biothings_explorer/issues/341), [behavior with ID/node-expansion](https://github.com/biothings/biothings_explorer/issues/555)).
* default behavior (property missing or `is_set: false`): each result has 1 KG node bound to each QNode
* but if a QNode has `is_set: true`, a result can have >= 1 KG node bound to that QNode. AKA there's a merging/consolidation.
### Feature
In TRAPI 1.5, `is_set` is replaced with `set_interpretation`, which has more explicit rules for results-assembly ([PR](https://github.com/NCATSTranslator/ReasonerAPI/pull/475/files), lines [881](https://github.com/NCATSTranslator/ReasonerAPI/blob/2b7a0e0a18b1ee97b0ab0c5d6283ab86b83ce12b/TranslatorReasonerAPI.yaml#L881)-896). It's an optional property with string values (enum).
* default behavior (property missing or null) == "BATCH": same as before, each result has 1 KG node bound to each QNode
* "MANY": same as previous `is_set: true` behavior.
* Note: This new specification only seems to cover when QNodes have multiple starting IDs. But I'd like to keep our current use of `is_set:true` / `set_interpretation: MANY` on QNodes with no starting IDs to merge/consolidate results.
* "ALL": **new behavior**. This should only be set on QNodes that have multiple starting IDs/entities. Similar to the "MANY" behavior, but only keep results that contain all starting IDs/entities.
* AKA if only some of the starting IDs/entities are in the consolidated result, it should be thrown out (and any KG nodes/edges unique to it should be pruned).
### Examples
All use the same basic query, just setting `set_interpretation` to different values. I used HP IDs with no descendants (because ID/node-expansion triggers an automatic use of `is_set: true`, see https://github.com/biothings/biothings_explorer/issues/555#issuecomment-1412517755)
Query:
```
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"categories":["biolink:PhenotypicFeature"],
"ids":["HP:0500041", "HP:0007750"],
"set_interpretation": "BATCH"
},
"n1": {
"categories":["biolink:Disease"]
}
},
"edges": {
"eA": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:phenotype_of"]
}
}
}
}
}
```
Response should have **39 results**: [current_default.json](https://github.com/biothings/biothings_explorer/files/14751403/current_default.json). This was generated with current default (not setting `is_set`)
Query:
```
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"categories":["biolink:PhenotypicFeature"],
"ids":["HP:0500041", "HP:0007750"],
"set_interpretation": "MANY"
},
"n1": {
"categories":["biolink:Disease"]
}
},
"edges": {
"eA": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:phenotype_of"]
}
}
}
}
}
```
Response should have **37 results** (rather than 39): [current_is_set.json](https://github.com/biothings/biothings_explorer/files/14751449/current_is_set.json). This was generated with current `is_set: true`.
Query:
```
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"categories":["biolink:PhenotypicFeature"],
"ids":["HP:0500041", "HP:0007750"],
"set_interpretation": "ALL"
},
"n1": {
"categories":["biolink:Disease"]
}
},
"edges": {
"eA": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:phenotype_of"]
}
}
}
}
}
```
Response should have **2 results** (rather than 39 or 37). Only 2 disease entities are connected to both starting entities. See the first two results of [current_is_set.json](https://github.com/biothings/biothings_explorer/files/14751449/current_is_set.json). This was generated with current `is_set: true`.
* MONDO:0009003 (achromatopsia 2)
* MONDO:0013560 (Hermansky-Pudlak syndrome 8)
I expect 11 results for following query (not setting `is_set`), but end up with 10 results which is the same as setting `is_set: true`.
* response to the query below, not setting `is_set`: [10_not_setting_is_set.json](https://github.com/biothings/biothings_explorer/files/14751649/10_not_setting_is_set.json)
* VS response setting `is_set` [10_is_set_true.json](https://github.com/biothings/biothings_explorer/files/14751645/10_is_set_true.json)
```
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"categories":["biolink:PhenotypicFeature"],
"ids":["HP:0007800", "HP:0025586"]
},
"n1": {
"categories":["biolink:Disease"]
}
},
"edges": {
"eA": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:phenotype_of"]
}
}
}
}
}
```
This happens because ID/node-expansion finds a descendant for 1 of the starting IDs. Console logs:
```
bte:biothings-explorer-trapi:main Expanded ids for node n0: (2 ids -> 3 ids) +0ms
bte:biothings-explorer-trapi:main Added is_set:true to node n0 +1ms
```
This query has 130 results whether `is_set: true` or not, when it should have >= 134 results when not. It also has some subclass_of edges/aux-graphs, but I'm not sure if it's a good test for seeing if ID/node-expansion becomes wonky after the changes.
```
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"categories":["biolink:PhenotypicFeature"],
"ids":["HP:0003259", "HP:0000110"]
},
"n1": {
"categories":["biolink:Disease"]
}
},
"edges": {
"eA": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:phenotype_of"]
}
}
}
}
}
```
set to BATCH (default)
set to MANY
set to ALL
Example
Note to self with another example
See notes from 2024-03-27 group meeting.
I have a lot of concerns / questions, and we're going to ask the Translator group for clarification. We'll start with asking about the basic requirements / deadlines.