Open tokebe opened 8 months ago
@colleenXu Please review and let me know if this aligns with your understanding and covers all the bases. Additionally, let me know if this explanation is sufficient for you to work on an example template.
Please note that I'm currently asking for clarification regarding the 3rd example question.
I have some feedback on the "Important Considerations". I think it'll be helpful for @tokebe and I to discuss...
I'm unsure of the assumption that the inferred-mode-handler will produce 1 mega-result with many support-graphs after running the templates.
A template (n0 -> inter_1 -> n2) can return > 1 result if the intermediate nodes aren't set to is_set: true
- which is what I plan to do. In this situation, there'll be separate results for each unique set of intermediate nodes, which is kinda partway to the desired output…
Then I'm not sure if the inferred-mode-handler logic will continue to keep those results separate vs merge them into 1 mega-result…
I'm confused on how the number of support-graphs relates to the number of final-formatted results, ex: point 3's "count the number of support graphs, multiplying each by the number of intermediate nodes"
I assumed that there'd be 1 final-formatted result per unique intermediate node…so if that intermediate node was in multiple template results (aka diff support-graphs), those would all be put together into that final-formatted result.
In this situation + our current subclassing code, the subclasses of n0/n2 entity IDs will count as intermediate nodes. Is this a problem / an issue to ask Translator about? I'm not sure if the other teams have implemented this subclassing feature and will encounter this…
I'm not sure on the last point, because I thought the e2 support-graph would still differ between each final-formatted result. It sounds like the e2 support-graph is basically the union of the edge sets in the e0 and e1 subgraphs. AKA it's still a subgraph containing n0, 1 specific intermediate, and n2.
Regarding point 1 of the "Important Considerations", I'm going to review the problem again to see if it's still relevant...
Responding to @colleenXu's feedback:
Jackson @tokebe: here's some slides based on our discussions of this pathfinder prototype so far. It's editable, so you should be able to adjust things. This should be useful for discussions, including with @rjawesome.
On point (2):
Here's some stuff that came out of our 1-on-1 discussion today:
@tokebe @rjawesome
I'm putting the pathfinder template-groups and templates here: https://github.com/biothings/bte_trapi_query_graph_handler/tree/pathfinder-templates/data
[EDIT: The notes below aren't using the potential answers Sui posted]
* assuming the input curies are PUBCHEM.COMPOUND:5291 (imatinib), [MONDO:0004979](https://monarchinitiative.org/MONDO:0004979) (asthma)
  * but I'm seeing some discussion of "allergic asthma". Which may be [MONDO:0004784](https://monarchinitiative.org/MONDO:0004784).
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878837/CaseA_template1.json): gene intermediate. Runs in 1 min 30s, 684 results, top result is KIT
* 2nd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878862/CaseA_template2.json): gene + cell intermediates. Runs in 51s, 380 results, top result is KIT + mast cell
* 3rd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14881778/CaseA_template3.json): gene + PhysiologicalProcess/pathway intermediates. Runs in 5 min 33s, 1876 results. Results include previously interesting PhysiologicalProcess intermediate nodes like "immune response", "bronchoconstriction", "cytokine production"
* assuming the input curies are PUBCHEM.COMPOUND:445154 (Resveratrol), NCBIGene:2739 (glyoxalase, GLO1)
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878894/CaseB_template1.json): 1 gene intermediate. Runs in 28 s, 84 results, result 3 is NFE2L2 (NCBIGene:4780). Didn't see any MAPK.
* 2nd template (response too large to attach): 2 gene intermediates. Runs in 1 min 1 s, 4472 results. Result 10 includes NFE2L2. Result 116 (or around there) is a possible answer (Resveratrol ➡️ MAPK1 (ERK2, NCBIGene:5594) ➡️ NFE2L2 ➡️ GLO1).

**Possible answer notes**

* From [ref](https://pubs.acs.org/doi/10.1021/jf302831d) ([Translator Slack](https://ncatstranslator.slack.com/archives/C06HL343420/p1707263932635189)): Resveratrol ➡️ ERK ➡️ Nrf2 ➡️ genetic element "antioxidant response element" ➡️ glyoxalase
  * ERK (extracellular signal-regulated kinase) genes or proteins:
    * MAPK3 (NCBIGene:5595) aka ERK1
    * MAPK1 (NCBIGene:5594) aka ERK2
  * [NFE2L2](https://en.wikipedia.org/wiki/NFE2L2) gene [NCBIGene:4780](https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=4780&rn=1) == Nrf2 protein ([nuclear factor erythroid 2-related factor 2](https://en.wikipedia.org/wiki/NFE2L2)).
    * BUT there'll be text-mining related confusion >.<. There's another Nrf2 protein (nuclear respiratory factor 2 == GABPA gene) that's also a transcription factor. ([ref](https://pubmed.ncbi.nlm.nih.gov/23597778/), was linked by [this paper section 2.1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6888570/))
  * genetic feature "antioxidant response element": [SRI NameResolver](https://name-lookup.ci.transltr.io/lookup?string=antioxidant%20response%20element&offset=0&limit=10) says there's a MESH ChemicalEntity term for this.
* From [Translator Slack](https://ncatstranslator.slack.com/archives/C06HL343420/p1707257003050949):
  * RELA gene (NCBIGene:5970) == Transcription factor p65 also known as nuclear factor NF-kappa-B p65 subunit protein
* **Negative controls?**
  * Look for MAPK8-14. [Ref](https://pubs.acs.org/doi/10.1021/jf302831d) said it wasn't using [p38](https://en.wikipedia.org/wiki/P38_mitogen-activated_protein_kinases) (MAPK11-14) or c-Jun N-terminal kinase (JNK) pathways (MAPK8-10)
* assuming the input curies are MONDO:0005011 (Crohn disease) and MONDO:0005180 (Parkinsons)
* possible answers:
  * [LRRK2](https://www.nature.com/articles/s41531-021-00170-1)
  * but also, [it's not really clear if there's a genetic link](https://www.nature.com/articles/s41531-022-00318-7)
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14879166/CaseC_template1.json): SequenceVariant intermediate: Runs in 16 s, 1 result [rs2066842 in gene NOD2](https://www.ncbi.nlm.nih.gov/snp/rs2066842)
* 2nd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14879193/CaseC_template2.json): Gene intermediate. Runs in 1 min 55 s, 392 results, 1st result is LRRK2.
* assuming the input curies are NCBIGene:54716 (SLC6A20), [MONDO:0100096](https://monarchinitiative.org/MONDO:0100096) (COVID19)
  * SLC6A20 is the gene name, VS the protein product has multiple names (SIT1, sodium-dependent Imino Transporter 1 / System IMINO transporter, XTRP3)
* possible answers:
  * ACE2 (gene) - [2nd paragraph of this paper](https://www.nature.com/articles/s41421-023-00596-2)
  * [glycine](https://www.degruyter.com/document/doi/10.1515/bmc-2021-0017/html) (amino acid, chemical entity)
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878738/CaseD_template1.json): 1 unconstrained intermediate. Runs in 2 min 26s, 38 results, top result is ACE2. Glycine is also there, as last result.
* 2nd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878784/CaseD_template2.json): chemical + gene/protein intermediates. Runs in 1 min 30s, 186 results. 6th result is the possible answer SLC6A20 ➡️ glycine ➡️ ACE2 ➡️ COVID19. Other results include glycine (1st, 3rd) or ACE2 (11th).
Template guidelines:
* intermediate nodes not set to `is_set: true`: generate separate results per unique node-collection

so one way to restate it is: for each template run, record (number of results) * (number of intermediate nodes in that template's QGraph, will probably be 1 or 2). Sum these together as templates are run. Stop when this number >= 500.
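The stopping rule restated above can be sketched roughly as follows. This is only an illustration in Python, not the BTE implementation: `run_template` is a hypothetical callable that returns a template's result list, and "intermediate node" is assumed to mean any QNode without pinned `ids`.

```python
def intermediate_count(template_qgraph):
    """Count QNodes without pinned IDs (assumed here to be the intermediates)."""
    return sum(1 for node in template_qgraph["nodes"].values() if not node.get("ids"))


def run_until_budget(templates, run_template, budget=500):
    """Run templates in order until the weighted result count reaches `budget`.

    `templates` is a list of template QGraphs; `run_template` is a
    hypothetical callable returning that template's list of results.
    """
    total = 0
    ran = []
    for qgraph in templates:
        results = run_template(qgraph)
        # (number of results) * (number of intermediate nodes in the QGraph)
        total += len(results) * intermediate_count(qgraph)
        ran.append((qgraph, results))
        if total >= budget:
            break  # later templates are never run
    return total, ran
```

With the Case A numbers from this thread (684 results from a 1-intermediate template), the loop would stop after the first template.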
@colleenXu Agreed, this is a good heuristic.
@rjawesome Please note that we've finalized our expectation for how "final" results should be generated when iterating over template results (as inferred-handler result support graphs). This is detailed on slides 9-13 in the above linked slides.
@rjawesome @tokebe
I have an update on point 1 of the "Important Considerations": I can't recreate the buggy behavior, so maybe things are fine?
The previous buggy behavior was: I set up a Pathfinder TRAPI query with two starting IDs/nodes that we shouldn't find any results for - but instead results were returned that connected only to the first starting ID/node. (ref: lab Slack convo starting here)
1. Check out the query-handler pathfinder-templates branch, which also has the check on QNode IDs modified https://github.com/biothings/bte_trapi_query_graph_handler/commit/8deeada313e8547b43acb2074a95f6027eaa85e2 ([commit](https://github.com/biothings/bte_trapi_query_graph_handler/commit/7fe8c27d96b7303d7cd79126031b6f836b1caf83) originally from @tokebe's inferred-explain branch). Be on the main branches for all other modules.
2. Adjust the query-handler `data/templateGroups.json` array so it only contains the `"Pathfinder: Drug-Disease"` template-group object.
3. Set up BTE in "CI" mode (`pnpm build`, `API_OVERRIDE=true INSTANCE_ENV=ci pnpm run smartapi_sync`, then I use `INSTANCE_ENV=ci USE_THREADING=false pnpm start`)
4. Run this query: this is basically what the "artificial query graph" that goes into the inferred-mode-handler will look like = 1 QEdge with predicate related_to and knowledge_type inferred + both QNodes set to IDs.
5. BTE finds NO results which is the expected behavior.

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["PUBCHEM.COMPOUND:5291"],
          "categories": ["biolink:ChemicalEntity"],
          "name": "imatinib"
        },
        "n1": {
          "ids": ["MONDO:0011821"],
          "categories": ["biolink:DiseaseOrPhenotypicFeature"],
          "name": "Meckel syndrome, type 3"
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}
```
But I did hit another bug, which didn't halt execution. I'll open another issue for it.
Functionality should be finished in the `pathfinder` branch of bte_query_graph_handler repo.
I did want to add in a few tests / check over the code a bit more before making a PR. However, it would be a good idea to make sure that the functionality is implemented correctly.
Current test query that I have been using
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"ids": [
"PUBCHEM.COMPOUND:5291"
],
"categories": ["biolink:Drug"]
},
"un": {
"categories": [
"biolink:NamedThing"
]
},
"n2": {
"ids": [
"MONDO:0004979"
]
}
},
"edges": {
"e0": {
"subject": "n0",
"object": "un",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e1": {
"subject": "un",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e2": {
"subject": "n0",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
}
}
}
}
}
@colleenXu @rjawesome I've done a brief code review of the branch, the execution looks pretty straightforward and good to me, so it's on to testing.
There are a couple of notes, which might be better discussed in a draft PR:
* `scores` attribute defined on interface `CreativePathfinderResponse`? I don't see additional references to it in the branch, and IIRC that wouldn't be proper TRAPI.
* Some spots (the `parse` loop) where you've indented quite far due to nested if/else branches. The readability can be improved using condition guards to avoid nesting where practical.

@rjawesome @tokebe
I checked out the pathfinder branch and I can't successfully build. Perhaps the issue is that this branch isn't merged with the latest main?
```
@biothings-explorer/query_graph_handler:build: > @biothings-explorer/query_graph_handler@1.18.0 build /Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler
@biothings-explorer/query_graph_handler:build: > tsc -b
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build: src/batch_edge_query.ts:1:20 - error TS2614: Module '"@biothings-explorer/call-apis"' has no exported member 'RedisClient'. Did you mean to use 'import RedisClient from "@biothings-explorer/call-apis"' instead?
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build: 1 import call_api, { RedisClient } from '@biothings-explorer/call-apis';
@biothings-explorer/query_graph_handler:build: ~~~~~~~~~~~
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build: Found 1 error.
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build: ELIFECYCLE Command failed with exit code 1.
@biothings-explorer/query_graph_handler:build: ERROR: command finished with error: command (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler) /Users/colleenxu/Library/pnpm/pnpm run build exited (1)
@biothings-explorer/query_graph_handler#build: command (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler) /Users/colleenxu/Library/pnpm/pnpm run build exited (1)

Tasks: 9 successful, 10 total
Cached: 9 cached, 10 total
Time: 7.4s
Failed: @biothings-explorer/query_graph_handler#build

ERROR run failed: command exited (1)
ELIFECYCLE Command failed with exit code 1.
```
@rjawesome @tokebe
I've added support for example/cases 2 (chem - gene) and 3 (disease - disease):
That branch also has an update to one of the earlier templates (commit) AND is merged with the latest main.
@rjawesome You'll have to pull in the latest from the `main` branch and fix any merge conflicts
@colleenXu The main branch should be merged now, which fixes the RedisClient error. Also, the new templates from `pathfinder-templates` have been merged into `pathfinder`
@rjawesome: @colleenXu and I ran some testing on the imatinib-asthma example, and we're seeing some odd behavior:
`e0` appears to be actually in the proper format for `e2`, while `e0` shouldn't contain the hop `un->n2`. Every `e1` appears to have a single, empty aux graph (where it should instead contain anything supporting the hop from `un->n2`). Meanwhile, every `e2` is the same edge with hundreds of aux graphs, where it should be exactly as `e0` currently appears (a unique `e2` per result, with one aux graph).

@rjawesome
This goes with Jackson's comment above. I think it's easiest to understand visually w/ screenshots. I'm comparing the pathfinder run to running just the template that it's using. Here's the full response jsons for both, which I viewed in a json-viewer and in ARAX-UI (import -> response):
(Thankfully, this example query is pretty simple: 1 template ran, this template provides unique, single intermediate nodes in each result. So there's a 1-to-1 match between final pathfinder results and the template's results)
This is the bottom result for the template. This intermediate node (FBLN5, NCBIGene:10516) should be removed from the KG, as well as the stuff associated with it (edges + aux-graphs that are unique to this intermediate node's pathfinder result, both the original template stuff and the pathfinder-constructed stuff). ![Screen Shot 2024-04-04 at 2 17 09 PM](https://github.com/biothings/biothings_explorer/assets/43731687/3a515ff6-ce95-4cbc-82fb-bc2952f7053e) But they're still there in pathfinder-response: A KG Node ![Screen Shot 2024-04-04 at 2 20 53 PM](https://github.com/biothings/biothings_explorer/assets/43731687/5a33cfb4-518b-4960-b7b8-1018d87f85a2) A normal edge (from template) ![Screen Shot 2024-04-04 at 2 21 18 PM](https://github.com/biothings/biothings_explorer/assets/43731687/746c5451-1a54-4f1a-910f-8db7ec672cbd) Pathfinder edges and aux-graphs ![Screen Shot 2024-04-04 at 2 25 01 PM](https://github.com/biothings/biothings_explorer/assets/43731687/a19725fa-8fe1-417e-9879-4b3d896f87af) ![Screen Shot 2024-04-04 at 2 25 24 PM](https://github.com/biothings/biothings_explorer/assets/43731687/d6512f0e-464b-446d-a574-b0ce3fa33183)
This is showing the first template's result, with the intermediate node KIT (NCBIGene:3815). We expect e0's support-graph to include all the edges from imatinib (n0) to KIT (un), e1 to include the edge from KIT to asthma (n2), and e2 to include all the edges in this result. ![Screen Shot 2024-04-04 at 2 28 04 PM](https://github.com/biothings/biothings_explorer/assets/43731687/83335b1c-937f-4109-9e55-bc977ec3295b) So then we look at the first pathfinder result... ![Screen Shot 2024-04-04 at 2 32 05 PM](https://github.com/biothings/biothings_explorer/assets/43731687/f60d7696-8735-408c-b928-91293f7b0c11) e0 has all the edges in the result (which is what we wanted for e2) ![Screen Shot 2024-04-04 at 2 33 12 PM](https://github.com/biothings/biothings_explorer/assets/43731687/fb3bac3d-a4d0-446a-85b4-1bd39cbd8e03) e1 has no edges (empty array) ![Screen Shot 2024-04-04 at 2 34 08 PM](https://github.com/biothings/biothings_explorer/assets/43731687/c709053f-040b-4cd0-9109-442408667d72) ![Screen Shot 2024-04-04 at 2 34 28 PM](https://github.com/biothings/biothings_explorer/assets/43731687/34716440-60e1-4880-92e4-5253358fab65) e2 has a ton of support graphs ![Screen Shot 2024-04-04 at 2 35 16 PM](https://github.com/biothings/biothings_explorer/assets/43731687/7beb0917-8b17-4751-85b1-33e33de8222a)
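To make the expected split concrete, here is a rough Python sketch of how the three support-graphs could be derived from one template result. It assumes a single intermediate node and no subclass expansion, and the `result_edges` mapping (edge ID to `(subject, object)`) and the edge IDs are hypothetical, not BTE's internal structures.

```python
def build_support_graphs(result_edges, n0, un, n2):
    """Split one template result's edges into e0/e1/e2 support-graph edge lists.

    `result_edges` maps KG edge ID -> (subject, object). Assumes exactly one
    intermediate node `un` and ignores subclass edges.
    """
    def between(a, b):
        # Edges joining a and b, in either direction
        return sorted(eid for eid, (s, o) in result_edges.items() if {s, o} == {a, b})

    return {
        "e0": between(n0, un),       # hop n0 -> un
        "e1": between(un, n2),       # hop un -> n2
        "e2": sorted(result_edges),  # everything in this result
    }


# Illustrative edge IDs only; node IDs are the imatinib / KIT / asthma example.
edges = {
    "edgeA": ("PUBCHEM.COMPOUND:5291", "NCBIGene:3815"),
    "edgeB": ("NCBIGene:3815", "PUBCHEM.COMPOUND:5291"),
    "edgeC": ("NCBIGene:3815", "MONDO:0004979"),
}
graphs = build_support_graphs(
    edges, "PUBCHEM.COMPOUND:5291", "NCBIGene:3815", "MONDO:0004979"
)
```

Here `graphs["e0"]` holds the two imatinib/KIT edges, `graphs["e1"]` the KIT/asthma edge, and `graphs["e2"]` all three.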
Pruning has been added to pathfinder. Intermediate edges (`e0`/`e1` from Jackson's test) and main edge (`e2` from Jackson's test) have been updated so their auxiliary graphs should be correct now.
@rjawesome @tokebe
I've added support for the last example 4/D (gene - disease):
Should I make template adjustments directly in the `pathfinder` branch from now on?
> Should I make template adjustments directly in the `pathfinder` branch from now on?
@colleenXu Yes, I think that makes sense. It shouldn't cause any merge issues with any work done to code in the branch.
@rjawesome I've reviewed your changes and each result edge looks nearly correct now. I see only one remaining problem -- the now correctly-aux-graph'd `e2` has its subject and object as `n0->un` when it should be `n0->n2` (even though the edge is bound correctly in the result).
`e2`'s aux graph has now been fixed. I've also added some more tests around this behavior.
@tokebe @rjawesome
I think we are preserving the support-graph info for subclass-edges correctly.
However, it's not showing up properly in the ARAX-UI. This is happening both for our "normal" creative-mode and our pathfinder responses. It's odd because I recall this stuff showing up properly in the past.
[Saved response](https://github.com/biothings/biothings_explorer/files/14892715/normal-creative-Acanthosis-nigricans.json) from running "treats"-creative mode for MONDO:0007035 (Acanthosis nigricans). The 4th result has a top-level creative-support-graph. ![Screen Shot 2024-04-05 at 11 53 07 PM](https://github.com/biothings/biothings_explorer/assets/43731687/bce89773-5ee7-4934-a1f2-1ea605832feb) When I go into that support-graph and then look at the pheno edges, all should have support-graphs based on their IDs. Instead, no info is shown - not even source info. ![Screen Shot 2024-04-05 at 11 52 17 PM](https://github.com/biothings/biothings_explorer/assets/43731687/b9495311-c41a-4529-8473-37baf472dbae)
The 5th result in the [template run](https://github.com/biothings/biothings_explorer/files/14878837/CaseA_template1.json) is PDGFRA. When you look at that template's run in ARAX-UI, you can see the support-graph/source info for one of the PDGFRA->asthma edges.
![Screen Shot 2024-04-05 at 11 11 49 PM](https://github.com/biothings/biothings_explorer/assets/43731687/249b2e63-6c55-429e-b772-29e1af58eba4)
But if you look at 5th pathfinder result in ARAX-UI [(saved response)](https://github.com/biothings/biothings_explorer/files/14892656/CaseAPathfinder.json), that same edge now doesn't show any info.
![Screen Shot 2024-04-05 at 11 20 34 PM](https://github.com/biothings/biothings_explorer/assets/43731687/49723765-fee0-40a2-b2dd-4fd76a197188)
When I dig into the pathfinder json, all the info for this subclass-edge/its linked support-graph seems to exist and be properly formatted.
```
"NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass": {
"predicate": "biolink:gene_associated_with_condition",
"subject": "NCBIGene:5156",
"object": "MONDO:0004979",
"attributes": [
{
"attribute_type_id": "biolink:support_graphs",
"value": [
"support0-NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass"
]
}
],
"sources": [
{
"resource_id": "infores:biothings-explorer",
"resource_role": "primary_knowledge_source"
}
]
},
```
```
"support0-NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass": {
"edges": [
"13aa493dafd322cb77c438173de6abd4",
"expanded-MONDO:0005405-subclass_of-MONDO:0004979"
]
},
```
Gene to subclass-disease
```
"13aa493dafd322cb77c438173de6abd4": {
"predicate": "biolink:gene_associated_with_condition",
"subject": "NCBIGene:5156",
"object": "MONDO:0005405",
"attributes": [
{
"attribute_type_id": "biolink:publications",
"value": [
"PMID:16804324"
],
"value_type_id": "linkml:Uriorcurie"
}
],
"sources": [
{
"resource_id": "infores:disgenet",
"resource_role": "primary_knowledge_source"
},
{
"resource_id": "infores:mydisease-info",
"resource_role": "aggregator_knowledge_source",
"upstream_resource_ids": [
"infores:disgenet"
]
},
{
"resource_id": "infores:biothings-explorer",
"resource_role": "aggregator_knowledge_source",
"upstream_resource_ids": [
"infores:mydisease-info"
]
}
]
},
```
subclass-disease to main-disease
```
"expanded-MONDO:0005405-subclass_of-MONDO:0004979": {
"predicate": "biolink:subclass_of",
"subject": "MONDO:0005405",
"object": "MONDO:0004979",
"attributes": [],
"sources": [
{
"resource_id": "infores:mondo",
"resource_role": "primary_knowledge_source"
},
{
"resource_id": "infores:biothings-explorer",
"resource_role": "aggregator_knowledge_source"
}
]
},
```
subclass-disease node exists as well
```
"MONDO:0005405": {
"categories": [
"biolink:Disease"
],
"name": "childhood onset asthma",
"attributes": [
{
"attribute_type_id": "biolink:xref",
"value": [
"MONDO:0005405",
"DOID:0080815",
"UMLS:C0264408",
"MEDDRA:10081274",
"SNOMEDCT:233678006"
]
},
{
"attribute_type_id": "biolink:synonym",
"value": [
"childhood onset asthma",
"childhood-onset asthma",
"Childhood asthma"
]
}
]
},
```
The blocks above show, in order: the subclass edge, the subclass support-graph, and the support-graph's edges + subclass-disease node.
This post will be recording what tests I'm running, the response-jsons, basic response stats, and other notes. I'll raise errors/problems in separate comments.
Different starting query topologies (does it correctly throw error or continue execution):
* Only two edges (correct error)
* Different edge directions (correct error)
* Different node/edge labels (correctly continues execution)
* Don't include categories on starting ID nodes (correctly continues execution)

imatinib -> Meckel syndrome, type 3 (MONDO:0011821): (chem - disease) NEGATIVE CONTROL from [previous comment](https://github.com/biothings/biothings_explorer/issues/794#issuecomment-2014500106)
* runs in 29s
* 0 results! after running all 3 templates.
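The topology cases tested above suggest a check along these lines. This is a guess at the rule in Python, not BTE's actual validation code (which isn't shown in this thread): it assumes a valid Pathfinder query has two QNodes with `ids`, one without, and three edges running start->intermediate, intermediate->end, and start->end, regardless of labels or categories.

```python
def check_pathfinder_qgraph(qgraph):
    """Return (start, intermediate, end) QNode keys, or raise ValueError."""
    nodes, edges = qgraph["nodes"], qgraph["edges"]
    if len(nodes) != 3 or len(edges) != 3:
        raise ValueError("expected exactly 3 nodes and 3 edges")
    pinned = [key for key, node in nodes.items() if node.get("ids")]
    unpinned = [key for key, node in nodes.items() if not node.get("ids")]
    if len(pinned) != 2 or len(unpinned) != 1:
        raise ValueError("expected 2 nodes with ids and 1 without")
    un = unpinned[0]
    pairs = {(edge["subject"], edge["object"]) for edge in edges.values()}
    # Try both orderings of the pinned nodes; labels themselves don't matter.
    for start, end in (pinned, pinned[::-1]):
        if pairs == {(start, un), (un, end), (start, end)}:
            return start, un, end
    raise ValueError("edges must run start->intermediate, intermediate->end, start->end")
```

Under these assumptions, a two-edge query or a reversed `e1` raises, while renamed node/edge keys or missing `categories` pass.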
Noting my possible answers and Sui's possible answers.
Case A (asthma) is an example of truncating the 1st template's results to get a 500 result set.
Case A (allergic asthma) and D have results/intermediate nodes that were found in multiple templates (showing that the merging code worked as-intended).
imatinib (PUBCHEM.COMPOUND:5291) -> asthma (MONDO:0004979) [(saved response)](https://github.com/biothings/biothings_explorer/files/14892656/CaseAPathfinder.json):
* runs in 2 min 11s
* 500 results
* Only runs 1st template and prunes extra template results
* found Sui's possible answers
  * KIT: top result
  * SCF (aka KITLG, KIT ligand): 359th result

imatinib -> allergic asthma (MONDO:0004784) [(saved response)](https://github.com/biothings/biothings_explorer/files/14892782/CaseAPathfinder-allergicAsthma.json):
* runs in 1 min 38s
* 419 results
* Runs all 3 templates, results only from 1st and third. Doesn't prune any template results.
* found Sui's possible answers
  * KIT: 15th result
  * SCF (aka KITLG, KIT ligand): 279th result
* found my possible answers
  * immune response: 3rd result
Resveratrol (PUBCHEM.COMPOUND:445154) -> glyoxalase, GLO1 (NCBIGene:2739) [(saved response)](https://github.com/biothings/biothings_explorer/files/15014937/CaseBPathfinder_simple.json)
* runs in 32s
* 84 results
* Runs only 1 simple template
* found 1 of Sui's possible answers
  * NFE2L2: 3rd result
Crohn Disease (MONDO:0005011) -> Parkinson Disease (MONDO:0005180) [(saved response)](https://github.com/biothings/biothings_explorer/files/14892832/CaseCPathfinder.json)
* runs in 2 min 22s
* 393 results
* Runs both templates, results from both. Doesn't prune any template results.
* found all Sui's possible answers? (I'm not sure if Sui meant MOD2 gene or NOD2 gene. We have NOD2 variant rs2066842 as top result + NOD2 gene as 7th result)
  * LRRK2: 2nd result
  * PARK7: 3rd result
SLC6A20 (NCBIGene:54716) -> COVID19 (MONDO:0100096) [(saved response)](https://github.com/biothings/biothings_explorer/files/14939017/CaseDPathfinder.json)
* runs in 4 min 7s
* 116 results
* Runs both templates, results from both. Doesn't prune any template results.
* found Sui's possible answers
  * ACE2: top result (graph includes glycine)
  * CXCL8: result 9
* found my possible answers
  * glycine: 3rd result
@rjawesome @tokebe
A problem: pathfinder doesn't find templates for Case B (chem - gene). I'm not sure what's going on.
``` { "message": { "query_graph": { "nodes": { "n0": { "ids": ["PUBCHEM.COMPOUND:445154"], "categories":["biolink:ChemicalEntity"], "name": "Resveratrol" }, "un": { "categories": ["biolink:NamedThing"] }, "n2": { "ids": ["NCBIGene:2739"], "categories":["biolink:Gene"], "name": "glyoxalase, GLO1" } }, "edges": { "e0": { "subject": "n0", "object": "un", "predicates": ["biolink:related_to"], "knowledge_type": "inferred" }, "e1": { "subject": "un", "object": "n2", "predicates": ["biolink:related_to"], "knowledge_type": "inferred" }, "e2": { "subject": "n0", "object": "n2", "predicates": ["biolink:related_to"], "knowledge_type": "inferred" } } } } } ```
@colleenXu I'll be working on the pathfinder prototype this week as Rohan is unavailable.
Regarding ARAX UI problems, that might be worth reporting to them -- otherwise it's a good note that we should trust our own JSON analysis first.
I'll take a look into the Case B issue.
@tokebe Whoops I didn't set the pathfinder flag on the Case B template group. Added this in a recent commit. Haven't analyzed the behavior yet though.
There's still a problem running Case B. The 2nd template runs quickly (1 min 1s), but returns a lot of results (4472). Inferred-mode then seems to get stuck on "merging" all of the results into 1 mega-result/creative-edge - it may take ~ 1 hour? And then Pathfinder also seems to get stuck finding the intermediate nodes (I didn't wait for it to complete).
I was thinking of Case B as testing multiple things that don't happen with the other cases:
@tokebe For tomorrow's deployments, I've made a branch pathfinder-simpleCaseB that doesn't use the 2nd chem-gene template. BTE will then successfully run the chem-gene example (CaseB) - but it won't find much.
Case B should be fixed. There was an unnecessary while loop that was causing the issues in the inferred mode handler. For the intermediate nodes, the "paths" involved were getting too long, so I changed it so that each "path" only uses edges from one template result (i.e. each path includes only one pair of intermediate genes), while each intermediate node merges all the "paths" that include it. Previously the paths were getting too long by combining many edges from different template results.
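As a rough picture of the per-result approach described above (a Python sketch only, not the code in the `pathfinder` branch), each template result contributes exactly one path, and every intermediate node collects the paths it appears in. The result shape (`nodes` mapping QNode key to one bound ID, `edges` as a list of edge IDs) and the placeholder IDs are assumptions for illustration.

```python
from collections import defaultdict


def merge_paths_by_intermediate(template_results, endpoints=("n0", "n2")):
    """Map each intermediate node ID to the list of paths that pass through it.

    Each path is the tuple of edge IDs from ONE template result, so a path is
    never longer than what the template specifies.
    """
    merged = defaultdict(list)
    for result in template_results:
        path = tuple(result["edges"])
        for qnode, node_id in result["nodes"].items():
            if qnode not in endpoints and path not in merged[node_id]:
                merged[node_id].append(path)
    return dict(merged)


# Toy data with placeholder IDs: two results sharing the intermediate G3.
results = [
    {"nodes": {"n0": "X", "a": "G1", "b": "G3", "n2": "Y"},
     "edges": ["X-G1", "G1-G3", "G3-Y"]},
    {"nodes": {"n0": "X", "a": "G2", "b": "G3", "n2": "Y"},
     "edges": ["X-G2", "G2-G3", "G3-Y"]},
]
merged = merge_paths_by_intermediate(results)
```

In this toy input, `G3` ends up with two 3-edge paths while `G1` and `G2` each get one.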
Note:
In the Translator Architecture 4/23 call, the UI team said they'll handle "4-hop paths" (aka 4 edges long).
I think we'll stay at/under that limit with our current Pathfinder templates. All are 2-3 QEdges long.
There's 1 potential case where BTE would generate 5-edge paths: if it ran the 2nd/3rd "Chem-Disease" templates (3 QEdges) and results involved descendants of both the chemical and the disease starting-ID (+2 `subclass_of` edges). However, I think it's relatively rare for us to do subclass-expansion on chemical starting-IDs.
@rjawesome Does your optimization change the output at all?
It basically just limits the length of result "paths," so it doesn't compute graphs that have more hops than what is specified in the template (excluding subclass hops).
So, if I'm understanding correctly, you've changed the implementation to be more like that specified in the slides (building the new aux graphs by iterating over each template result), whereas before you were merging multiple template results and then performing a DFS on them?
If not, can you briefly describe the steps in your current implementation, comparing them to the approach in the slides?
> So, if I'm understanding correctly, you've changed the implementation to be more like that specified in the slides (building the new aux graphs by iterating over each template result), whereas before you were merging multiple template results and then performing a DFS on them?
Yes.
@tokebe @rjawesome
I think there's a problem! The new code is giving different output with fewer results, missing KG edges, and different scores.
I saw this with Case A allergic asthma:
``` { "message": { "query_graph": { "nodes": { "n0": { "ids": ["PUBCHEM.COMPOUND:5291"], "categories":["biolink:ChemicalEntity"], "name": "imatinib" }, "un": { "categories": ["biolink:NamedThing"] }, "n2": { "ids": ["MONDO:0004784"], "categories":["biolink:DiseaseOrPhenotypicFeature"], "name": "allergic asthma" } }, "edges": { "e0": { "subject": "n0", "object": "un", "predicates": ["biolink:related_to"], "knowledge_type": "inferred" }, "e1": { "subject": "un", "object": "n2", "predicates": ["biolink:related_to"], "knowledge_type": "inferred" }, "e2": { "subject": "n0", "object": "n2", "predicates": ["biolink:related_to"], "knowledge_type": "inferred" } } } } } ```
Here's what I found when digging in:
```
bte:biothings-explorer-trapi:inferred-mode pruning creative combinedResponse nodes/edges... +0ms
bte:biothings-explorer-trapi:inferred-mode pruned 75 nodes, 246 edges, 0 auxGraphs from combinedResponse. +4ms
bte:biothings-explorer-trapi:pathfinder [Pathfinder]: Performing search for intermediate nodes. +2m
bte:biothings-explorer-trapi:pathfinder [Pathfinder]: Pathfinder found 344 intermediate nodes and created 1032 support graphs. +28ms
bte:biothings-explorer-trapi:inferred-mode pruning creative combinedResponse nodes/edges... +32ms
bte:biothings-explorer-trapi:inferred-mode pruned 0 nodes, 1 edges, 386 auxGraphs from combinedResponse. +7ms
```
Previous run: ![Screen Shot 2024-05-01 at 11 08 49 PM](https://github.com/biothings/biothings_explorer/assets/43731687/e656e823-53c0-45e8-a628-05592a68bb8e) Current run: ![Screen Shot 2024-05-01 at 11 08 42 PM](https://github.com/biothings/biothings_explorer/assets/43731687/2a79dfbc-a55f-4d37-8af2-5d00c1b2f000)
I accidentally introduced a bug when speeding up the while loop that assigned support graph suffixes in the inferred mode handler. Should be fixed now.
It looks good!
EDIT: First, I've rerun all the "working" cases (not Case B).
For Case A allergic asthma (new saved response), I now see the same number of results (419), KG nodes and edges, and aux-graphs as before. And for all the cases, the interesting results from before are still present.
I see some differences between the runs now and the previous runs, but I think these are okay:
Some cases ran faster than before:
Other cases ran slower than before:
@tokebe @rjawesome
Something else is going on with Pathfinder and Case B, and I can't tell if it's okay or a sign of a truncation problem.
The good news is that it now ran both templates in 2 min 16 s (much better than running forever!). As a reminder, the second template returns >4000 results (>1000 nodes and >7000 edges) that need truncating.
Here's a Google Drive folder w/ my Case B Pathfinder run and an old run of the 2nd template I'm comparing it to (it's not an exact match to the Pathfinder's 2nd template run, but I think it's close enough for what I want to demonstrate).
What I'm seeing: while there's only 500 results in the Pathfinder run...
- `NCBIGene:821` CANX from the 4001st result in the old template 2 run
- `NCBIGene:7266` DNAJC7 from the 4005th result in the old template 2 run

I didn't notice any truncation issues for Case A asthma and Case C (see my previous notes).
There should definitely be a large number of nodes and edges that aren't bound to a result directly -- we'd expect a lot of nodes that are exclusively bound to an edge used in a support graph for an edge bound to a result, which could leave a lot of extra nodes and edges that don't have an immediately obvious reason for existing.
It could still be the case that there are nodes and edges that aren't properly truncated. I think the only way we can meaningfully check this is by writing a script that parses a response and checks that every node/edge somehow links (directly or indirectly) to a result. It would have to start with results and then work its way out to build lists of bound edges/nodes/support graph IDs, and then check those lists against the actual KG and support graph set. @rjawesome could you put together such a script? We'd probably want to adapt it to an integration test later, so it would see use beyond just checking this one time.
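For reference, a rough sketch of what such a check could look like, written as a standalone Python function rather than as BTE code. It assumes a TRAPI 1.4-style response: support graphs attached to KG edges via an attribute with `attribute_type_id` of `biolink:support_graphs`, an optional `support_graphs` list on each analysis, and a top-level `auxiliary_graphs` map whose entries have an `edges` list. The function name `find_unreachable` is made up for illustration.

```python
def find_unreachable(response: dict) -> dict:
    """List KG nodes, KG edges and auxiliary graphs that no result leads to.

    Assumes a TRAPI 1.4-style layout (see the assumptions in the lead-in).
    """
    message = response["message"]
    kg = message.get("knowledge_graph") or {}
    kg_nodes = kg.get("nodes") or {}
    kg_edges = kg.get("edges") or {}
    aux_graphs = message.get("auxiliary_graphs") or {}

    seen_nodes, seen_edges, seen_aux = set(), set(), set()
    edge_queue = []

    def visit_aux(aux_id):
        # Mark a support graph as used and queue the edges it contains.
        if aux_id in seen_aux:
            return
        seen_aux.add(aux_id)
        edge_queue.extend((aux_graphs.get(aux_id) or {}).get("edges") or [])

    # Start from the results: bound nodes, bound edges, analysis support graphs.
    for result in message.get("results") or []:
        for bindings in (result.get("node_bindings") or {}).values():
            seen_nodes.update(b["id"] for b in bindings)
        for analysis in result.get("analyses") or []:
            for bindings in (analysis.get("edge_bindings") or {}).values():
                edge_queue.extend(b["id"] for b in bindings)
            for aux_id in analysis.get("support_graphs") or []:
                visit_aux(aux_id)

    # Work outward: each edge brings in its endpoints and its support graphs.
    while edge_queue:
        edge_id = edge_queue.pop()
        if edge_id in seen_edges:
            continue
        seen_edges.add(edge_id)
        edge = kg_edges.get(edge_id)
        if edge is None:
            continue  # binding points at an edge missing from the KG
        seen_nodes.update((edge["subject"], edge["object"]))
        for attr in edge.get("attributes") or []:
            if attr.get("attribute_type_id") == "biolink:support_graphs":
                for aux_id in attr.get("value") or []:
                    visit_aux(aux_id)

    return {
        "nodes": sorted(set(kg_nodes) - seen_nodes),
        "edges": sorted(set(kg_edges) - seen_edges),
        "auxiliary_graphs": sorted(set(aux_graphs) - seen_aux),
    }
```

Loading a saved response with `json.load` and passing it in should give three empty lists if pruning worked. Anything listed is a candidate for a truncation problem.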
I added a test here for pathfinder in particular: https://github.com/biothings/bte_trapi_query_graph_handler/blob/894bbb0e53148035ab73cd44ca4f22e3af5e6fb1/__test__/unittest/pathfinder.test.ts#L103-L146
If `pfResponse` were read from a file, then this could function as a "script" to check any given TRAPI response.
Did some messing around with @rjawesome's test to make a script and was able to confirm that yes, pruning is working as expected. Case B just creates huge support graphs which results in many many edges.
One priority for the current Translator sprint is a working Pathfinder prototype. This prototype must satisfy a specific input/output format, and should return adequate results for 4 example queries.
Problem Overview
Query format
```json
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["some:CURIE"]
        },
        "un": {
          "categories": ["biolink:NamedThing"]
        },
        "n2": {
          "ids": ["some:CURIE"]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}
```

The result format roughly matches the input format: 3 primary edges, with the two "pinned" query nodes and some intermediate node, and each edge being "artificial", with an associated support graph, as in preset inferred-mode queries.
Example result
```json
{
  "node_bindings": {
    "n0": [{"id": "n0_pinned_node"}],
    "un": [{"id": "some_intermediate_node"}],
    "n2": [{"id": "n2_pinned_node"}]
  },
  "analyses": [
    {
      "resource_id": "infores:biothings-explorer",
      "edge_bindings": {
        "e0": [{"id": "inferred-n0-related_to-un"}], // has support graph
        "e1": [{"id": "inferred-un-related_to-n2"}], // has support graph
        "e2": [{"id": "inferred-n0-related_to-n2"}] // has support graph
      },
      "score": 1
    }
  ]
}
```

Our 4 example queries are as follows:
The important differences are that:
Explaining further, for every intermediate node between the two pinned nodes, BTE must generate a result with that intermediate node as the unpinned node, and support graphs for edges on either side representing the rest of the path on either side of that node, as well as the "overall" edge having a support graph representing the full path.
This does mean that BTE will be generating many "redundant" results which bind essentially the same information (aside from the unpinned node) in different "view-frames".
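To make the "view-frames" idea concrete, here is a minimal Python sketch of expanding one answer path into results. It is illustrative only, not BTE's implementation: `expand_path` and its return shape are hypothetical, the inferred edge IDs just follow the naming pattern from the example result above, and it handles a single path (merging support graphs across multiple paths is left out).

```python
def expand_path(node_ids, edge_ids):
    """Expand one answer path into one result per intermediate node.

    node_ids: ordered node IDs along the path; first and last are the pinned nodes.
    edge_ids: KG edge IDs joining consecutive nodes, so len(node_ids) - 1 items.
    Returns (results, support), where support maps each inferred edge ID to the
    KG edges its support graph would contain for this path.
    """
    if len(edge_ids) != len(node_ids) - 1:
        raise ValueError("expected exactly one edge between each pair of nodes")

    start, end = node_ids[0], node_ids[-1]
    overall = f"inferred-{start}-related_to-{end}"
    results = []
    support = {}

    for i in range(1, len(node_ids) - 1):
        mid = node_ids[i]
        left = f"inferred-{start}-related_to-{mid}"
        right = f"inferred-{mid}-related_to-{end}"
        support[left] = list(edge_ids[:i])    # path from n0 up to this node
        support[right] = list(edge_ids[i:])   # path from this node on to n2
        support[overall] = list(edge_ids)     # full path, same for every result
        results.append({
            "node_bindings": {
                "n0": [{"id": start}],
                "un": [{"id": mid}],
                "n2": [{"id": end}],
            },
            "analyses": [{
                "resource_id": "infores:biothings-explorer",
                "edge_bindings": {
                    "e0": [{"id": left}],
                    "e1": [{"id": right}],
                    "e2": [{"id": overall}],
                },
                "score": 1,
            }],
        })

    return results, support
```

For a path with two intermediate nodes this yields two results that bind the same `e2` edge and differ only in the unpinned node and in where the path is split between `e0` and `e1`, which is the redundancy described above.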
Approach
In order to approach this problem within BTE's existing system, several steps must occur in query execution:
`templateGroups` file with the flag `"pathfinder": true` and ensuring that flag is checked when obtaining Pathfinder templates.

Important Considerations
These steps should be fairly straightforward to implement, with a few complications:
`e2` edge (and associated support graph) for each result a given answer path generates.