biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://api.bte.ncats.io
Apache License 2.0

Pathfinder Prototype #794

Open tokebe opened 4 months ago

tokebe commented 4 months ago

One priority for the current Translator sprint is a working Pathfinder prototype. This prototype must satisfy a specific input/output format, and should return adequate results for 4 example queries.

Problem Overview

Query format

```json
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": { "ids": ["some:CURIE"] },
        "un": { "categories": ["biolink:NamedThing"] },
        "n2": { "ids": ["some:CURIE"] }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}
```

The result format roughly mirrors the input format: 3 primary edges connecting the two "pinned" query nodes and some intermediate node. Each edge is "artificial" and has an associated support graph, as in preset inferred-mode queries.

Example result

```json
{
  "node_bindings": {
    "n0": [{"id": "n0_pinned_node"}],
    "un": [{"id": "some_intermediate_node"}],
    "n2": [{"id": "n2_pinned_node"}]
  },
  "analyses": [
    {
      "resource_id": "infores:biothings-explorer",
      "edge_bindings": {
        "e0": [{"id": "inferred-n0-related_to-un"}], // has support graph
        "e1": [{"id": "inferred-un-related_to-n2"}], // has support graph
        "e2": [{"id": "inferred-n0-related_to-n2"}] // has support graph
      },
      "score": 1
    }
  ]
}
```

Our 4 example queries are as follows:

The important differences are that:

  1. There are 3 inferred edges each with support graphs, rather than 1 as in previous inferred-mode queries,
  2. For every intermediate node in an "answer" to the query, BTE must generate a result.

Explaining further, for every intermediate node between the two pinned nodes, BTE must generate a result with that intermediate node as the unpinned node, and support graphs for edges on either side representing the rest of the path on either side of that node, as well as the "overall" edge having a support graph representing the full path.

This does mean that BTE will be generating many "redundant" results which bind essentially the same information (aside from the unpinned node) in different "view-frames".
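To make the "view-frame" idea concrete, here is a minimal Python sketch that splits one linear path into one result per intermediate node. It assumes a path is just an ordered list of node IDs and edge IDs; the function name `split_path_results` and the `support` field are hypothetical and only stand in for the support graphs that the real inferred edges would carry.

```python
from typing import Dict, List


def split_path_results(nodes: List[str], edges: List[str]) -> List[Dict]:
    """Build one Pathfinder-style result per intermediate node of a linear path.

    nodes: ordered node IDs from the n0 pinned node to the n2 pinned node.
    edges: ordered edge IDs, where edges[i] connects nodes[i] and nodes[i + 1].
    """
    if len(edges) != len(nodes) - 1:
        raise ValueError("a linear path needs exactly len(nodes) - 1 edges")
    results = []
    for i in range(1, len(nodes) - 1):
        results.append({
            "node_bindings": {
                "n0": [{"id": nodes[0]}],
                "un": [{"id": nodes[i]}],
                "n2": [{"id": nodes[-1]}],
            },
            # Hypothetical field: the edges each inferred edge's support graph would hold.
            "support": {
                "e0": edges[:i],  # path from n0 up to the intermediate node
                "e1": edges[i:],  # path from the intermediate node to n2
                "e2": edges[:],   # the full path, identical in every result
            },
        })
    return results
```

A path with two intermediate nodes therefore yields two results that differ only in which node is bound to `un` and where the path is cut between `e0` and `e1`.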

Approach

In order to approach this problem within BTE's existing system, several steps must occur in query execution:

  1. Recognize a Pathfinder query: Recognize this specific query structure and enter a specific query execution mode/control-flow.
  2. Select templates: Select templates separately from the existing templates. This may be accomplished by registering templates in the templateGroups file with the flag "pathfinder": true and ensuring that flag is checked when obtaining Pathfinder templates.
  3. Execute templates: Fill out these templates and execute them in the normal inferred-mode way, resulting in a merged result set of each template.
  4. Special results formatting: Iterate over the existing result support graphs to generate a new results set with proper bindings and structure.
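As a rough illustration of step 1, the Python sketch below checks whether a query graph has the Pathfinder shape: two pinned nodes, one unpinned node, and three inferred edges running start -> intermediate, intermediate -> end, and start -> end. The function name `pathfinder_nodes` is hypothetical and this is not BTE's actual check; it deliberately ignores node/edge labels so that differently named keys still match.

```python
from typing import Optional, Tuple


def pathfinder_nodes(query_graph: dict) -> Optional[Tuple[str, str, str]]:
    """Return (start, intermediate, end) node keys if the graph has the Pathfinder shape, else None."""
    nodes = query_graph.get("nodes", {})
    edges = query_graph.get("edges", {})
    if len(nodes) != 3 or len(edges) != 3:
        return None
    if any(edge.get("knowledge_type") != "inferred" for edge in edges.values()):
        return None
    pinned = [key for key, node in nodes.items() if node.get("ids")]
    unpinned = [key for key in nodes if key not in pinned]
    if len(pinned) != 2 or len(unpinned) != 1:
        return None
    middle = unpinned[0]
    pairs = {(edge.get("subject"), edge.get("object")) for edge in edges.values()}
    # Try both orientations of the pinned nodes; edges must run start -> middle -> end.
    for start, end in (pinned, pinned[::-1]):
        if pairs == {(start, middle), (middle, end), (start, end)}:
            return start, middle, end
    return None
```

Returning the three node keys (rather than a bare boolean) lets later steps refer to the pinned and unpinned nodes without assuming they are called `n0`, `un`, and `n2`.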

Important Considerations

These steps should be fairly straightforward to implement, with a few complications:

tokebe commented 4 months ago

@colleenXu Please review and let me know if this aligns with your understanding and covers all the bases. Additionally, let me know if this explanation is sufficient for you to work on an example template.

tokebe commented 4 months ago

Please note that I'm currently asking for clarification regarding the 3rd example question.

colleenXu commented 4 months ago

I have some feedback on the "Important Considerations". I think it'll be helpful for @tokebe and I to discuss...

(1)

I'm unsure of the assumption that the inferred-mode-handler will produce 1 mega-result with many support-graphs after running the templates.

A template (n0 -> inter_1 -> n2) can return > 1 result if the intermediate nodes aren't set to is_set: true - which is what I plan to do. In this situation, there'll be separate results for each unique set of intermediate nodes, which is kinda partway to the desired output…

Then I'm not sure if the inferred-mode-handler logic will continue to keep those results separate vs merge them into 1 mega-result…

(2)

I'm confused on how the number of support-graphs relates to the number of final-formatted results, ex: point 3's "count the number of support graphs, multiplying each by the number of intermediate nodes"

I assumed that there'd be 1 final-formatted result per unique intermediate node…so if that intermediate node was in multiple template results (aka diff support-graphs), those would all be put together into that final-formatted result.

(3)

In this situation + our current subclassing code, the subclasses of n0/n2 entity IDs will count as intermediate nodes. Is this a problem / an issue to ask Translator about? I'm not sure if the other teams have implemented this subclassing feature and will encounter this…

(4)

I'm not sure on the last point, because I thought the e2 support-graph would still differ between each final-formatted result. It sounds like the e2 support-graph is basically the union of the edge sets in the e0 and e1 subgraphs. AKA it's still a subgraph containing n0, 1 specific intermediate, and n2.

colleenXu commented 4 months ago

Regarding point 1 of the "Important Considerations", I'm going to review the problem again to see if it's still relevant...

tokebe commented 4 months ago

Responding to @colleenXu's feedback:

  1. That's not an assumption; it's a summary of how the inferred-mode handler (currently) works. The inferred-mode handler doesn't maintain template results, it completely mutates them when merging the template response. Template results (of which there are many in conventional inferred mode as well) are mapped back to the one-hop inferred query (hence why I expect we'll need to generate an "artificial" one-hop query to work with the handler) based on its two nodes. Since every result will map to those two nodes, and they're both pinned in the query, every template result will be merged regardless of pathway, with the pathways being represented as support graphs. This is how the handler has worked since the support graph refactor.
  2. You're correct, what I wrote was a messy heuristic. I see where more complicated score merging might come from, since multiple support graphs could have overlapping intermediate nodes which nominally would show up in some sort of combined state for the final result format. That said, I think for this prototype we should keep it simple and assume that support graphs don't overlap, even if they do. That simplifies the traversal code significantly (and thus the time to implement) at the cost of maybe a few result topologies, which should be absolutely fine for a rough prototype.
  3. I don't think we need to ask about how to deal with this. UI can handle nested support graphs, so we just need to make sure we're handling subclass support graphs specially. We can either leave them as support graphs on edges within the result support graph, or we can flatten them out to the same level. @rjawesome I leave it to you to decide whichever is easiest to implement.
  4. Well, yes, it's a union of the e0 and e1 support graphs. Which... means it'll have every intermediate node? I'm not sure why you think it'd have 1 specific intermediate and not every intermediate, which would be contained within the support graphs being union'd... Regardless, the intent for that edge that we've been told is that it's the entire pathway, nothing cut. So, it'd be the same pathway for every result that is using two subsets of that pathway.

colleenXu commented 4 months ago

Update

Jackson @tokebe: here's some slides based on our discussions of this pathfinder prototype so far. It's editable, so you should be able to adjust things. This should be useful for discussions, including with @rjawesome.

On point (2):


Here's some stuff that came out of our 1-on-1 discussion today:

  1. We're now on the same page: at the end of running templates/"normal inferred-mode execution", there'll be 1 result with 1 "mega-edge" between the pinned nodes n0 and n2. This "mega-edge" will have tons of support-graphs. 1 of these support-graphs = 1 result from a template
  2. See question above. I think we're basically on the same page, but I want to clarify the wording.
  3. @rjawesome We basically agreed that the nested subgraphs for subclassing stuff should be basically ignored for this formatting. It'd be more work to unpack them and it doesn't feel quite right to treat the subclass nodes as "intermediate node answers".
  4. We're on the same page regarding e2 now. When iterating through each support-graph in the "mega-edge", this support-graph can be kept and used for e2. If there's multiple intermediate nodes that get split into different "new results", they can use the same e2/subgraph-ref.

colleenXu commented 4 months ago

@tokebe @rjawesome

I'm putting the pathfinder template-groups and templates here: https://github.com/biothings/bte_trapi_query_graph_handler/tree/pathfinder-templates/data

[EDIT: The notes below aren't using the potential answers Sui posted]

Notes on Case A: how does imatinib affect asthma? (drug - disease)

* assuming the input curies are PUBCHEM.COMPOUND:5291 (imatinib), [MONDO:0004979](https://monarchinitiative.org/MONDO:0004979) (asthma)
  * but I'm seeing some discussion of "allergic asthma". Which may be [MONDO:0004784](https://monarchinitiative.org/MONDO:0004784).
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878837/CaseA_template1.json): gene intermediate. Runs in 1 min 30s, 684 results, top result is KIT
* 2nd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878862/CaseA_template2.json): gene + cell intermediates. Runs in 51s, 380 results, top result is KIT + mast cell
* 3rd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14881778/CaseA_template3.json): gene + PhysiologicalProcess/pathway intermediates. Runs in 5 min 33s, 1876 results. Results include previously interesting PhysiologicalProcess intermediate nodes like "immune response", "bronchoconstriction", "cytokine production"

Notes on Case B: how does resveratrol affect glyoxalase? (chemical - gene)

* assuming the input curies are PUBCHEM.COMPOUND:445154 (Resveratrol), NCBIGene:2739 (glyoxalase, GLO1)
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878894/CaseB_template1.json): 1 gene intermediate. Runs in 28 s, 84 results, result 3 is NFE2L2 (NCBIGene:4780). Didn't see any MAPK.
* 2nd template (response too large to attach): 2 gene intermediates. Runs in 1 min 1 s, 4472 results. Result 10 includes NFE2L2. Result 116 (or around there) is a possible answer (Resveratrol ➡️ MAPK1 (ERK2, NCBIGene:5594) ➡️ NFE2L2 ➡️ GLO1).

**Possible answer notes**

* From [ref](https://pubs.acs.org/doi/10.1021/jf302831d) ([Translator Slack](https://ncatstranslator.slack.com/archives/C06HL343420/p1707263932635189)): Resveratrol ➡️ ERK ➡️ Nrf2 ➡️ genetic element "antioxidant response element" ➡️ glyoxalase
  * ERK (extracellular signal-regulated kinase) genes or proteins:
    * MAPK3 (NCBIGene:5595) aka ERK1
    * MAPK1 (NCBIGene:5594) aka ERK2
  * [NFE2L2](https://en.wikipedia.org/wiki/NFE2L2) gene [NCBIGene:4780](https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=4780&rn=1) == Nrf2 protein ([nuclear factor erythroid 2-related factor 2](https://en.wikipedia.org/wiki/NFE2L2)).
    * BUT there'll be text-mining related confusion >.<. There's another Nrf2 protein (nuclear respiratory factor 2 == GABPA gene) that's also a transcription factor. ([ref](https://pubmed.ncbi.nlm.nih.gov/23597778/), was linked by [this paper section 2.1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6888570/))
  * genetic feature "antioxidant response element": [SRI NameResolver](https://name-lookup.ci.transltr.io/lookup?string=antioxidant%20response%20element&offset=0&limit=10) says there's a MESH ChemicalEntity term for this.
* From [Translator Slack](https://ncatstranslator.slack.com/archives/C06HL343420/p1707257003050949):
  * RELA gene (NCBIGene:5970) == Transcription factor p65 also known as nuclear factor NF-kappa-B p65 subunit protein
* **Negative controls?**
  * Look for MAPK8-14. [Ref](https://pubs.acs.org/doi/10.1021/jf302831d) said it wasn't using [p38](https://en.wikipedia.org/wiki/P38_mitogen-activated_protein_kinases) (MAPK11-14) or c-Jun N-terminal kinase (JNK) pathways (MAPK8-10)

Notes on Case C: is there a possible genetic link between Crohn disease and Parkinson disease? (disease - disease)

* assuming the input curies are MONDO:0005011 (Crohn disease) and MONDO:0005180 (Parkinsons)
* possible answers:
  * [LRRK2](https://www.nature.com/articles/s41531-021-00170-1)
  * but also, [it's not really clear if there's a genetic link](https://www.nature.com/articles/s41531-022-00318-7)
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14879166/CaseC_template1.json): SequenceVariant intermediate: Runs in 16 s, 1 result [rs2066842 in gene NOD2](https://www.ncbi.nlm.nih.gov/snp/rs2066842)
* 2nd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14879193/CaseC_template2.json): Gene intermediate. Runs in 1 min 55 s, 392 results, 1st result is LRRK2.

Notes on Case D: What "molecular mechanisms" could explain the link between SLC6A20 and susceptibility to COVID19? (gene - disease)

* assuming the input curies are NCBIGene:54716 (SLC6A20), [MONDO:0100096](https://monarchinitiative.org/MONDO:0100096) (COVID19)
  * SLC6A20 is the gene name, VS the protein product has multiple names (SIT1, sodium-dependent Imino Transporter 1 / System IMINO transporter, XTRP3)
* possible answers:
  * ACE2 (gene) - [2nd paragraph of this paper](https://www.nature.com/articles/s41421-023-00596-2)
  * [glycine](https://www.degruyter.com/document/doi/10.1515/bmc-2021-0017/html) (amino acid, chemical entity)
* 1st template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878738/CaseD_template1.json): 1 unconstrained intermediate. Runs in 2 min 26s, 38 results, top result is ACE2. Glycine is also there, as last result.
* 2nd template [(saved response)](https://github.com/biothings/biothings_explorer/files/14878784/CaseD_template2.json): chemical + gene/protein intermediates. Runs in 1 min 30s, 186 results. 6th result is the possible answer SLC6A20 ➡️ glycine ➡️ ACE2 ➡️ COVID19. Other results include glycine (1st, 3rd) or ACE2 (11th).


Template guidelines:

tokebe commented 4 months ago

> so one way to restate it is: for each template run, record (number of results) * (number of intermediate nodes in that template's QGraph, will probably be 1 or 2). Sum these together as templates are run. Stop when this number >= 500.

@colleenXu Agreed, this is a good heuristic.
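For reference, the heuristic can be sketched in a few lines of Python. This is only an illustration under assumptions: each template is represented as a hypothetical `(intermediate_node_count, run)` pair, where `run()` returns that template's results, and `run_until_cap` is a made-up name rather than anything in the query-handler code.

```python
from typing import Callable, Iterable, List, Tuple


def run_until_cap(
    templates: Iterable[Tuple[int, Callable[[], List[dict]]]],
    cap: int = 500,
) -> Tuple[List[dict], int]:
    """Run templates in order until the estimated final result count reaches the cap.

    Each template is (intermediate_node_count, run); both are hypothetical stand-ins.
    """
    collected: List[dict] = []
    estimate = 0
    for intermediate_count, run in templates:
        results = run()
        collected.extend(results)
        # Each template result expands into one final result per intermediate node.
        estimate += len(results) * intermediate_count
        if estimate >= cap:
            break  # later templates are never executed
    return collected, estimate
```

For example, a first template with 300 results and 1 intermediate node gives an estimate of 300; a second with 150 results and 2 intermediate nodes brings it to 600, so a third template would not be run.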

@rjawesome Please note that we've finalized our expectation for how "final" results should be generated when iterating over template results (as inferred-handler result support graphs). This is detailed on slides 9-13 in the above linked slides.

colleenXu commented 4 months ago

@rjawesome @tokebe

I have an update on point 1 of the "Important Considerations": I can't recreate the buggy behavior, so maybe things are fine?

The previous buggy behavior was: I set up a Pathfinder TRAPI query with two starting IDs/nodes that we shouldn't find any results for - but instead results were returned that connected only to the first starting ID/node. (ref: lab Slack convo starting here)

But I wasn't able to recreate this behavior using the current pathfinder-templates branch

1. Check out the query-handler pathfinder-templates branch, which also has the check on QNode IDs modified https://github.com/biothings/bte_trapi_query_graph_handler/commit/8deeada313e8547b43acb2074a95f6027eaa85e2 ([commit](https://github.com/biothings/bte_trapi_query_graph_handler/commit/7fe8c27d96b7303d7cd79126031b6f836b1caf83) originally from @tokebe's inferred-explain branch). Be on the main branches for all other modules.
2. Adjust the query-handler `data/templateGroups.json` array so it only contains the `"Pathfinder: Drug-Disease"` template-group object.
3. Set up BTE in "CI" mode (`pnpm build`, `API_OVERRIDE=true INSTANCE_ENV=ci pnpm run smartapi_sync`, then I use `INSTANCE_ENV=ci USE_THREADING=false pnpm start`)
4. Run this query: this is basically what the "artificial query graph" that goes into the inferred-mode-handler will look like = 1 QEdge with predicate related_to and knowledge_type inferred + both QNodes set to IDs.
5. BTE finds NO results, which is the expected behavior.

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["PUBCHEM.COMPOUND:5291"],
          "categories": ["biolink:ChemicalEntity"],
          "name": "imatinib"
        },
        "n1": {
          "ids": ["MONDO:0011821"],
          "categories": ["biolink:DiseaseOrPhenotypicFeature"],
          "name": "Meckel syndrome, type 3"
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}
```
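A minimal Python sketch of how such an "artificial query graph" could be derived from a 3-node Pathfinder query graph, assuming the keys of the two pinned nodes are already known. The function name `to_one_hop_query` and the edge key `"e0"` are choices made for this sketch, not the handler's actual code.

```python
import copy


def to_one_hop_query(query_graph: dict, start: str, end: str) -> dict:
    """Collapse a Pathfinder query graph into a single inferred edge between its pinned nodes."""
    nodes = query_graph["nodes"]
    return {
        "nodes": {
            # Deep copies so later mutation of the one-hop graph leaves the original intact.
            start: copy.deepcopy(nodes[start]),
            end: copy.deepcopy(nodes[end]),
        },
        "edges": {
            # "e0" is an arbitrary key chosen for this sketch.
            "e0": {
                "subject": start,
                "object": end,
                "predicates": ["biolink:related_to"],
                "knowledge_type": "inferred",
            }
        },
    }
```

The unpinned node and the two edges touching it are simply dropped here; they only matter again when the final results are formatted.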

But I did hit another bug, which didn't halt execution. I'll open another issue for it.

rjawesome commented 3 months ago

Functionality should be finished in the pathfinder branch of bte_query_graph_handler repo. I did want to add in a few tests / check over the code a bit more before making a PR. However, it would be a good idea to make sure that the functionality is implemented correctly.

Current test query that I have been using

```json
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["PUBCHEM.COMPOUND:5291"],
                    "categories": ["biolink:Drug"]
                },
                "un": {
                    "categories": ["biolink:NamedThing"]
                },
                "n2": {
                    "ids": ["MONDO:0004979"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "un",
                    "predicates": ["biolink:related_to"],
                    "knowledge_type": "inferred"
                },
                "e1": {
                    "subject": "un",
                    "object": "n2",
                    "predicates": ["biolink:related_to"],
                    "knowledge_type": "inferred"
                },
                "e2": {
                    "subject": "n0",
                    "object": "n2",
                    "predicates": ["biolink:related_to"],
                    "knowledge_type": "inferred"
                }
            }
        }
    }
}
```

tokebe commented 3 months ago

@colleenXu @rjawesome I've done a brief code review of the branch, the execution looks pretty straightforward and good to me, so it's on to testing.

There are a couple of notes, which might be better discussed in a draft PR:

rjawesome commented 3 months ago

  1. I think I added that type when I was working on it earlier but it is no longer needed. I removed it.
  2. Should be addressed in the latest commits.

colleenXu commented 3 months ago

@rjawesome @tokebe

I checked out the pathfinder branch and I can't successfully build. Perhaps the issue is that this branch isn't merged with the latest main?

Here's the error

```
@biothings-explorer/query_graph_handler:build: > @biothings-explorer/query_graph_handler@1.18.0 build /Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler
@biothings-explorer/query_graph_handler:build: > tsc -b
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build: src/batch_edge_query.ts:1:20 - error TS2614: Module '"@biothings-explorer/call-apis"' has no exported member 'RedisClient'. Did you mean to use 'import RedisClient from "@biothings-explorer/call-apis"' instead?
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build: 1 import call_api, { RedisClient } from '@biothings-explorer/call-apis';
@biothings-explorer/query_graph_handler:build:   ~~~~~~~~~~~
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build: Found 1 error.
@biothings-explorer/query_graph_handler:build:
@biothings-explorer/query_graph_handler:build:  ELIFECYCLE  Command failed with exit code 1.
@biothings-explorer/query_graph_handler:build: ERROR: command finished with error: command (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler) /Users/colleenxu/Library/pnpm/pnpm run build exited (1)
@biothings-explorer/query_graph_handler#build: command (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler) /Users/colleenxu/Library/pnpm/pnpm run build exited (1)

 Tasks:    9 successful, 10 total
Cached:    9 cached, 10 total
  Time:    7.4s
Failed:    @biothings-explorer/query_graph_handler#build

 ERROR  run failed: command  exited (1)
 ELIFECYCLE  Command failed with exit code 1.
```

colleenXu commented 3 months ago

@rjawesome @tokebe

I've added support for example/cases 2 (chem - gene) and 3 (disease - disease):

That branch also has an update to one of the earlier templates (commit) and is merged with the latest main.

tokebe commented 3 months ago

@rjawesome You'll have to pull in the latest from the main branch and fix any merge conflicts

rjawesome commented 3 months ago

@colleenXu The main branch should be merged now, which fixes the RedisClient error. Also, the new templates from pathfinder-templates have been merged into pathfinder

tokebe commented 3 months ago

@rjawesome: @colleenXu and I ran some testing on the imatinib-asthma example, and we're seeing some odd behavior:

colleenXu commented 3 months ago

@rjawesome

This goes with Jackson's comment above. I think it's easiest to understand visually w/ screenshots. I'm comparing the pathfinder run to running just the template that it's using. Here's the full response jsons for both, which I viewed in a json-viewer and in ARAX-UI (import -> response):

(Thankfully, this example query is pretty simple: 1 template ran, this template provides unique, single intermediate nodes in each result. So there's a 1-to-1 match between final pathfinder results and the template's results)

Point 1 example: Everything related to this intermediate node should have been pruned, but it's all still there

This is the bottom result for the template. This intermediate node (FBLN5, NCBIGene:10516) should be removed from the KG, as well as the stuff associated with it (edges + aux-graphs that are unique to this intermediate node's pathfinder result, both the original template stuff and the pathfinder-constructed stuff).

![Screen Shot 2024-04-04 at 2 17 09 PM](https://github.com/biothings/biothings_explorer/assets/43731687/3a515ff6-ce95-4cbc-82fb-bc2952f7053e)

But they're still there in pathfinder-response:

A KG Node

![Screen Shot 2024-04-04 at 2 20 53 PM](https://github.com/biothings/biothings_explorer/assets/43731687/5a33cfb4-518b-4960-b7b8-1018d87f85a2)

A normal edge (from template)

![Screen Shot 2024-04-04 at 2 21 18 PM](https://github.com/biothings/biothings_explorer/assets/43731687/746c5451-1a54-4f1a-910f-8db7ec672cbd)

Pathfinder edges and aux-graphs

![Screen Shot 2024-04-04 at 2 25 01 PM](https://github.com/biothings/biothings_explorer/assets/43731687/a19725fa-8fe1-417e-9879-4b3d896f87af)

![Screen Shot 2024-04-04 at 2 25 24 PM](https://github.com/biothings/biothings_explorer/assets/43731687/d6512f0e-464b-446d-a574-b0ce3fa33183)

Point 2 example: pathfinder support-graph issues

This is showing the first template's result, with the intermediate node KIT (NCBIGene:3815). We expect e0's support-graph to include all the edges from imatinib (n0) to KIT (un), e1 to include the edge from KIT to asthma (n2), and e2 to include all the edges in this result.

![Screen Shot 2024-04-04 at 2 28 04 PM](https://github.com/biothings/biothings_explorer/assets/43731687/83335b1c-937f-4109-9e55-bc977ec3295b)

So then we look at the first pathfinder result...

![Screen Shot 2024-04-04 at 2 32 05 PM](https://github.com/biothings/biothings_explorer/assets/43731687/f60d7696-8735-408c-b928-91293f7b0c11)

e0 has all the edges in the result (which is what we wanted for e2)

![Screen Shot 2024-04-04 at 2 33 12 PM](https://github.com/biothings/biothings_explorer/assets/43731687/fb3bac3d-a4d0-446a-85b4-1bd39cbd8e03)

e1 has no edges (empty array)

![Screen Shot 2024-04-04 at 2 34 08 PM](https://github.com/biothings/biothings_explorer/assets/43731687/c709053f-040b-4cd0-9109-442408667d72)

![Screen Shot 2024-04-04 at 2 34 28 PM](https://github.com/biothings/biothings_explorer/assets/43731687/34716440-60e1-4880-92e4-5253358fab65)

e2 has a ton of support graphs

![Screen Shot 2024-04-04 at 2 35 16 PM](https://github.com/biothings/biothings_explorer/assets/43731687/7beb0917-8b17-4751-85b1-33e33de8222a)

rjawesome commented 3 months ago

Pruning has been added to pathfinder. Intermediate edges (e0/e1 from Jackson's test) and main edge (e2 from Jackson's test) have been updated so their auxiliary graphs should be correct now.
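For anyone checking the pruning behavior, here is a rough Python sketch of the general idea, not the code in the pathfinder branch. It assumes a TRAPI message with `results`, `knowledge_graph`, and `auxiliary_graphs`, keeps only what is reachable from the retained results' bindings (following `biolink:support_graphs` attributes recursively), and ignores analysis-level support graphs. The function name `prune_message` is hypothetical.

```python
from typing import Set


def prune_message(message: dict) -> dict:
    """Drop knowledge-graph nodes, edges and auxiliary graphs that no result reaches."""
    kg = message["knowledge_graph"]
    aux_graphs = message.get("auxiliary_graphs", {})
    keep_nodes: Set[str] = set()
    keep_edges: Set[str] = set()
    keep_aux: Set[str] = set()
    pending = []
    for result in message["results"]:
        for bindings in result["node_bindings"].values():
            keep_nodes.update(b["id"] for b in bindings)
        for analysis in result["analyses"]:
            for bindings in analysis["edge_bindings"].values():
                pending.extend(b["id"] for b in bindings)
    # Follow biolink:support_graphs attributes so nested support graphs are kept too.
    while pending:
        edge_id = pending.pop()
        if edge_id in keep_edges or edge_id not in kg["edges"]:
            continue
        keep_edges.add(edge_id)
        edge = kg["edges"][edge_id]
        keep_nodes.update((edge["subject"], edge["object"]))
        for attribute in edge.get("attributes") or []:
            if attribute.get("attribute_type_id") != "biolink:support_graphs":
                continue
            for aux_id in attribute["value"]:
                if aux_id in aux_graphs and aux_id not in keep_aux:
                    keep_aux.add(aux_id)
                    pending.extend(aux_graphs[aux_id]["edges"])
    kg["nodes"] = {k: v for k, v in kg["nodes"].items() if k in keep_nodes}
    kg["edges"] = {k: v for k, v in kg["edges"].items() if k in keep_edges}
    message["auxiliary_graphs"] = {k: v for k, v in aux_graphs.items() if k in keep_aux}
    return message
```

Under this approach, an intermediate node whose result was truncated disappears along with any edges and aux-graphs that only that result referenced.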

colleenXu commented 3 months ago

@rjawesome @tokebe

I've added support for the last example 4/D (gene - disease):

Should I make template adjustments directly in the pathfinder branch from now on?

tokebe commented 3 months ago

> Should I make template adjustments directly in the pathfinder branch from now on?

@colleenXu Yes, I think that makes sense. It shouldn't cause any merge issues with any work done to code in the branch.

tokebe commented 3 months ago

@rjawesome I've reviewed your changes and each result edge looks nearly correct now. I see only one remaining problem -- the now correctly-aux-graph'd e2 has its subject and object as n0->un when it should be n0->n2 (even though the edge is bound correctly in the result).

rjawesome commented 3 months ago

e2's aux graph has now been fixed. I've also added some more tests around this behavior.

colleenXu commented 3 months ago

@tokebe @rjawesome

I think we are preserving the support-graph info for subclass-edges correctly.

However, it's not showing up properly in the ARAX-UI. This is happening both for our "normal" creative-mode and our pathfinder responses. It's odd because I recall this stuff showing up properly in the past.

Example from normal creative-mode

[Saved response](https://github.com/biothings/biothings_explorer/files/14892715/normal-creative-Acanthosis-nigricans.json) from running "treats"-creative mode for MONDO:0007035 (Acanthosis nigricans). The 4th result has a top-level creative-support-graph.

![Screen Shot 2024-04-05 at 11 53 07 PM](https://github.com/biothings/biothings_explorer/assets/43731687/bce89773-5ee7-4934-a1f2-1ea605832feb)

When I go into that support-graph and then look at the pheno edges, all should have support-graphs based on their IDs. Instead, no info is shown - not even source info.

![Screen Shot 2024-04-05 at 11 52 17 PM](https://github.com/biothings/biothings_explorer/assets/43731687/b9495311-c41a-4529-8473-37baf472dbae)

Example from pathfinder Case A (imatinib-asthma)

The 5th result in the [template run](https://github.com/biothings/biothings_explorer/files/14878837/CaseA_template1.json) is PDGFRA. When you look at that template's run in ARAX-UI, you can see the support-graph/source info for one of the PDGFRA->asthma edges.

![Screen Shot 2024-04-05 at 11 11 49 PM](https://github.com/biothings/biothings_explorer/assets/43731687/249b2e63-6c55-429e-b772-29e1af58eba4)

But if you look at the 5th pathfinder result in ARAX-UI [(saved response)](https://github.com/biothings/biothings_explorer/files/14892656/CaseAPathfinder.json), that same edge now doesn't show any info.

![Screen Shot 2024-04-05 at 11 20 34 PM](https://github.com/biothings/biothings_explorer/assets/43731687/49723765-fee0-40a2-b2dd-4fd76a197188)

When I dig into the pathfinder json, all the info for this subclass-edge/its linked support-graph seems to exist and be properly formatted.

The subclass edge

```
"NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass": {
  "predicate": "biolink:gene_associated_with_condition",
  "subject": "NCBIGene:5156",
  "object": "MONDO:0004979",
  "attributes": [
    {
      "attribute_type_id": "biolink:support_graphs",
      "value": [
        "support0-NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass"
      ]
    }
  ],
  "sources": [
    {
      "resource_id": "infores:biothings-explorer",
      "resource_role": "primary_knowledge_source"
    }
  ]
},
```

the subclass support-graph

```
"support0-NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass": {
  "edges": [
    "13aa493dafd322cb77c438173de6abd4",
    "expanded-MONDO:0005405-subclass_of-MONDO:0004979"
  ]
},
```

The support-graph's edges + subclass-disease node

Gene to subclass-disease

```
"13aa493dafd322cb77c438173de6abd4": {
  "predicate": "biolink:gene_associated_with_condition",
  "subject": "NCBIGene:5156",
  "object": "MONDO:0005405",
  "attributes": [
    {
      "attribute_type_id": "biolink:publications",
      "value": ["PMID:16804324"],
      "value_type_id": "linkml:Uriorcurie"
    }
  ],
  "sources": [
    {
      "resource_id": "infores:disgenet",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:mydisease-info",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": ["infores:disgenet"]
    },
    {
      "resource_id": "infores:biothings-explorer",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": ["infores:mydisease-info"]
    }
  ]
},
```

subclass-disease to main-disease

```
"expanded-MONDO:0005405-subclass_of-MONDO:0004979": {
  "predicate": "biolink:subclass_of",
  "subject": "MONDO:0005405",
  "object": "MONDO:0004979",
  "attributes": [],
  "sources": [
    {
      "resource_id": "infores:mondo",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:biothings-explorer",
      "resource_role": "aggregator_knowledge_source"
    }
  ]
},
```

subclass-disease node exists as well

```
"MONDO:0005405": {
  "categories": ["biolink:Disease"],
  "name": "childhood onset asthma",
  "attributes": [
    {
      "attribute_type_id": "biolink:xref",
      "value": [
        "MONDO:0005405",
        "DOID:0080815",
        "UMLS:C0264408",
        "MEDDRA:10081274",
        "SNOMEDCT:233678006"
      ]
    },
    {
      "attribute_type_id": "biolink:synonym",
      "value": [
        "childhood onset asthma",
        "childhood-onset asthma",
        "Childhood asthma"
      ]
    }
  ]
},
```
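The manual check above (edge -> `biolink:support_graphs` attribute -> auxiliary graph -> its edges -> their nodes) can be automated. The Python sketch below is one way to do it, assuming the response's `knowledge_graph` and `auxiliary_graphs` objects are already loaded as dicts; `missing_support_references` is a hypothetical helper name, not part of BTE or ARAX.

```python
from typing import List


def missing_support_references(knowledge_graph: dict, auxiliary_graphs: dict, edge_id: str) -> List[str]:
    """List everything an edge's support graphs reference that is absent from the message."""
    problems = []
    edge = knowledge_graph["edges"].get(edge_id)
    if edge is None:
        return [f"edge {edge_id}"]
    for attribute in edge.get("attributes") or []:
        if attribute.get("attribute_type_id") != "biolink:support_graphs":
            continue
        for aux_id in attribute["value"]:
            aux_graph = auxiliary_graphs.get(aux_id)
            if aux_graph is None:
                problems.append(f"auxiliary graph {aux_id}")
                continue
            for support_edge_id in aux_graph["edges"]:
                support_edge = knowledge_graph["edges"].get(support_edge_id)
                if support_edge is None:
                    problems.append(f"edge {support_edge_id}")
                    continue
                # Both endpoints of a supporting edge should exist as KG nodes.
                for node_id in (support_edge["subject"], support_edge["object"]):
                    if node_id not in knowledge_graph["nodes"]:
                        problems.append(f"node {node_id}")
    return problems
```

An empty list for the subclass edge would confirm that the response itself is internally consistent, which points toward the display issue being on the UI side rather than in the JSON.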

colleenXu commented 3 months ago

This post will be recording what tests I'm running, the response-jsons, basic response stats, and other notes. I'll raise errors/problems in separate comments.

Basic tests


Different starting query topologies (does it correctly throw error or continue execution):

* Only two edges (correct error)
* Different edge directions (correct error)
* Different node/edge labels (correct continues execution)
* Don't include categories on starting ID nodes (correct continues execution)

imatinib -> Meckel syndrome, type 3 (MONDO:0011821): (chem - disease) NEGATIVE CONTROL from [previous comment](https://github.com/biothings/biothings_explorer/issues/794#issuecomment-2014500106)

* runs in 29s
* 0 results! after running all 3 templates.

Cases

Noting my possible answers and Sui's possible answers.

Case A (asthma) is an example of truncating the 1st template's results to get a 500 result set.

Case A (allergic asthma) and D have results/intermediate nodes that were found in multiple templates (showing that the merging code worked as-intended).

2 Case A (chem - disease) examples

imatinib (PUBCHEM.COMPOUND:5291) -> asthma (MONDO:0004979) [(saved response)](https://github.com/biothings/biothings_explorer/files/14892656/CaseAPathfinder.json):
* runs in 2 min 11s
* 500 results
* Only runs 1st template and prunes extra template results
* found Sui's possible answers
  * KIT: top result
  * SCF (aka KITLG, KIT ligand): 359th result

imatinib -> allergic asthma (MONDO:0004784) [(saved response)](https://github.com/biothings/biothings_explorer/files/14892782/CaseAPathfinder-allergicAsthma.json):
* runs in 1 min 38s
* 419 results
* Runs all 3 templates, results only from 1st and third. Doesn't prune any template results.
* found Sui's possible answers
  * KIT: 15th result
  * SCF (aka KITLG, KIT ligand): 279th result
* found my possible answers
  * immune response: 3rd result

Case B (chemical - gene) - currently running only 1 template

Resveratrol (PUBCHEM.COMPOUND:445154) -> glyoxalase, GLO1 (NCBIGene:2739) [(saved response)](https://github.com/biothings/biothings_explorer/files/15014937/CaseBPathfinder_simple.json)
* runs in 32s
* 84 results
* Runs only 1 simple template
* found 1 of Sui's possible answers
  * NFE2L2: 3rd result

Case C (disease - disease)

Crohn Disease (MONDO:0005011) -> Parkinson Disease (MONDO:0005180) [(saved response)](https://github.com/biothings/biothings_explorer/files/14892832/CaseCPathfinder.json)
* runs in 2 min 22s
* 393 results
* Runs both templates, results from both. Doesn't prune any template results.
* found all Sui's possible answers? (I'm not sure if Sui meant MOD2 gene or NOD2 gene. We have NOD2 variant rs2066842 as top result + NOD2 gene as 7th result)
  * LRRK2: 2nd result
  * PARK7: 3rd result

Case D (gene - disease)

SLC6A20 (NCBIGene:54716) -> COVID19 (MONDO:0100096) [(saved response)](https://github.com/biothings/biothings_explorer/files/14939017/CaseDPathfinder.json)
* runs in 4 min 7s
* 116 results
* Runs both templates, results from both. Doesn't prune any template results.
* found Sui's possible answers
  * ACE2: top result (graph includes glycine)
  * CXCL8: result 9
* found my possible answers
  * glycine: 3rd result

colleenXu commented 3 months ago

@rjawesome @tokebe

A problem: pathfinder doesn't find templates for Case B (chem - gene). I'm not sure what's going on.

Query I'm using

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["PUBCHEM.COMPOUND:445154"],
          "categories": ["biolink:ChemicalEntity"],
          "name": "Resveratrol"
        },
        "un": {
          "categories": ["biolink:NamedThing"]
        },
        "n2": {
          "ids": ["NCBIGene:2739"],
          "categories": ["biolink:Gene"],
          "name": "glyoxalase, GLO1"
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}
```

tokebe commented 3 months ago

@colleenXu I'll be working on the pathfinder prototype this week as Rohan is unavailable.

Regarding ARAX UI problems, that might be worth reporting to them -- otherwise it's a good note that we should trust our own JSON analysis first.

I'll take a look into the Case B issue.

colleenXu commented 3 months ago

Reported in Translator architecture channel (link here)

colleenXu commented 3 months ago

@tokebe Whoops I didn't set the pathfinder flag on the Case B template group. Added this in a recent commit. Haven't analyzed the behavior yet though.

colleenXu commented 3 months ago

There's still a problem running Case B. The 2nd template runs quickly (1 min 1s), but returns a lot of results (4472). Inferred-mode then seems to get stuck on "merging" all of the results into 1 mega-result/creative-edge - it may take ~ 1 hour? And then Pathfinder also seems to get stuck finding the intermediate nodes (I didn't wait for it to complete).

I was thinking of Case B as testing multiple things that don't happen with the other cases:


@tokebe For tomorrow's deployments, I've made a branch pathfinder-simpleCaseB that doesn't use the 2nd chem-gene template. BTE will then successfully run the chem-gene example (CaseB) - but it won't find much.

rjawesome commented 3 months ago

Case B should be fixed. There was an unnecessary while loop in the inferred mode handler that was causing the issues. For the intermediate nodes, the "paths" involved were getting too long, so I changed it so that each "path" only uses edges from one template result (i.e. each path only includes one pair of intermediate genes), while each intermediate node merges all the "paths" that include it. Previously the paths were getting too long by combining many edges from different template results.
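A minimal sketch of the grouping described above, under simplifying assumptions: each template result is treated as one linear path of node IDs from the start node to the end node, and the function name `group_paths_by_intermediate` is hypothetical. BTE's actual implementation works on TRAPI results and support graphs, so this only illustrates the idea that paths never span multiple template results while each intermediate node collects every path that passes through it.

```python
from collections import defaultdict


def group_paths_by_intermediate(template_paths):
    """Map each intermediate node to the paths (one per template result)
    that pass through it.

    template_paths: list of node-ID lists, each ordered start -> ... -> end.
    Hypothetical helper for illustration only.
    """
    paths_by_node = defaultdict(list)
    for path in template_paths:
        # Skip the pinned endpoints; everything in between is intermediate.
        for node in path[1:-1]:
            if path not in paths_by_node[node]:
                paths_by_node[node].append(path)
    return dict(paths_by_node)


# Two template results that share one intermediate gene.
paths = [
    ["chem", "geneA", "geneB", "gene_target"],
    ["chem", "geneA", "geneC", "gene_target"],
]
grouped = group_paths_by_intermediate(paths)
# grouped["geneA"] holds both paths; geneB and geneC hold one path each.
```

Because every path comes from a single template result, path length is bounded by the template's hop count, regardless of how many results share a node.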

colleenXu commented 2 months ago

Note:

In the Translator Architecture 4/23 call, the UI team said they'll handle "4-hop paths" (aka 4 edges long).

I think we'll stay at/under that limit with our current Pathfinder templates. All are 2-3 QEdges long.

There's 1 potential case where BTE would generate 5-edge paths: if it ran the 2nd/3rd "Chem-Disease" templates (3 QEdges) and results involved descendants of both the chemical and the disease starting-ID (+2 subclass_of edges). However, I think it's relatively rare for us to do subclass-expansion on chemical starting-IDs.

tokebe commented 2 months ago

@rjawesome Does your optimization change the output at all?

rjawesome commented 2 months ago

It basically just limits the length of result "paths," so it doesn't compute graphs that have more hops than what is specified in the template (excluding subclass hops).

tokebe commented 2 months ago

So, if I'm understanding correctly, you've changed the implementation to be more like that specified in the slides (building the new aux graphs by iterating over each template result), whereas before you were merging multiple template results and then performing a DFS on them?

If not, can you briefly describe the steps in your current implementation, comparing them to the approach in the slides?

rjawesome commented 2 months ago

> So, if I'm understanding correctly, you've changed the implementation to be more like that specified in the slides (building the new aux graphs by iterating over each template result), whereas before you were merging multiple template results and then performing a DFS on them?

Yes.

colleenXu commented 2 months ago

@tokebe @rjawesome

I think there's a problem! The new code is giving different output, with fewer results, missing KG edges, and different scores.

I saw this with Case A allergic asthma:

Actual Pathfinder TRAPI query

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["PUBCHEM.COMPOUND:5291"],
          "categories": ["biolink:ChemicalEntity"],
          "name": "imatinib"
        },
        "un": {
          "categories": ["biolink:NamedThing"]
        },
        "n2": {
          "ids": ["MONDO:0004784"],
          "categories": ["biolink:DiseaseOrPhenotypicFeature"],
          "name": "allergic asthma"
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}
```

Here's what I found when digging in:

Expand to see logs

```
bte:biothings-explorer-trapi:inferred-mode pruning creative combinedResponse nodes/edges... +0ms
bte:biothings-explorer-trapi:inferred-mode pruned 75 nodes, 246 edges, 0 auxGraphs from combinedResponse. +4ms
bte:biothings-explorer-trapi:pathfinder [Pathfinder]: Performing search for intermediate nodes. +2m
bte:biothings-explorer-trapi:pathfinder [Pathfinder]: Pathfinder found 344 intermediate nodes and created 1032 support graphs. +28ms
bte:biothings-explorer-trapi:inferred-mode pruning creative combinedResponse nodes/edges... +32ms
bte:biothings-explorer-trapi:inferred-mode pruned 0 nodes, 1 edges, 386 auxGraphs from combinedResponse. +7ms
```

Screenshots of AKT1 result showing missing edge/diff score

Previous run:

![Screen Shot 2024-05-01 at 11 08 49 PM](https://github.com/biothings/biothings_explorer/assets/43731687/e656e823-53c0-45e8-a628-05592a68bb8e)

Current run:

![Screen Shot 2024-05-01 at 11 08 42 PM](https://github.com/biothings/biothings_explorer/assets/43731687/2a79dfbc-a55f-4d37-8af2-5d00c1b2f000)

rjawesome commented 2 months ago

I accidentally introduced a bug when speeding up the while loop that assigned support graph suffixes in the inferred mode handler. Should be fixed now.

colleenXu commented 2 months ago

It looks good!

EDIT: First, I reran all the "working" cases (not Case B).

For Case A allergic asthma (new saved response), I now see the same number of results (419), KG nodes and edges, and aux-graphs as before. And for all the cases, the interesting results from before are still present.

I see some differences between the runs now and the previous runs, but I think these are okay:

Some cases ran faster than before:

Other cases ran slower than before:

colleenXu commented 2 months ago

@tokebe @rjawesome

Something else is going on with Pathfinder and Case B, and I can't tell if it's okay or a sign of a truncation problem.

The good news is that it now ran both templates in 2 min 16s (much better than running forever!). As a reminder, the second template returns >4000 results (>1000 nodes and >7000 edges) that need truncating.

Here's a Google Drive folder w/ my Case B Pathfinder run and an old run of the 2nd template I'm comparing it to (it's not an exact match to the Pathfinder's 2nd template run, but I think it's close enough for what I want to demonstrate).

What I'm seeing: while there are only 500 results in the Pathfinder run...

I didn't notice any truncation issues for Case A asthma and Case C (see my previous notes).

tokebe commented 2 months ago

There should definitely be a large number of nodes and edges that aren't bound to a result directly -- we'd expect a lot of nodes that are exclusively bound to an edge used in a support graph for an edge bound to a result, which could leave a lot of extra nodes and edges that don't have an immediately obvious reason for existing.

It could still be the case that some nodes and edges aren't properly truncated. I think the only way we can meaningfully check this is by writing a script that parses a response and checks that every node/edge somehow links (directly or indirectly) to a result. It would have to start with the results and work its way out, building lists of bound edges/nodes/support graph IDs, and then check those lists against the actual KG and support graph set. @rjawesome could you put together such a script? We'd probably want to adapt it to an integration test later, so it would see use beyond just checking this one time.
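A minimal sketch of such a check, assuming a TRAPI 1.4-style layout: results carry `node_bindings` and `analyses[].edge_bindings`, KG edges reference support graphs through a `biolink:support_graphs` attribute, and `auxiliary_graphs[id].edges` lists KG edge IDs. The function name `find_unreachable` is hypothetical, and this is not the test that was later added to the repository.

```python
def find_unreachable(message):
    """Return KG nodes, KG edges, and aux graphs not reachable from any result."""
    kg = message.get("knowledge_graph") or {}
    kg_nodes = kg.get("nodes") or {}
    kg_edges = kg.get("edges") or {}
    aux_graphs = message.get("auxiliary_graphs") or {}

    seen_nodes, seen_edges, seen_aux = set(), set(), set()
    edge_queue, aux_queue = [], []

    # Seed the walk with everything bound directly to a result.
    for result in message.get("results") or []:
        for bindings in result.get("node_bindings", {}).values():
            seen_nodes.update(b["id"] for b in bindings)
        for analysis in result.get("analyses") or []:
            for bindings in analysis.get("edge_bindings", {}).values():
                edge_queue.extend(b["id"] for b in bindings)
            aux_queue.extend(analysis.get("support_graphs") or [])

    # Alternate between aux graphs and edges until nothing new turns up.
    while edge_queue or aux_queue:
        while aux_queue:
            aux_id = aux_queue.pop()
            if aux_id in seen_aux:
                continue
            seen_aux.add(aux_id)
            edge_queue.extend(aux_graphs.get(aux_id, {}).get("edges") or [])
        while edge_queue:
            edge_id = edge_queue.pop()
            if edge_id in seen_edges:
                continue
            seen_edges.add(edge_id)
            edge = kg_edges.get(edge_id)
            if edge is None:  # dangling reference; nothing to follow
                continue
            seen_nodes.update((edge["subject"], edge["object"]))
            for attr in edge.get("attributes") or []:
                if attr.get("attribute_type_id") == "biolink:support_graphs":
                    aux_queue.extend(attr.get("value") or [])

    return {
        "nodes": set(kg_nodes) - seen_nodes,
        "edges": set(kg_edges) - seen_edges,
        "aux_graphs": set(aux_graphs) - seen_aux,
    }
```

To use it on a saved response, something like `find_unreachable(json.load(open(path))["message"])` should work; three empty sets would mean every node, edge, and support graph links back to some result.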

rjawesome commented 2 months ago

I added a test here for pathfinder in particular: https://github.com/biothings/bte_trapi_query_graph_handler/blob/894bbb0e53148035ab73cd44ca4f22e3af5e6fb1/__test__/unittest/pathfinder.test.ts#L103-L146

If `pfResponse` were read from a file, this could function as a "script" to check any given TRAPI response.

tokebe commented 2 months ago

Did some messing around with @rjawesome's test to make a script and was able to confirm that yes, pruning is working as expected. Case B just creates huge support graphs which results in many many edges.