Create a way to export the query execution plan

newgene commented 3 years ago

Similar to explain option for many database server, can we create an optional explain parameter, which will only return the planned API request execution plan, without retrieving the API response.

Before we actually implement this, we need to check if the current logs output can already serve the purpose (but I personally find very hard to navigate the entire log msgs).

It may be a bit complicated with the new iterative query execution algorithm, but the execution plan can be made without going through that process, just output the possible API requests by each edges (without specifying the actual execution order).

newgene commented 3 years ago

assigning this to @tokebe and @marcodarko. Can start with @tokebe to look into what's needed at call-apis package, then work with @marcodarko to see if anything is needed from the new query algorithm code at query_graph_handler package.

tokebe commented 2 years ago

@newgene @marcodarko On the call-apis side of things, all that's really needed is a new class function or two which returns the created axios query configs, which can then either be returned as-is or transformed for clarity in the response. I could make a PR with that for @marcodarko to look at, and then we can go from there?

colleenXu commented 2 years ago

My understanding is that the algorithm actually works (decides the order of execution) using the number of IDs in each step (as nodes get populated after one-hops).....

It would change the algorithm to do it based off of the number of operations possible between hops (w/o considering the number of IDs?).

colleenXu commented 2 years ago

tokebe commented 2 years ago

Relevant code:

_edgeSupported runs through query edges (making assumptions that edges get results) to ensure that smartapi edges exist for each query edge. This general pattern could be extended to implement this issue?
query.js in call-apis would appear to have some implementation regarding dryrun_only which may be relevant/worth extending for this purpose.

ericz1803 commented 2 years ago

@tokebe was right, _edgeSupported already sort of does a dry run by running through the query edges and getting the smartapi edges. Using this, I logged info about each edge as well as the individual queries made.

@newgene @colleenXu @tokebe Is this the format we are looking for/is there any other info that should be getting logged?

...
{
    "timestamp": "2022-01-31T07:12:20.195Z",
    "level": "DEBUG",
    "message": "Edge e02: biolink:Gene > biolink:SmallMolecule",
    "code": null
},
{
    "timestamp": "2022-01-31T07:12:20.195Z",
    "level": "DEBUG",
    "message": "APIs called: Multiomics Wellness KP API,BioThings SEMMEDDB API,MyChem.info API,Text Mining Targeted Association API,BioThings DGIdb API,Automat Pharos (trapi v-1.2.0),Automat Cord19 (trapi v-1.2.0),Automat HMDB (trapi v-1.2.0),Automat Gtopdb (trapi v-1.2.0),Automat Hetio (trapi v-1.2.0),Automat CTD (trapi v-1.2.0) (181 queries total)",
    "code": null
},
{
    "timestamp": "2022-01-31T07:12:20.195Z",
    "level": "DEBUG",
    "message": "Multiomics Wellness KP API: Gene > correlated_with > SmallMolecule",
    "code": null
},
...

tokebe commented 2 years ago

@ericz1803 Just to clarify, the logs will list all APIs queried for a query edge, and then for each of those APIs list the specific sub-query edge? This sounds pretty good to me. @newgene and @colleenXu may have other specific intentions/requirements.

I definitely think using the existing TRAPI log format is the best way to do this as it keeps our response nice and TRAPI-compliant. I'd suggest changing the log levels to INFO to reflect that this is information specifically for/requested by the user, as it should make parsing out the execution plan logs from any potential debug logs simpler.

colleenXu commented 2 years ago

In the example above, it looks like all 181 sub-queries will have a log describing the semantics of the operation it's doing....which feels like too much.

Some thoughts:

I thought the main goal was to know the query-execution plan WITHOUT actually waiting the seconds-minutes for the whole query to execute.
BTE does walk thru the execution to make sure we have operations/metaKG edges for each edge. The goal is then to ONLY DO THIS STEP AND THEN STOP EXECUTION. And then probably give a TRAPI-compliant response with an empty object for message.knowledge_graph.nodes / message.knowledge_graph.edges and an empty list for message.results. The info we'd want would be in the logs.
the logs would at minimum give the edges in order of execution (this can be in 1 log total or 1 log per edge) and whether each edge is executed in the reverse direction or not.
I think it'd not necessary, but nice to include the logs for:
- which APIs aren't available during execution (one of the early current logs)
- which APIs are used for sub-querying for each edge's execution / number of sub-queries possible

tokebe commented 2 years ago

Just to respond to what I can here:

This doesn't appear to actually execute the queries, it just assembles them and logs the planned API and edge -- I should have been more clear in my previous comment.
I agree that it may be a bit much to list every separate API call and the association to query, however it could prove valuable in some debugging cases. Perhaps we might want to include an additional parameter to specify verbosity in that respect?
- We definitely, at minimum, want the edge execution order, if it's reversed, AND the list of APIs to be called, as that's the minimum original intent of this issue. I think that some way for the user to specify if they want the individual sub-query plans would be a significant nice-to-have.

ericz1803 commented 2 years ago

Addressing a few things:

I thought the main goal was to know the query-execution plan WITHOUT actually waiting the seconds-minutes for the whole query to execute.

@colleenXu Yup, just to clarify, it doesn't actually query it, it just gets the APIs for each edge.

which APIs are used for sub-querying for each edge's execution / number of sub-queries possible

I only included the log for 1 of them, but for each edge, it logs every sub-query it will make as well

which APIs aren't available during execution (one of the early current logs)

This message is still logged (I just didn't include it in the snippet)

I'd suggest changing the log levels to INFO

Will do for sure

Perhaps we might want to include an additional parameter to specify verbosity in that respect?

@tokebe Do you know if the logging system right now has levels of verbosity? I could put the execution summary at the INFO level and the subquery stuff at a level above INFO and the user could choose which level of verbosity they want.

We definitely, at minimum, want the edge execution order, if it's reversed, AND the list of APIs to be called, as that's the minimum original intent of this issue.

[x] list of APIs
[x] reversed: I'll add this to the logs (easy to do)
[x] edge execution order: I'll implement @colleenXu's idea of doing entity count based on number of APIs in an edge (shouldn't be too difficult)

colleenXu commented 2 years ago

@ericz1803 I'm a bit confused. Comparing my comment to this https://github.com/biothings/BioThings_Explorer_TRAPI/issues/278#issuecomment-1025447567, my perspective is:

Jackson and Eric have both confirmed that this work will not involve actually executing the whole query. Great, that was just a check
To confirm, we're discussing edge-management (ex: that e2 (reverse) is executed first, and then e1) and we aren't discussing sub-queries (the order that APIs are queried)....
- I thought this example log already showed "which APIs are used for sub-querying for each edge's execution / number of sub-queries possible". Perhaps the list of APIs and the "181 queries total" phrase are referring to the entire query and not a single QEdge?
```
{
"timestamp": "2022-01-31T07:12:20.195Z",
"level": "DEBUG",
"message": "APIs called: Multiomics Wellness KP API,BioThings SEMMEDDB API,MyChem.info API,Text Mining Targeted Association API,BioThings DGIdb API,Automat Pharos (trapi v-1.2.0),Automat Cord19 (trapi v-1.2.0),Automat HMDB (trapi v-1.2.0),Automat Gtopdb (trapi v-1.2.0),Automat Hetio (trapi v-1.2.0),Automat CTD (trapi v-1.2.0) (181 queries total)",
"code": null
},
```

colleenXu commented 2 years ago

maybe this will help? The pink arrows show what I'm thinking about regarding "edge execution order". Screen Shot 2022-01-31 at 5 28 04 PM

And this is the query I was using (it takes about 1 min 17 sec to run). It's a simpler version of D2 and might be good for testing this functionality. The C2/C3 queries are also good for testing but are much more complicated.


{
  "message": {
    "query_graph": {
      "edges": {
        "e00": {
          "subject": "n00",
          "object": "n01"
        },
        "e01": {
          "subject": "n01",
          "object": "n02"
        }
      },
      "nodes": {
        "n00": {
          "ids": ["DRUGBANK:DB00215"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n01": {
          "categories": ["biolink:Gene"]
        },
        "n02": {
          "ids": [
              "DOID:5844", 
              "DOID:1936", 
              "DOID:3393"
            ],
          "categories": ["biolink:Disease"]
        }
      }
    }
  }
}

tokebe commented 2 years ago

Regarding edge execution order and reversed status: As an example, _edgesSupported() already does the correct order by using an instance of EdgeManager and leveraging manager.getNext(), which handles order (See these lines). BTE does already log the edge execution and whether it's reversed in the main edge execution here (as shown in @colleenXu's screenshot), so in theory all that would need to be done is for @ericz1803 to copy that log into his code that gets the execution plan as edges are being checked.

ericz1803 commented 2 years ago

Just made a pull request based on these comments.

Some things that might be worth discussing still:

I didn't address this with my PR, but with the current LogEntry class/system, there is no way to choose which log level to log. We should consider switching to an out of the box logging solution or beefing up our existing solution
With more query parameters, is there anywhere we are documenting this? Also, it can be kind of tedious and error prone to manually edit smartapi.yaml file every time we add a new query parameter

tokebe commented 2 years ago

@ericz1803 Regarding your first point, log levels in BTE match the TRAPI standard defined here. The log level is selected by the first argument when creating a new LogEntry(), which you did in you PR -- there's nothing else that needs to be done there. The user can select which log levels to retrieve when making the query.

Regarding your second point, the most important place to add changes is smartapi.yaml, however I'm not sure that there's any place it's documented otherwise -- I would have thought the SmartAPI documentation online might update with changes to that file, but I guess not? @newgene Is there anywhere else we can document new parameters/etc for our endpoints?

newgene commented 2 years ago

Generally, yes, smartapi metadata is the place to document end-user facing query endpoints and parameters. It's typically a manual edit, which is fine for us, since the changes to its metadata are not too often.

When the feature is stable and deployed, we can leave a comment here or create a new issue to update the metadata. @colleenXu handles many smartapi metadata updates, so she may chime in on the optimal process for our team as well.

colleenXu commented 2 years ago

My understanding is that we currently have to manually edit the smartapi.yaml file in BTE main repo (used to validate what we send to the endpoints) and in the registry repo (used by the Smartapi registry, less important, just for documentation purposes...).

We've previously talked about only keeping 1 file and removing the other one...

colleenXu commented 2 years ago

Also, I know @ericz1803 and @tokebe discussed changing the endpoint to "Explain", given the opening post here by @newgene ....

But could we instead use something different like "edgeplan" or "dryrun"? Since I feel like "explain" is used often to mean "explain-type" querying (IDs on QNodes, attached together by QGraph). Related to #379

ericz1803 commented 2 years ago

@tokebe I made the changes for "dryrun" -> "explain" and changing the logging level to "DEBUG".

I second @colleenXu's point that "explain" could lead to some confusion. For the smartapi.yaml file, I meant that it doesn't feel great to have to put the parameter in 6 different place but maybe it's unavoidable with how the query urls are set up.

tokebe commented 2 years ago

@ericz1803 Apologies for the redundant work, given @colleenXu's comment dryrun does make more sense and should be used instead.

As for smartapi.yaml, yes, unfortunately this is just an unavoidable consequence of how it's structured/how that structure is used.

ericz1803 commented 2 years ago

All good, I made the change back to dryrun. If there's nothing else, the two PRs should be ready to merge

colleenXu commented 2 years ago

noting (this was originally sent in Slack):

I've reviewed these PRs for dryrun....and I think the dryrun parameter isn't working correctly with the asyncquery endpoints - it looks to me that the query is executing completely rather than stopping early (only logging the query plan).

It looks like it's working as-expected for the /query endpoints. But... The query-plan it gives still isn't what actually happens in execution....because it's not able to anticipate how many output IDs will happen after executing each edge. In reality, the edge-manager uses the number of "input" IDs for each edge to decide the execution order (doing lower number earlier)... Maybe this behavior is okay....since I'm not sure how to anticipate the output of each step beforehand...

I suggest: check with Chunlei on whether the current behavior is okay (only works on /query endpoints, only gives a "possible" query execution plan rather than the actual one) since he's the one who originally opened this issue

colleenXu commented 2 years ago

from discussion today:

the dryrun plan will accurately says the first edge to be executed, and it assumes that results/output IDs are found for each edge. This means that the order of the edges isn't necessarily correct after the first edge listed:

results/output IDs might not be found (sub-queries don't find any hits)
as mentioned above, edge-manager algorithm will look at the number of "input" IDs and pick the next edge based on the lower number of input IDs

biothings / biothings_explorer

Create a way to export the query execution plan #278

Addressing a few things: