If we went back to the older way that queryResult.update(records) worked -- where we called it every time we received a partial, incremental batch of records -- we could potentially fail fast if the results exploded, without needing to wait for all records from every endpoint before failing. This would require changes to query_results.js, and whatever calls queryResult.update would need a mechanism for handling a failed query.
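To make that concrete, here's a rough sketch of what the calling code might look like under that older, incremental behavior (all names here -- fetchRecordBatches, getResults, MAX_RESULTS -- are illustrative stand-ins, not the actual query_results.js interfaces):

```js
// Hypothetical sketch of incremental updates with a fail-fast check.
const MAX_RESULTS = 50000; // illustrative threshold

async function executeWithFailFast(queryResult, fetchRecordBatches) {
  // fetchRecordBatches() is assumed to yield partial batches of records
  // as they come back from the endpoints.
  for await (const recordBatch of fetchRecordBatches()) {
    queryResult.update(recordBatch); // called per partial batch, as in the older behavior
    if (queryResult.getResults().length > MAX_RESULTS) {
      // Fail fast instead of waiting for every endpoint to finish;
      // the caller needs a mechanism to handle this failure.
      throw new Error(
        `Query aborted: more than ${MAX_RESULTS} results; please narrow the query.`
      );
    }
  }
  return queryResult.getResults();
}
```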
Note from the 12/8 lab meeting: Andrew suggests we either have strict thresholds (say "no, we can't handle it") or give the user an option to override (say "no, I want the super large response, give it to me").
Just as an FYI, here's a rough draft paragraph I put together to describe this issue:
Some queries return too many results to be useful and/or practical to handle in memory, and often these queries fail after a long time in processing. For a better user experience and lower load on our compute resources, it would be beneficial for a query of this type to fail early and let the user know that a more tightly constrained query is required. In cases where the number of results is large but still capable of being handled in memory, we may send the user a warning with an option to terminate the query, but in cases where the query is certain to fail, we would send the user an error message and immediately terminate the query. In both cases, the message to the user could include suggestions for how to modify the query to avoid the explosion in the number of results, such as pinpointing which part(s) of the query graph are most responsible for the explosion.
Leaving this here as a reminder for a future discussion. We may be able to efficiently detect whether a result set will explode by using bloom filters or cuckoo filters. For each endpoint, normalize the IDs it contains, load each set of IDs into a bloom filter, and then serialize each bloom filter to save it for future use. Whenever a BTE instance is launched, it will load the serialized bloom filters so it can perform set intersections on the IDs from endpoints for each QNode, in order to get an upper limit on the number of results that can be returned. It won't give the actual results, just the number of results. The bloom filters would require less memory than trying to actually save the full sets of IDs.
@andrewsu and I had a discussion about this yesterday; a relatively simple solution that we will try, to see how it works, is to limit the results per edge in call-apis. This takes advantage of the behavior I noted in Slack, where the call-apis package buckets the queries into groups to limit it to 3 concurrent calls per API. It then builds up an array of results as these 'query buckets' return results. When the length of this results array reaches a certain threshold, we can stop querying for this edge and send back the partial results. We can then query the next edge using these partial results and continue repeating this process. In theory, if we have a limit of 1000 results per edge, then a query that goes 1 -> 2000 -> 50000 could be turned into 1 -> ~1000 -> ~1000. To give the user control of this, we could let them set this parameter using a query parameter.
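A minimal sketch of that idea, assuming the sub-queries for a QEdge are already grouped into buckets the way call-apis groups them (the names and data structures here are illustrative, not the actual call-apis internals):

```js
// Sketch: stop dispatching sub-query buckets for a QEdge once the
// accumulated records pass a threshold.
async function queryEdgeWithLimit(buckets, maxRecordsPerEdge = 1000) {
  // `buckets` is assumed to be an array of arrays of async sub-query
  // functions (e.g. limited to 3 concurrent calls per API).
  const records = [];
  for (const bucket of buckets) {
    const bucketRecords = await Promise.all(bucket.map((subQuery) => subQuery()));
    bucketRecords.forEach((res) => records.push(...res));
    if (records.length >= maxRecordsPerEdge) {
      // Stop querying this edge and return the partial records;
      // the next edge is then queried using only these partial records.
      break;
    }
  }
  return records;
}
```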
@ariutta Interesting idea, I don't know how bloom filters work but do you need to make the queries first to get the number of results or is this done another way?
For the bloom filter idea, you'd need to do a one-time query somehow to get a list of all IDs from an endpoint. If the endpoint has NCBIGene IDs connected to MONDO IDs, you'd get all the NCBIGene IDs from that endpoint and load them into a bloom filter, serialize that bloom filter and cache it. You'd do the same for all the MONDO IDs. All that would only need to happen once.
The reason for the bloom filter is memory savings. It's not practical to cache everything from every endpoint, but it might be feasible to keep in memory two bloom filters for every endpoint, where each bloom filter could probabilistically tell you whether a given ID is in that endpoint.
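To make the memory-savings point concrete, here's a minimal, self-contained sketch of how per-endpoint ID filters could be built, serialized, and cached (everything here -- the SimpleBloomFilter class, the sizes, the example IDs -- is an illustrative assumption; a real implementation would likely use an established bloom-filter library with tuned parameters):

```js
const crypto = require('crypto');

// Minimal bloom filter: a bit array plus k hash positions per ID.
class SimpleBloomFilter {
  constructor(bits = 8 * 1024 * 1024, numHashes = 7) {
    this.bits = bits;
    this.numHashes = numHashes;
    this.bitArray = Buffer.alloc(Math.ceil(bits / 8)); // ~1 MB for 8M bits
  }
  _positions(id) {
    // Derive numHashes bit positions from a single SHA-256 digest.
    const digest = crypto.createHash('sha256').update(id).digest();
    const positions = [];
    for (let i = 0; i < this.numHashes; i++) {
      positions.push(digest.readUInt32BE(i * 4) % this.bits);
    }
    return positions;
  }
  add(id) {
    for (const pos of this._positions(id)) {
      this.bitArray[pos >> 3] |= 1 << (pos & 7);
    }
  }
  has(id) {
    // May return false positives, never false negatives.
    return this._positions(id).every(
      (pos) => (this.bitArray[pos >> 3] & (1 << (pos & 7))) !== 0
    );
  }
  serialize() {
    return this.bitArray.toString('base64'); // cache this per endpoint + ID namespace
  }
}

// One-time build per endpoint: e.g. all normalized NCBIGene IDs it contains.
// `allNcbiGeneIdsFromEndpoint` stands in for the one-time bulk query.
const allNcbiGeneIdsFromEndpoint = ['NCBIGene:1017', 'NCBIGene:1956'];
const geneFilter = new SimpleBloomFilter();
allNcbiGeneIdsFromEndpoint.forEach((id) => geneFilter.add(id));
const cachedGeneFilter = geneFilter.serialize(); // loaded at BTE startup later
```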
Here's an example: take a user query like the one below, where we start with 1 specified ID, get 1k records after e0, 500k records after e1, but just 5k records after e2, because the endpoint associated with e2 had very limited data, so most of the paths ran into dead ends. If we had the bloom filters loaded and ready, we could check after getting the 500k records and "look ahead" (without actually querying the endpoint for e2) to see whether it's worth continuing. This "look ahead" would allow us to calculate the upper limit for the number of IDs that intersect at n2 for e1 and e2. We would do it by taking the set of IDs for e1 at n2, loading them into a bloom filter, and intersecting with the appropriate cached bloom filter for all IDs at n2 for the endpoint associated with e2. The intersection would give us the number of IDs that are common to n2 for both e1 and e2, so we'd be able to tell whether the 500k would be cut down to something more reasonable.
1        1k       500k     5k
n0 -e0-> n1 -e1-> n2 -e2-> n3
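Continuing the illustrative sketch above, the "look ahead" at n2 could be estimated like this (it assumes a cached filter object with a has(id) method, like the SimpleBloomFilter sketched earlier; the variable names are hypothetical):

```js
// Estimate how many of the ~500k IDs produced at n2 by e1 could survive e2,
// by checking each one against the cached filter of all IDs the e2 endpoint
// has at n2. The count is an upper bound, since bloom filters can give
// false positives but not false negatives.
function estimateIntersectionAtNode(idsFromPreviousEdge, cachedEndpointFilter) {
  let upperBound = 0;
  for (const id of idsFromPreviousEdge) {
    if (cachedEndpointFilter.has(id)) {
      upperBound += 1;
    }
  }
  return upperBound;
}

// Usage sketch: if the upper bound is tiny relative to 500k, most paths are
// dead ends and the 500k would be cut down to something more reasonable;
// if it's huge, we know the query is going to stay large.
// const upperBound = estimateIntersectionAtNode(idsAtN2FromE1, mondoFilterForE2Endpoint);
```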
PS: I said the caching would only need to happen once for an endpoint, but more accurately, it would need to happen whenever the endpoint experiences a major change, e.g., if it used to support 2k diseases but now it supports 3k.
I'm not sure whether it'd be worth it, but we could use the bloom filters before running any queries. For example, we could get the maximum number of IDs that are common to each node for each endpoint. Maybe that could be useful for query planning?
If we're using multiple endpoints to get records for a single QEdge, the actual cached bloom filters might not be per endpoint. Rather, we might generate two bloom filters for each predicate we support and/or two bloom filters for every pairwise combination of QNode categories that we support.
@ericz1803 I have tried the linked PRs and discussed with @andrewsu...there are 3 areas that need clarifying / some more discussion.
First, to clarify: the parameter is max_results_per_edge (🔺 it says by in the PR's comment...) with its value as an int (-1 to retrieve all).
Second, I ran the demo queries, and this code does not seem to be working as intended. This is my analysis, with particularly concerning queries marked 🔺:
The * means that there may be fewer results, depending on sub-queries failing or APIs timing out during the sync cron job.
Third, the demo query D3 hits an error....and it looks like it's related to the restriction and getting 0 results.
To address some of your points:
Different approach:
Incremental results assembly may be the long-term solution, but I'd be hesitant to try getting that done in the next few days. It could involve quite a few changes to different parts of the codebase.
Is there a way to make the limit on the edge dynamic (like 6000 records per input ID)?
The number above comes from what I'm running: I'll post an update soon but the limit needed to get all 16 results for demo query D3 is ~16k (a one-hop explain w/ 6 IDs on one node, 3 IDs on the other node)
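A sketch of what a flexible, input-dependent limit could look like (purely illustrative; the 6000-records-per-input-ID figure is just the number floated above, not a tuned value):

```js
// Hypothetical: scale the per-edge record limit by the number of input IDs
// on the QEdge, instead of one fixed cap for every edge.
const RECORDS_PER_INPUT_ID = 6000; // illustrative, from the discussion above

function maxRecordsForEdge(numInputIds, recordsPerInputId = RECORDS_PER_INPUT_ID) {
  return Math.max(1, numInputIds) * recordsPerInputId;
}

// e.g. an edge with 3 input IDs would get a cap of 18000 records,
// while an edge with a single input ID would keep a cap of 6000.
console.log(maxRecordsForEdge(3)); // 18000
```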
Call-apis returns 1000 query results, which then might get dropped/filtered in the process of querying by the query graph handler/edge manager, leading to fewer than 1000 results being sent back to the user.
To me, this dropping/filtering step is the key. I was hoping/assuming that the consolidation of edges to results would be at least somewhat predictable; maybe 1000 edges generally get consolidated to ~900 results. But clearly in practice, that varies quite a bit. In deciding the path forward, I'd like more clarity on two issues:
My guess is that the second item (and possibly the first) will be easiest to discuss live, perhaps at our Wednesday meeting...
I checked a couple of the queries and it looks like you're right, consolidation in the results building step looks like the main reason for the lower number of results. For B.2b, 1000 records are sent to the query results builder which turns them into 613 results sent back to the user. For predict queries (the first 5 that @colleenXu listed), this looks like the main cause. For explain queries (the rest), my guess is that it is a combination of intersecting IDs and this results consolidation. For D3 specifically, it is a one hop explain so that is probably why that is happening.
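For reference, a toy illustration of why record counts and result counts diverge (heavily simplified; real consolidation in query_results.js keys on the actual node/edge bindings):

```js
// Records that bind the same primary IDs to the QNodes consolidate into one
// result with multiple edge_bindings, so 1000 records can become ~600 results.
function consolidate(records) {
  const resultsByNodeBindings = new Map();
  for (const record of records) {
    const key = `${record.subjectId}|${record.objectId}`; // same QNode bindings
    if (!resultsByNodeBindings.has(key)) {
      resultsByNodeBindings.set(key, { nodeBindings: key, edgeBindings: [] });
    }
    resultsByNodeBindings.get(key).edgeBindings.push(record.edgeId);
  }
  return [...resultsByNodeBindings.values()];
}

// 3 records, but only 2 results (two records share the same subject/object pair):
console.log(
  consolidate([
    { subjectId: 'NCBIGene:1017', objectId: 'MONDO:0005148', edgeId: 'r0' },
    { subjectId: 'NCBIGene:1017', objectId: 'MONDO:0005148', edgeId: 'r1' },
    { subjectId: 'NCBIGene:1956', objectId: 'MONDO:0005148', edgeId: 'r2' },
  ]).length
); // 2
```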
Here are my findings from increasing the limit (see the full google sheet called "limit" here), with key findings marked with this symbol 📗
My impression is that a flexible limit dependent on the number of input IDs for the edge may be helpful.
Also! Setting the current limit to "1000 results out of the call-apis module" for a 1-hop Predict query doesn't result in...
📗 I found it surprising that it didn't lead to 1000 edges...
I got this info by checking the knowledge-graphs for one-hop predicts (B2a / B2b / B2d / B2e) after setting the limit to 1000:
Notes from meeting:
From my perspective...
setting a high limit like 20k might achieve what we want (but we'd have to do a lot of testing with queries to see...)
The current approach seems to revolve around "cutting off" the sub-querying / api calls. However, I'm a bit uncomfortable after seeing it in action: see C2 and D1 for examples.
For the "assemble results as we do the sub-querying for an QEdge", I see a lot of difficulty. I think we hit the same issue as above (we don't know how high to set a limit when an intersection is going to happen)
we may want to use different approaches like setting a lower cap than we currently have or pruning promiscuous nodes at the end of each hop.
@ericz1803, @colleenXu and @marcodarko, here's a list of lines that might be useful for logging in query_results.js:
debug(`${preresults.length} preresults found`);
This is the number of connected subject, predicate, object, ..., subject, predicate, object paths along the query graph. It corresponds to the number of results we would get if we never did any consolidation.
debug(`${consolidatedPreresults.length} consolidatedPreresults found`);
This is the number of results after consolidation.
We could add additional logging if desired to indicate which preresults are consolidated based on multiple edges for the same primary IDs, but that could be overwhelming, because there could be a large number of these. It's probably easier to just find them after the fact by looking for results with multiple values in edge_bindings. A similar thought applies for indicating how many records match a given primary ID while traversing the query graph -- we could add logging at L147, but it'd be overwhelming.
Regarding incremental results assembly, we discussed two ways it could be done:
1) calling queryResult.update every time we get records from an API
2) calling queryResult.update every time we get all records for a QEdge
I think both are technically possible, but 1) would probably be a little more complex in terms of developer time required. If done properly, I think for both cases the runtime performance would be the same or slightly better than our current algorithm.
This is assuming queryResult.update is called for QEdges serially, with records for one QEdge being returned until that QEdge is done, and the order of the QEdges matching the order seen while traversing the query graph starting from a single QNode. Other cases (e.g., calling queryResult.update simultaneously for QEdges on opposite sides of an explain query) are probably possible too, but we'd have to think through the details, and it could get complicated.
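A rough sketch of option 2) under that serial assumption (the names are illustrative, not the existing query_results.js interfaces):

```js
// Hypothetical: call queryResult.update once per QEdge, in traversal order
// starting from a single QNode, so results are assembled incrementally.
async function assembleIncrementally(queryResult, orderedQEdges, getRecordsForQEdge) {
  for (const qEdge of orderedQEdges) {
    const records = await getRecordsForQEdge(qEdge); // all records for this QEdge
    queryResult.update(records);
    if (queryResult.getResults().length === 0) {
      // Everything downstream is a dead end; skip the remaining QEdges.
      break;
    }
  }
  return queryResult.getResults();
}
```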
We also discussed two issues for why the number of results can be less than the specified cutoff limit -- consolidation and dead-ends.
Regarding consolidation, is this usually a big deal? If we set a limit of 1000, and some queries return 1000 results where each edge_binding has one item but other queries return 500 results where each edge_binding has two items, does that matter?
Regarding dead-ends, this may be uncommon, because it only happens for more complicated query graphs, like the one illustrated in the figure below. In this case, if we made API calls to get records for every QEdge except e3, incremental results assembly would allow us to skip making any API calls to get records with subject U for e3. Without incremental results assembly, we currently would make the "wasted" API call that gets record U->P.
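As a concrete sketch of that saving (hypothetical TRAPI-shaped partial results; U, P, and e3 refer to the figure mentioned above):

```js
// With incremental assembly we would only sub-query e3 for subjects that still
// appear in the partial results, so a dangling node like U is never queried
// and the "wasted" U->P record is never fetched.
function subjectsToQueryForEdge(partialResults, subjectQNodeId) {
  const liveIds = new Set();
  for (const result of partialResults) {
    for (const binding of result.node_bindings[subjectQNodeId] || []) {
      liveIds.add(binding.id);
    }
  }
  return [...liveIds];
}
```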
not relevant given recently implemented limits on time and record counts
@andrewsu proposes finding a way to detect when a query is going to return a ridiculously high number of results and return an error (will return too many results, please narrow the scope of the query) and gracefully exit.
Examples include demo queries B.3a (around 100k results), C3 (45k results), and D6 (61k results). Look here for those queries
Andrew proposed that this could be detected during the sub-querying step: when a sub-query returns a massive number of records. However, there are some concerns:
@ariutta proposed calculating the potential size of the results, right before the results assembly step....and stopping there. The concern there would be that queries may still execute for a while (scale of 10-30 min) before hitting this step, so we may want to find where we can intervene earlier....
Builds on the caps issue and promiscuous node issues, but a different angle + intervention point...
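A minimal sketch of that check, assuming it runs right before assembly on the preresults (the connected paths mentioned in the logging comment above); the threshold and names are illustrative:

```js
// Hypothetical guard right before the results assembly step.
const MAX_ALLOWED_PRERESULTS = 50000; // illustrative threshold

function checkPreresultCount(preresults) {
  if (preresults.length > MAX_ALLOWED_PRERESULTS) {
    // Fail with the "please narrow the scope" message instead of assembling
    // (and holding in memory) an enormous result set.
    throw new Error(
      `Will return too many results (${preresults.length} potential paths); ` +
        'please narrow the scope of the query.'
    );
  }
}
```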