biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

Graceful exit before full execution for queries that would have very large responses #363

Closed · colleenXu closed this 1 year ago

colleenXu commented 2 years ago

@andrewsu proposes finding a way to detect when a query is going to return a ridiculously high number of results, return an error ("will return too many results, please narrow the scope of the query"), and gracefully exit.

Examples include demo queries B.3a (around 100k results), C3 (45k results), and D6 (61k results). Look here for those queries.

Andrew proposed that this could be detected during the sub-querying step: when a sub-query returns a massive number of records. However, there are some concerns:

@ariutta proposed calculating the potential size of the results right before the results-assembly step, and stopping there. The concern is that queries may still execute for a while (on the scale of 10-30 min) before hitting this step, so we may want to find where we can intervene earlier...


This builds on the caps issue and the promiscuous-node issues, but with a different angle and intervention point...

ariutta commented 2 years ago

If we went back to the older way that queryResult.update(records) worked -- where we called it every time we received a partial, incremental batch of records -- we could potentially fail fast if the results exploded, without needing to wait for all records from every endpoint before failing.

This would require changes to query_results.js and would require that whatever calls queryResult.update would have a mechanism for handling a failed query.
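
Sketching the idea (a minimal sketch, assuming a hard cap and a `getResults` accessor; these names are hypothetical, not the current query_results.js API):

```js
// Hypothetical fail-fast wrapper around the older incremental calling pattern.
const MAX_RESULTS = 100000; // assumed cap, not an actual BTE setting

function onRecordBatch(queryResult, records) {
  // Under the older pattern, update() was called per partial batch of records.
  queryResult.update(records);
  if (queryResult.getResults().length > MAX_RESULTS) {
    // Whatever calls queryResult.update would need to catch this and abort
    // the remaining endpoint calls, returning an error to the user.
    throw new Error('Too many results; please narrow the scope of the query.');
  }
}
```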

colleenXu commented 2 years ago

Note from the 12/8 lab meeting: Andrew suggests...

either have strict thresholds (say "no, we can't handle it") or give the user an option to override (say "no, I want the super-large response, give it to me")
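
As a rough sketch of what that policy might look like (the threshold values and the overrideLargeResponse parameter name are made up for illustration):

```js
// Hypothetical policy check combining a warning threshold, a hard limit,
// and a user-supplied override; none of these names exist in BTE.
function checkResultCount(count, { warnAt = 10000, failAt = 50000, overrideLargeResponse = false } = {}) {
  if (count > failAt && !overrideLargeResponse) {
    return { action: 'error', message: 'Too many results; narrow the query or set overrideLargeResponse.' };
  }
  if (count > warnAt) {
    return { action: 'warn', message: `Large response (${count} results); this may be slow.` };
  }
  return { action: 'ok' };
}
```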

ariutta commented 2 years ago

Just as an FYI, here's a rough draft paragraph I put together to describe this issue:

Some queries return too many results to be useful and/or practical to handle in memory, and often these queries fail after a long time in processing. For a better user experience and lower load on our compute resources, it would be beneficial for a query of this type to fail early and let the user know that a more tightly constrained query is required. In cases where the number of results is large but still capable of being handled in memory, we may send the user a warning with an option to terminate the query, but in cases where the query is certain to fail, we would send the user an error message and immediately terminate the query. In both cases, the message to the user could include suggestions for how to modify the query to avoid the explosion in the number of results, such as pinpointing which part(s) of the query graph are most responsible for the explosion.

ariutta commented 2 years ago

Leaving this here as a reminder for a future discussion. We may be able to efficiently detect whether a result set will explode by using bloom filters or cuckoo filters. For each endpoint, normalize the IDs it contains, load each set of IDs into a bloom filter, and then serialize each bloom filter to save it for future use. Whenever a BTE instance is launched, it will load the serialized bloom filters so it can perform set intersections on the IDs from endpoints for each QNode, in order to get an upper limit on the number of results that could be returned. It won't give the actual results, just the number of results. The bloom filters will require far less memory than trying to save the full sets of IDs.
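
To make the build/serialize/load cycle concrete, here's a toy sketch (the class and its parameters are made up; a real implementation would likely use an existing npm library):

```js
// Toy bloom filter over normalized IDs: k bit positions per ID derived from
// one SHA-256 digest via double hashing. False positives are possible,
// false negatives are not.
const crypto = require('crypto');

class ToyBloomFilter {
  constructor(mBits, kHashes) {
    this.m = mBits;
    this.k = kHashes;
    this.bits = Buffer.alloc(Math.ceil(mBits / 8));
  }

  _positions(id) {
    const digest = crypto.createHash('sha256').update(String(id)).digest();
    const h1 = digest.readUInt32BE(0);
    const h2 = digest.readUInt32BE(4);
    const positions = [];
    for (let i = 0; i < this.k; i++) positions.push((h1 + i * h2) % this.m);
    return positions;
  }

  add(id) {
    for (const pos of this._positions(id)) this.bits[pos >> 3] |= 1 << (pos & 7);
  }

  has(id) {
    return this._positions(id).every(pos => (this.bits[pos >> 3] >> (pos & 7)) & 1);
  }

  // Serialize so a BTE instance could cache it and reload it at launch.
  serialize() {
    return JSON.stringify({ m: this.m, k: this.k, bits: this.bits.toString('base64') });
  }

  static deserialize(json) {
    const { m, k, bits } = JSON.parse(json);
    const filter = new ToyBloomFilter(m, k);
    filter.bits = Buffer.from(bits, 'base64');
    return filter;
  }
}

module.exports = { ToyBloomFilter };
```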

ericz1803 commented 2 years ago

@andrewsu and I had a discussion about this yesterday; a relatively simple solution that we will try and see how it works is to limit the results per edge in call-apis. This takes advantage of the behavior I noted in Slack, where the call-apis package buckets the queries into groups to limit them to 3 concurrent calls per API. It then builds up an array of results as these 'query buckets' return. When the length of this results array reaches a certain threshold, we can stop querying for this edge and send back the partial results. We can then query the next edge using these partial results and continue repeating this process. In theory, if we have a limit of 1000 results per edge, then a query that goes 1 -> 2000 -> 50000 could be turned into 1 -> ~1000 -> ~1000. To give the user control of this, we could let them set this limit via a query parameter. A sketch of the idea is below.
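
A minimal sketch of that per-edge cap, assuming each 'query bucket' is an array of sub-query functions returning record arrays (the names here are illustrative, not the actual call-apis internals):

```js
// Hypothetical per-edge cap: run buckets in sequence and stop early once
// the accumulated records for this edge reach the limit.
async function queryEdgeWithCap(buckets, limit = 1000) {
  const records = [];
  for (const bucket of buckets) {
    // call-apis limits concurrency to 3 calls per API within a bucket.
    const batches = await Promise.all(bucket.map(subQuery => subQuery()));
    records.push(...batches.flat());
    if (records.length >= limit) {
      // Send back partial results; the next edge is queried with these.
      return records.slice(0, limit);
    }
  }
  return records;
}
```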

@ariutta Interesting idea. I don't know how bloom filters work, but do you need to make the queries first to get the number of results, or is this done another way?

ariutta commented 2 years ago

For the bloom filter idea, you'd need to do a one-time query somehow to get a list of all IDs from an endpoint. If the endpoint has NCBIGene IDs connected to MONDO IDs, you'd get all the NCBIGene IDs from that endpoint and load them into a bloom filter, serialize that bloom filter and cache it. You'd do the same for all the MONDO IDs. All that would only need to happen once.

The reason for the bloom filter is memory savings. It's not practical to cache everything from every endpoint, but it might be feasible to keep in memory two bloom filters for every endpoint, where each bloom filter could probabilistically tell you whether a given ID is in that endpoint.
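
Rough numbers on that memory savings (assumed sizes, not measured BTE data): a filter sized for n IDs at false-positive probability p needs about m = -n·ln(p)/(ln 2)² bits.

```js
// Back-of-the-envelope: bloom filter vs. raw ID strings for one endpoint.
const n = 1e6;  // number of IDs in the endpoint (assumed)
const p = 0.01; // acceptable false-positive rate (assumed)
const mBits = Math.ceil(-n * Math.log(p) / Math.log(2) ** 2);
console.log((mBits / 8 / 1e6).toFixed(1) + ' MB for the filter');    // ~1.2 MB
console.log((n * 20 / 1e6).toFixed(0) + ' MB for raw ~20-byte IDs'); // ~20 MB
```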

Here's an example: take a user query like the one below, where we start with 1 specified ID, get 1k records after e0, 500k records after e1 but just 5k records after e2 because the endpoint associated with e2 had very limited data so most of the paths ran into dead ends. If we had the bloom filters loaded and ready, we could check after getting the 500k records and "look ahead" (without actually querying the endpoint for e2) to see whether it's worth continuing. This "look ahead" would allow us to calculate the upper limit for the number of IDs that intersect at n2 for e1 and e2. We would do it by taking the set of IDs for e1 at n2, loading them into a bloom filter and intersecting with the appropriate cached bloom filter for all IDs at n2 for the endpoint associated with e2. The intersection would give us the number of IDs that are common to n2 for both e1 and e2, so we'd be able to tell whether the 500k would be cut down to something more reasonable.

```
1        1k      500k      5k
n0 -e0-> n1 -e1-> n2 -e2-> n3
```

PS: I said the caching would only need to happen once for an endpoint, but more accurately, it would need to happen whenever the endpoint experiences a major change, e.g., if it used to support 2k diseases but now it supports 3k.
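
Building on the ToyBloomFilter sketched earlier, the look-ahead itself could look roughly like this (cachedE2Filter and idsAtN2AfterE1 are hypothetical names; the cardinality estimate n ≈ -(m/k)·ln(1 - X/m), where X is the number of set bits, is the standard one):

```js
// Assumes the ToyBloomFilter sketch from the earlier comment.
const { ToyBloomFilter } = require('./toy-bloom-filter');

// Bitwise AND of two same-sized filters over-approximates the intersection.
function intersect(a, b) {
  const out = new ToyBloomFilter(a.m, a.k);
  for (let i = 0; i < a.bits.length; i++) out.bits[i] = a.bits[i] & b.bits[i];
  return out;
}

// Estimate how many IDs a filter holds from its fraction of set bits.
function estimateCount(filter) {
  let setBits = 0;
  for (const byte of filter.bits) {
    for (let b = 0; b < 8; b++) setBits += (byte >> b) & 1;
  }
  return Math.round(-(filter.m / filter.k) * Math.log(1 - setBits / filter.m));
}

// Upper limit on IDs common to n2 for e1 and the endpoint behind e2.
function lookAheadUpperBound(idsAtN2AfterE1, cachedE2Filter) {
  const e1Filter = new ToyBloomFilter(cachedE2Filter.m, cachedE2Filter.k);
  for (const id of idsAtN2AfterE1) e1Filter.add(id);
  return estimateCount(intersect(e1Filter, cachedE2Filter));
}
```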

ariutta commented 2 years ago

I'm not sure whether it'd be worth it, but we could use the bloom filters before running any queries. For example, we could get the maximum number of IDs that are common to each node for each endpoint. Maybe that could be useful for query planning?

ariutta commented 2 years ago

If we're using multiple endpoints to get records for a single QEdge, the actual cached bloom filters might not be per endpoint. Rather, we might generate two bloom filters for each predicate we support and/or two bloom filters for every pairwise combination of QNode categories that we support.

colleenXu commented 2 years ago

@ericz1803 I have tried the linked PRs and discussed with @andrewsu; there are 3 areas that need clarifying or more discussion.

First, to clarify:


Second, I ran the demo queries, and this code does not seem to be working as intended.

This is my analysis, with the particularly concerning queries marked 🔺:

The * means that there may be fewer results, depending on sub-queries failing or APIs timing out during the sync cron job.


Third, demo query D3 hits an error, and it looks like it's related to the restriction and getting 0 results:

logs

```
bte:call-apis:query id annotation completes +1s
bte:call-apis:query Query completes +0ms
bte:biothings-explorer-trapi:batch_edge_query BTEEdges are successfully queried.... +9s
bte:biothings-explorer-trapi:batch_edge_query Filtering out any "undefined" items in (1000) results +0ms
bte:biothings-explorer-trapi:batch_edge_query Total number of results is (1000) +0ms
bte:biothings-explorer-trapi:batch_edge_query Start to update nodes... +0ms
bte:biothings-explorer-trapi:batch_edge_query Update nodes completed! +1ms
bte:biothings-explorer-trapi:UpdatedExeEdge (6) Storing results... +9s
bte:biothings-explorer-trapi:UpdatedExeEdge (6) Applying Node Constraints to 1000 results. +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (6) No constraints. Skipping... +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Updating nodes based on edge results... +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Updating Entities in "e00" +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Collecting Types: "["SmallMolecule"]" +1ms
bte:biothings-explorer-trapi:UpdatedExeEdge Collected entity ids in results: ["SmallMolecule"] +4ms
bte:biothings-explorer-trapi:NewQNode (8) Node "n00" restored curie. +9s
bte:biothings-explorer-trapi:NewQNode Node "n00" intersecting (6)/(314) curies... +0ms
bte:biothings-explorer-trapi:NewQNode Node "n00" kept (0) curies... +3ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Updating Entities in "e00" +4ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Collecting Types: "["Disease"]" +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge Collected entity ids in results: ["Disease"] +12ms
bte:biothings-explorer-trapi:NewQNode Node "n01" intersecting (3)/(1) curies... +13ms
bte:biothings-explorer-trapi:NewQNode Node "n01" kept (1) curies... +0ms
bte:biothings-explorer-trapi:edge-manager 'e00' Reversed[true] (0)--(1) entities / (1000) results. +9s
bte:biothings-explorer-trapi:edge-manager 'e00' dropped (1000) results. +12ms
bte:biothings-explorer-trapi:UpdatedExeEdge (6) Storing results... +12ms
bte:biothings-explorer-trapi:UpdatedExeEdge (6) Applying Node Constraints to 0 results. +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (6) No constraints. Skipping... +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Updating nodes based on edge results... +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Updating Entities in "e00" +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Collecting Types: "["SmallMolecule"]" +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge Collected entity ids in results: [] +0ms
bte:biothings-explorer-trapi:NewQNode Node "n00" saving (0) curies... +12ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Updating Entities in "e00" +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge (7) Collecting Types: "["Disease"]" +0ms
bte:biothings-explorer-trapi:UpdatedExeEdge Collected entity ids in results: [] +0ms
bte:biothings-explorer-trapi:NewQNode Node "n01" intersecting (1)/(0) curies... +0ms
bte:biothings-explorer-trapi:NewQNode Node "n01" kept (0) curies... +0ms
bte:biothings-explorer-trapi:edge-manager Updating all other edges... +0ms
bte:biothings-explorer-trapi:main (10) Edge successfully queried. +9s
bte:biothings-explorer-trapi:edge-manager (11) Collecting results... +0ms
bte:biothings-explorer-trapi:edge-manager (11) 'e00' keeps (0) results! +0ms
bte:biothings-explorer-trapi:edge-manager ---------- +0ms
bte:biothings-explorer-trapi:edge-manager (12) Edges ["e00"] resulted in (0) results. No complete paths can be formed. +0ms
bte:biothings-explorer-trapi:edge-manager (12) Collected results for: []! +0ms
bte:biothings-explorer-trapi:edge-manager (12) Collected (0) results! +0ms
bte:biothings-explorer-trapi:edge-manager (13) Edge Manager reporting combined results... +1ms
bte:biothings-explorer-trapi:Graph Updating BTE Graph now. +8m
bte:biothings-explorer-trapi:edge-manager (13) Edge Manager reporting organized results... +0ms
bte:biothings-explorer-trapi:QueryResult Updating query results now! +8m
bte:biothings-explorer-trapi:error_handler TypeError: Cannot destructure property 'connected_to' of 'dataByEdge[queryEdgeID]' as it is undefined.
bte:biothings-explorer-trapi:error_handler at QueryResult._getPreresults (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/query_results.js:109:17)
bte:biothings-explorer-trapi:error_handler at QueryResult.update (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/query_results.js:237:14)
bte:biothings-explorer-trapi:error_handler at TRAPIQueryHandler.query (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/index.js:185:27)
bte:biothings-explorer-trapi:error_handler at runMicrotasks (<anonymous>)
bte:biothings-explorer-trapi:error_handler at processTicksAndRejections (node:internal/process/task_queues:96:5)
bte:biothings-explorer-trapi:error_handler at async task (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/bte-trapi/src/routes/v1/query_v1.js:38:13)
bte:biothings-explorer-trapi:error_handler at async /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/bte-trapi/src/controllers/threading/threadHandler.js:86:38 +0ms
```

ericz1803 commented 2 years ago

To address some of your points:

Different approach:

ariutta commented 2 years ago

Incremental results assembly may be the long-term solution, but I'd be hesitant to try getting that done in the next few days. It could involve quite a few changes to different parts of the codebase.

colleenXu commented 2 years ago

Is there a way to make the limit on the edge dynamic (like 6000 records per input ID)?

The number above comes from what I've been running: I'll post an update soon, but the limit needed to get all 16 results for demo query D3 is ~16k (it's a one-hop Explain with 6 IDs on one node and 3 IDs on the other node).
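
A sketch of what that dynamic limit could be (hypothetical; the 6000 figure is just the example above):

```js
// Scale the per-edge record cap by the number of input IDs on the edge.
const RECORDS_PER_INPUT_ID = 6000; // example value from the comment above

function edgeRecordLimit(inputIdCount, perId = RECORDS_PER_INPUT_ID) {
  return perId * inputIdCount; // e.g. 3 input IDs -> cap of 18000 records
}
```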

andrewsu commented 2 years ago

Call-apis returns 1000 query results, which then might get dropped/filtered in the process of querying by the query graph handler/edge manager, leading to fewer than 1000 results being sent back to the user.

To me, this dropping/filtering step is the key. I was hoping/assuming that the consolidation of edges into results would be at least somewhat predictable; maybe 1000 edges generally get consolidated into ~900 results. But clearly, in practice, that varies quite a bit. In deciding the path forward, I'd like more clarity on two issues:

My guess is that the second item (and possibly the first) will be easiest to discuss live, perhaps at our Wednesday meeting...

ericz1803 commented 2 years ago

I checked a couple of the queries, and it looks like you're right: consolidation in the results-building step looks like the main reason for the lower number of results. For B.2b, 1000 records are sent to the query-results builder, which turns them into 613 results sent back to the user. For Predict queries (the first 5 that @colleenXu listed), this looks like the main cause. For Explain queries (the rest), my guess is that it's a combination of intersecting IDs and this results consolidation. For D3 specifically, it's a one-hop Explain, so that is probably why that's happening.

colleenXu commented 2 years ago

Here are my findings from increasing the limit (see the full Google Sheet called "limit" here); key findings are marked with 📗.

My impression is that a flexible limit that depends on the number of input IDs for the edge may be helpful.

colleenXu commented 2 years ago

Also! Setting the current limit to "1000 results out of the call-apis module" for a 1-hop Predict query doesn't result in...

📗 I found it surprising that it didn't lead to 1000 edges...


I got this info by checking the knowledge-graphs for one-hop predicts (B2a / B2b / B2d / B2e) after setting the limit to 1000:

colleenXu commented 2 years ago

Notes from meeting:

colleenXu commented 2 years ago

From my perspective...

ariutta commented 2 years ago

@ericz1803, @colleenXu and @marcodarko, here's a list of lines that might be useful for logging in query_results.js:

We could add additional logging, if desired, to indicate which preresults are consolidated based on multiple edges for the same primary IDs, but that could be overwhelming, because there could be a large number of these. It's probably easier to just find them after the fact by looking for results with multiple values in edge_bindings. A similar thought applies to indicating how many records match a given primary ID while traversing the query graph -- we could add logging at L147, but it'd be overwhelming.

ariutta commented 2 years ago

Regarding incremental results assembly, we discussed two ways it could be done:

  1. call queryResult.update every time we get records from an API
  2. call queryResult.update every time we get all records for a QEdge

I think both are technically possible, but option 1 would probably be a little more complex in terms of developer time required. If done properly, I think in both cases the runtime performance would be the same as, or slightly better than, our current algorithm.

This assumes queryResult.update is called for QEdges serially, with records for one QEdge being returned until that QEdge is done, and with the order of the QEdges matching the order seen while traversing the query graph starting from a single QNode. Other cases (e.g., calling queryResult.update simultaneously for QEdges on opposite sides of an Explain query) are probably possible too, but we'd have to think through the details, and it could get complicated. A sketch of the serial case follows.
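
A sketch of the serial case (hypothetical API; qEdge.execute and queryResult.getResults are assumed names, not the current query_graph_handler interface):

```js
// Option 2: update results once per QEdge, in traversal order, bailing out
// early on a dead end or a result explosion.
async function executeSerially(qEdgesInTraversalOrder, queryResult, maxResults) {
  for (const qEdge of qEdgesInTraversalOrder) {
    const records = await qEdge.execute(); // assumed: all records for this QEdge
    queryResult.update(records);
    const count = queryResult.getResults().length;
    if (count === 0) break; // dead end: no complete paths possible
    if (count > maxResults) {
      throw new Error('Result explosion; please narrow the query.');
    }
  }
  return queryResult.getResults();
}
```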

We also discussed two issues for why the number of results can be less than the specified cutoff limit -- consolidation and dead-ends.

Regarding consolidation, is this usually a big deal? If we set a limit of 1000, and some queries return 1000 results where each edge_binding has one item, but other queries return 500 results where each edge_binding has two items, does that matter?

Regarding dead-ends, this may be uncommon, because it only happens for more complicated query graphs, like the one illustrated in the figure below. In this case, if we made API calls to get records for every QEdge except e3, incremental results assembly would allow us to skip making any API calls to get records with subject U for e3. Without incremental results assembly, we currently would make the "wasted" API call that gets record U->P.

[figure: example query graph where the record U->P for QEdge e3 is a dead end]

andrewsu commented 1 year ago

Not relevant given the recently implemented limits on time and record counts.