If we went back to the older way that queryResult.update(records) worked -- where we called it every time we received a partial, incremental batch of records -- we could potentially fail fast if the results exploded, without needing to wait for all records from every endpoint before failing. This would require changes to query_results.js, and whatever calls queryResult.update would need a mechanism for handling a failed query.
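To make that concrete, here's a rough sketch of what the calling code might look like under that older, incremental behavior (all names here -- fetchRecordBatches, getResults, MAX_RESULTS -- are illustrative stand-ins, not the actual query_results.js interfaces):

```js
// Hypothetical sketch of incremental updates with a fail-fast check.
const MAX_RESULTS = 50000; // illustrative threshold

async function executeWithFailFast(queryResult, fetchRecordBatches) {
  // fetchRecordBatches() is assumed to yield partial batches of records
  // as they come back from the endpoints.
  for await (const recordBatch of fetchRecordBatches()) {
    queryResult.update(recordBatch); // called per partial batch, as in the older behavior
    if (queryResult.getResults().length > MAX_RESULTS) {
      // Fail fast instead of waiting for every endpoint to finish;
      // the caller needs a mechanism to handle this failure.
      throw new Error(
        `Query aborted: more than ${MAX_RESULTS} results; please narrow the query.`
      );
    }
  }
  return queryResult.getResults();
}
```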
Note from the 12/8 lab meeting: Andrew suggests we either have strict thresholds (say "no, we can't handle it") or give the user an option to override (say "no, I want the super large response, give it to me").
Just as an FYI, here's a rough draft paragraph I put together to describe this issue:
Some queries return too many results to be useful and/or practical to handle in memory, and often these queries fail after a long time in processing. For a better user experience and lower load on our compute resources, it would be beneficial for a query of this type to fail early and let the user know that a more tightly constrained query is required. In cases where the number of results is large but still capable of being handled in memory, we may send the user a warning with an option to terminate the query, but in cases where the query is certain to fail, we would send the user an error message and immediately terminate the query. In both cases, the message to the user could include suggestions for how to modify the query to avoid the explosion in the number of results, such as pinpointing which part(s) of the query graph are most responsible for the explosion.
Leaving this here as a reminder for a future discussion. We may be able to efficiently detect whether a result set will explode by using bloom filters or cuckoo filters. For each endpoint, normalize the IDs it contains, load each set of IDs into a bloom filter, and then serialize each bloom filter to save it for future use. Whenever a BTE instance is launched, it will load the serialized bloom filters so it can perform set intersections on the IDs from endpoints for each QNode, in order to get an upper limit on the number of results that can be returned. It won't give the actual results, just the number of results. The bloom filters would require less memory than trying to actually save the full sets of IDs.
@andrewsu and I had a discussion about this yesterday; a relatively simple solution that we will try, to see how it works, is to limit the results per edge in call-apis. This takes advantage of the behavior I noted in Slack, where the call-apis package buckets the queries into groups to limit it to 3 concurrent calls per API. It then builds up an array of results as these 'query buckets' return results. When the length of this results array reaches a certain threshold, we can stop querying for this edge and send back the partial results. We can then query the next edge using these partial results and continue repeating this process. In theory, if we have a limit of 1000 results per edge, then a query that goes 1 -> 2000 -> 50000 could be turned into 1 -> ~1000 -> ~1000. To give the user control of this, we could let them set this parameter using a query parameter.
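A minimal sketch of that idea, assuming the sub-queries for a QEdge are already grouped into buckets the way call-apis groups them (the names and data structures here are illustrative, not the actual call-apis internals):

```js
// Sketch: stop dispatching sub-query buckets for a QEdge once the
// accumulated records pass a threshold.
async function queryEdgeWithLimit(buckets, maxRecordsPerEdge = 1000) {
  // `buckets` is assumed to be an array of arrays of async sub-query
  // functions (e.g. limited to 3 concurrent calls per API).
  const records = [];
  for (const bucket of buckets) {
    const bucketRecords = await Promise.all(bucket.map((subQuery) => subQuery()));
    bucketRecords.forEach((res) => records.push(...res));
    if (records.length >= maxRecordsPerEdge) {
      // Stop querying this edge and return the partial records;
      // the next edge is then queried using only these partial records.
      break;
    }
  }
  return records;
}
```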
@ariutta Interesting idea, I don't know how bloom filters work but do you need to make the queries first to get the number of results or is this done another way?
For the bloom filter idea, you'd need to do a one-time query somehow to get a list of all IDs from an endpoint. If the endpoint has NCBIGene IDs connected to MONDO IDs, you'd get all the NCBIGene IDs from that endpoint and load them into a bloom filter, serialize that bloom filter and cache it. You'd do the same for all the MONDO IDs. All that would only need to happen once.
The reason for the bloom filter is memory savings. It's not practical to cache everything from every endpoint, but it might be feasible to keep in memory two bloom filters for every endpoint, where each bloom filter could probabilistically tell you whether a given ID is in that endpoint.
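To make the memory-savings point concrete, here's a minimal, self-contained sketch of how per-endpoint ID filters could be built, serialized, and cached (everything here -- the SimpleBloomFilter class, the sizes, the example IDs -- is an illustrative assumption; a real implementation would likely use an established bloom-filter library with tuned parameters):

```js
const crypto = require('crypto');

// Minimal bloom filter: a bit array plus k hash positions per ID.
class SimpleBloomFilter {
  constructor(bits = 8 * 1024 * 1024, numHashes = 7) {
    this.bits = bits;
    this.numHashes = numHashes;
    this.bitArray = Buffer.alloc(Math.ceil(bits / 8)); // ~1 MB for 8M bits
  }
  _positions(id) {
    // Derive numHashes bit positions from a single SHA-256 digest.
    const digest = crypto.createHash('sha256').update(id).digest();
    const positions = [];
    for (let i = 0; i < this.numHashes; i++) {
      positions.push(digest.readUInt32BE(i * 4) % this.bits);
    }
    return positions;
  }
  add(id) {
    for (const pos of this._positions(id)) {
      this.bitArray[pos >> 3] |= 1 << (pos & 7);
    }
  }
  has(id) {
    // May return false positives, never false negatives.
    return this._positions(id).every(
      (pos) => (this.bitArray[pos >> 3] & (1 << (pos & 7))) !== 0
    );
  }
  serialize() {
    return this.bitArray.toString('base64'); // cache this per endpoint + ID namespace
  }
}

// One-time build per endpoint: e.g. all normalized NCBIGene IDs it contains.
// `allNcbiGeneIdsFromEndpoint` stands in for the one-time bulk query.
const allNcbiGeneIdsFromEndpoint = ['NCBIGene:1017', 'NCBIGene:1956'];
const geneFilter = new SimpleBloomFilter();
allNcbiGeneIdsFromEndpoint.forEach((id) => geneFilter.add(id));
const cachedGeneFilter = geneFilter.serialize(); // loaded at BTE startup later
```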
Here's an example: take a user query like the one below, where we start with 1 specified ID, get 1k records after e0, 500k records after e1, but just 5k records after e2, because the endpoint associated with e2 had very limited data, so most of the paths ran into dead ends. If we had the bloom filters loaded and ready, we could check after getting the 500k records and "look ahead" (without actually querying the endpoint for e2) to see whether it's worth continuing. This "look ahead" would allow us to calculate the upper limit for the number of IDs that intersect at n2 for e1 and e2. We would do it by taking the set of IDs for e1 at n2, loading them into a bloom filter, and intersecting with the appropriate cached bloom filter for all IDs at n2 for the endpoint associated with e2. The intersection would give us the number of IDs that are common to n2 for both e1 and e2, so we'd be able to tell whether the 500k would be cut down to something more reasonable.
1        1k       500k     5k
n0 -e0-> n1 -e1-> n2 -e2-> n3
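Continuing the illustrative sketch above, the "look ahead" at n2 could be estimated like this (it assumes a cached filter object with a has(id) method, like the SimpleBloomFilter sketched earlier; the variable names are hypothetical):

```js
// Estimate how many of the ~500k IDs produced at n2 by e1 could survive e2,
// by checking each one against the cached filter of all IDs the e2 endpoint
// has at n2. The count is an upper bound, since bloom filters can give
// false positives but not false negatives.
function estimateIntersectionAtNode(idsFromPreviousEdge, cachedEndpointFilter) {
  let upperBound = 0;
  for (const id of idsFromPreviousEdge) {
    if (cachedEndpointFilter.has(id)) {
      upperBound += 1;
    }
  }
  return upperBound;
}

// Usage sketch: if the upper bound is tiny relative to 500k, most paths are
// dead ends and the 500k would be cut down to something more reasonable;
// if it's huge, we know the query is going to stay large.
// const upperBound = estimateIntersectionAtNode(idsAtN2FromE1, mondoFilterForE2Endpoint);
```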
PS: I said the caching would only need to happen once for an endpoint, but more accurately, it would need to happen whenever the endpoint experiences a major change, e.g., if it used to support 2k diseases but now it supports 3k.
I'm not sure whether it'd be worth it, but we could use the bloom filters before running any queries. For example, we could get the maximum number of IDs that are common to each node for each endpoint. Maybe that could be useful for query planning?
If we're using multiple endpoints to get records for a single QEdge, the actual cached bloom filters might not be per endpoint. Rather, we might generate two bloom filters for each predicate we support and/or two bloom filters for every pairwise combination of QNode categories that we support.
@ericz1803 I have tried the linked PRs and discussed with @andrewsu...there are 3 areas that need clarifying / some more discussion.
First, to clarify: the parameter is max_results_per_edge (🔺 it says by in the PR's comment...) with its value as an int (-1 to retrieve all).
Second, I ran the demo queries, and this code does not seem to be working as intended. This is my analysis, with particularly concerning queries marked 🔺:
The * means that there may be fewer results, depending on sub-queries failing or APIs timing out during the sync cron job.
Third, the demo query D3 hits an error....and it looks like it's related to the restriction and getting 0 results.
To address some of your points:
Different approach:
Incremental results assembly may be the long-term solution, but I'd be hesitant to try getting that done in the next few days. It could involve quite a few changes to different parts of the codebase.
Is there a way to make the limit on the edge dynamic (like 6000 records per input ID)?
The number above comes from what I'm running: I'll post an update soon but the limit needed to get all 16 results for demo query D3 is ~16k (a one-hop explain w/ 6 IDs on one node, 3 IDs on the other node)
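A sketch of what a flexible, input-dependent limit could look like (purely illustrative; the 6000-records-per-input-ID figure is just the number floated above, not a tuned value):

```js
// Hypothetical: scale the per-edge record limit by the number of input IDs
// on the QEdge, instead of one fixed cap for every edge.
const RECORDS_PER_INPUT_ID = 6000; // illustrative, from the discussion above

function maxRecordsForEdge(numInputIds, recordsPerInputId = RECORDS_PER_INPUT_ID) {
  return Math.max(1, numInputIds) * recordsPerInputId;
}

// e.g. an edge with 3 input IDs would get a cap of 18000 records,
// while an edge with a single input ID would keep a cap of 6000.
console.log(maxRecordsForEdge(3)); // 18000
```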
Call-apis returns 1000 query results, which then might get dropped/filtered in the process of querying by the query graph handler/edge manager, leading to fewer than 1000 results being sent back to the user.
To me, this dropping/filtering step is the key. I was hoping/assuming that the consolidation of edges to results would be at least somewhat predictable; maybe 1000 edges generally get consolidated to ~900 results. But clearly in practice, that varies quite a bit. In deciding the path forward, I'd like more clarity on two issues:
My guess is that the second item (and possibly the first) will be easiest to discuss live, perhaps at our Wednesday meeting...
I checked a couple of the queries and it looks like you're right, consolidation in the results building step looks like the main reason for the lower number of results. For B.2b, 1000 records are sent to the query results builder which turns them into 613 results sent back to the user. For predict queries (the first 5 that @colleenXu listed), this looks like the main cause. For explain queries (the rest), my guess is that it is a combination of intersecting IDs and this results consolidation. For D3 specifically, it is a one hop explain so that is probably why that is happening.
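For reference, a toy illustration of why record counts and result counts diverge (heavily simplified; real consolidation in query_results.js keys on the actual node/edge bindings):

```js
// Records that bind the same primary IDs to the QNodes consolidate into one
// result with multiple edge_bindings, so 1000 records can become ~600 results.
function consolidate(records) {
  const resultsByNodeBindings = new Map();
  for (const record of records) {
    const key = `${record.subjectId}|${record.objectId}`; // same QNode bindings
    if (!resultsByNodeBindings.has(key)) {
      resultsByNodeBindings.set(key, { nodeBindings: key, edgeBindings: [] });
    }
    resultsByNodeBindings.get(key).edgeBindings.push(record.edgeId);
  }
  return [...resultsByNodeBindings.values()];
}

// 3 records, but only 2 results (two records share the same subject/object pair):
console.log(
  consolidate([
    { subjectId: 'NCBIGene:1017', objectId: 'MONDO:0005148', edgeId: 'r0' },
    { subjectId: 'NCBIGene:1017', objectId: 'MONDO:0005148', edgeId: 'r1' },
    { subjectId: 'NCBIGene:1956', objectId: 'MONDO:0005148', edgeId: 'r2' },
  ]).length
); // 2
```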
Here are my findings from increasing the limit (see the full google sheet called "limit" here), with key findings marked with this symbol 📗
My impression is that a flexible limit dependent on the number of input IDs for the edge may be helpful.
Also! Setting the current limit to "1000 results out of the call-apis module" for a 1-hop Predict query doesn't result in...
📗 I found it surprising that it didn't lead to 1000 edges...
I got this info by checking the knowledge-graphs for one-hop predicts (B2a / B2b / B2d / B2e) after setting the limit to 1000:
Notes from meeting:
From my perspective...
setting a high limit like 20k might achieve what we want (but we'd have to do a lot of testing with queries to see...)
The current approach seems to revolve around "cutting off" the sub-querying / api calls. However, I'm a bit uncomfortable after seeing it in action: see C2 and D1 for examples.
For the "assemble results as we do the sub-querying for an QEdge", I see a lot of difficulty. I think we hit the same issue as above (we don't know how high to set a limit when an intersection is going to happen)
we may want to use different approaches like setting a lower cap than we currently have or pruning promiscuous nodes at the end of each hop.
@ericz1803, @colleenXu and @marcodarko, here's a list of lines that might be useful for logging in query_results.js:
debug(`${preresults.length} preresults found`);
This is the number of connected subject, predicate, object, ..., subject, predicate, object paths along the query graph. It corresponds to the number of results we would get if we never did any consolidation.
debug(`${consolidatedPreresults.length} consolidatedPreresults found`);
This is the number of results after consolidation.
We could add additional logging if desired to indicate which preresults are consolidated based on multiple edges for the same primary IDs, but that could be overwhelming, because there could be a large number of these. It's probably easier to just find them after the fact by looking for results with multiple values in edge_bindings. A similar thought applies for indicating how many records match a given primary ID while traversing the query graph -- we could add logging at L147, but it'd be overwhelming.
Regarding incremental results assembly, we discussed two ways it could be done:
1) calling queryResult.update every time we get records from an API
2) calling queryResult.update every time we get all records for a QEdge
I think both are technically possible, but 1) would probably be a little more complex in terms of developer time required. If done properly, I think for both cases the runtime performance would be the same or slightly better than our current algorithm.
This is assuming queryResult.update is called for QEdges serially, with records for one QEdge being returned until that QEdge is done, and the order of the QEdges matching the order seen while traversing the query graph starting from a single QNode. Other cases (e.g., calling queryResult.update simultaneously for QEdges on opposite sides of an explain query) are probably possible too, but we'd have to think through the details, and it could get complicated.
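A rough sketch of option 2) under that serial assumption (the names are illustrative, not the existing query_results.js interfaces):

```js
// Hypothetical: call queryResult.update once per QEdge, in traversal order
// starting from a single QNode, so results are assembled incrementally.
async function assembleIncrementally(queryResult, orderedQEdges, getRecordsForQEdge) {
  for (const qEdge of orderedQEdges) {
    const records = await getRecordsForQEdge(qEdge); // all records for this QEdge
    queryResult.update(records);
    if (queryResult.getResults().length === 0) {
      // Everything downstream is a dead end; skip the remaining QEdges.
      break;
    }
  }
  return queryResult.getResults();
}
```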
We also discussed two issues for why the number of results can be less than the specified cutoff limit -- consolidation and dead-ends.
Regarding consolidation, is this usually a big deal? If we set a limit of 1000, and some queries return 1000 results where each edge_binding has one item but other queries return 500 results where each edge_binding has two items, does that matter?
Regarding dead-ends, this may be uncommon, because it only happens for more complicated query graphs, like the one illustrated in the figure below. In this case, if we made API calls to get records for every QEdge except e3, incremental results assembly would allow us to skip making any API calls to get records with subject U for e3. Without incremental results assembly, we currently would make the "wasted" API call that gets record U->P.
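As a concrete sketch of that saving (hypothetical TRAPI-shaped partial results; U, P, and e3 refer to the figure mentioned above):

```js
// With incremental assembly we would only sub-query e3 for subjects that still
// appear in the partial results, so a dangling node like U is never queried
// and the "wasted" U->P record is never fetched.
function subjectsToQueryForEdge(partialResults, subjectQNodeId) {
  const liveIds = new Set();
  for (const result of partialResults) {
    for (const binding of result.node_bindings[subjectQNodeId] || []) {
      liveIds.add(binding.id);
    }
  }
  return [...liveIds];
}
```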
not relevant given recently implemented limits on time and record counts
@andrewsu proposes finding a way to detect when a query is going to return a ridiculously high number of results and return an error (will return too many results, please narrow the scope of the query) and gracefully exit.
Examples include demo queries B.3a (around 100k results), C3 (45k results), and D6 (61k results). Look here for those queries
Andrew proposed that this could be detected during the sub-querying step: when a sub-query returns a massive number of records. However, there are some concerns:
@ariutta proposed calculating the potential size of the results, right before the results assembly step....and stopping there. The concern there would be that queries may still execute for a while (scale of 10-30 min) before hitting this step, so we may want to find where we can intervene earlier....
Builds on the caps issue and promiscuous node issues, but a different angle + intervention point...
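A minimal sketch of that check, assuming it runs right before assembly on the preresults (the connected paths mentioned in the logging comment above); the threshold and names are illustrative:

```js
// Hypothetical guard right before the results assembly step.
const MAX_ALLOWED_PRERESULTS = 50000; // illustrative threshold

function checkPreresultCount(preresults) {
  if (preresults.length > MAX_ALLOWED_PRERESULTS) {
    // Fail with the "please narrow the scope" message instead of assembling
    // (and holding in memory) an enormous result set.
    throw new Error(
      `Will return too many results (${preresults.length} potential paths); ` +
        'please narrow the scope of the query.'
    );
  }
}
```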