biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://api.bte.ncats.io
Apache License 2.0

implement configurable cap on number of entities being tracked #324

Closed andrewsu closed 2 years ago

andrewsu commented 2 years ago

For longer and/or open-ended queries, the number of entities being tracked by BTE can grow absurdly high. These cases may contribute to out-of-memory errors and server instability. As one possible solution, we could implement a configurable cap on the number of entities being tracked by BTE. If that cap is exceeded at any point in the execution, BTE could respond with an error and gracefully exit.
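A minimal sketch of what a configurable cap might look like, assuming a hypothetical `ENTITY_MAX` environment variable and a hypothetical `checkEntityCap` helper (neither name is taken from the actual BTE codebase):

```javascript
// Hypothetical sketch: read a configurable cap from the environment,
// falling back to a default of 1000 entities.
const ENTITY_MAX = parseInt(process.env.ENTITY_MAX, 10) || 1000;

// Throw if the number of tracked entities exceeds the cap, so the
// server can respond with an error instead of running out of memory.
function checkEntityCap(entityCount, edgeID) {
  if (entityCount > ENTITY_MAX) {
    throw new Error(
      `Max number of entities exceeded (${ENTITY_MAX}) in '${edgeID}'`
    );
  }
}
```

The caller would invoke this at whatever point entities are counted, and translate the thrown error into a graceful response rather than letting the process die.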

https://github.com/biothings/BioThings_Explorer_TRAPI/issues/323 may contain a possible example query to test.

colleenXu commented 2 years ago

The options explored were:

  1. Based on input to an edge: when the edge manager is deciding the next edge to execute, if the "next best" edge has more than some limit of IDs in its input node, stop execution and return an error saying that a step in the query had more than the limit of IDs as input, so it was too large.
  2. Based on output of an edge: when the sub-queries are being executed, if one returns more than some limit of IDs (after the api-response-transform / ID-resolution work?), stop execution and return an error saying that a step in the query had more than the limit of IDs as output, so it was too large.
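As an illustration of option 1 only (the `edge.inputIDs` and `edge.id` properties and the `MAX_ENTITIES` constant are hypothetical, not taken from the BTE codebase), the input-side check might look roughly like:

```javascript
const MAX_ENTITIES = 1000; // hypothetical configurable limit

// Option 1 sketch: before executing the "next best" edge, count the IDs
// attached to its input node and bail out if the cap is exceeded.
function assertEdgeInputWithinCap(edge) {
  const inputCount = edge.inputIDs.length;
  if (inputCount > MAX_ENTITIES) {
    throw new Error(
      `Edge '${edge.id}' has ${inputCount} input IDs, ` +
      `exceeding the limit of ${MAX_ENTITIES}`
    );
  }
}
```

Option 2 would be the same check applied to the ID count coming out of a sub-query instead of going into one.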
colleenXu commented 2 years ago

Exactly what the "some limit" should be depended on which option was chosen and on what was happening in the query specified in #323.

marcodarko commented 2 years ago

Will be addressed in https://github.com/biothings/bte_trapi_query_graph_handler/pull/53

colleenXu commented 2 years ago

PR has been merged and deployed. Closing...

andrewsu commented 2 years ago

@marcodarko to add a sample query where this threshold will be triggered, and a sample output showing the error message

colleenXu commented 2 years ago

I believe Marco is still working on this issue (quote from Slack)

> I'm actually gonna make some changes to the entity max solution, I realized it was getting invoked at the wrong place so it wasn't always checked... so fixing that but also how the error is thrown, I don't think I can send a 200 code error (not sure if possible actually)

colleenXu commented 2 years ago

Other examples

Note that I'm using a local API list (it removes the pending BioThings APIs, except for the clinical risk KP API and multiomics wellness API) for all of these examples...

Example 1

This 1-hop query returns just over 1000 IDs (1060)...

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": { "ids": ["NCBIGene:7157"], "categories": ["biolink:Gene"] },
        "n1": { "categories": ["biolink:Disease"] }
      },
      "edges": {
        "e0": { "subject": "n0", "object": "n1" }
      }
    }
  }
}
```

Therefore, we expect an error to be triggered if we add another hop that uses those 1060 IDs as input. This does happen...

The returned response (which includes the query):

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": { "ids": ["NCBIGene:7157"], "categories": ["biolink:Gene"] },
        "n1": { "categories": ["biolink:Disease"] },
        "n2": { "categories": ["biolink:PhenotypicFeature"] }
      },
      "edges": {
        "e0": { "subject": "n0", "object": "n1" },
        "e1": { "subject": "n1", "object": "n2" }
      }
    },
    "knowledge_graph": { "nodes": {}, "edges": {} },
    "results": []
  },
  "status": 500,
  "description": "Error: Max number of entities exceeded (1000) in 'e1'"
}
```
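A hedged sketch of how a handler might assemble a response shaped like the one above (the body shape mirrors the example response, but the helper function itself is illustrative, not BTE's actual code):

```javascript
// Illustrative helper: wrap an entity-cap error in a TRAPI-shaped body
// that echoes the query graph, with an empty knowledge graph and no
// results, plus the status code and description fields shown above.
function entityCapErrorResponse(queryGraph, err) {
  return {
    message: {
      query_graph: queryGraph,
      knowledge_graph: { nodes: {}, edges: {} },
      results: [],
    },
    status: 500,
    description: `Error: ${err.message}`,
  };
}
```

A server would serialize this object as the JSON response body when the cap check throws.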

Example 2

Other queries that correctly trigger the exception are any Workflow B.1 queries with an e03 predict edge, since the number of genes is too large to use as input to another step. Note that such a query is likely to fail at an earlier edge if the full API list is used...

An example of a query that fails is Demo B.1 [Link](https://github.com/NCATSTranslator/minihackathons/blob/main/2021-12_demo/workflowB/B.1_DILI-three-hop.json):

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": { "ids": ["MONDO:0005359", "SNOMEDCT:197354009"], "categories": ["biolink:DiseaseOrPhenotypicFeature"] },
        "n1": { "categories": ["biolink:DiseaseOrPhenotypicFeature"] },
        "n2": { "categories": ["biolink:Gene"] },
        "n3": { "categories": ["biolink:Drug"] }
      },
      "edges": {
        "e01": { "subject": "n0", "object": "n1", "predicates": ["biolink:has_real_world_evidence_of_association_with"] },
        "e02": { "subject": "n2", "object": "n1", "predicates": ["biolink:gene_associated_with_condition"] },
        "e03": { "subject": "n3", "object": "n2", "predicates": ["biolink:affects"] }
      }
    }
  }
}
```

A related query to B.1 would previously crash our programs because the computer/server would run out of memory. It now correctly fails...

The query:

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": { "ids": ["MONDO:0005359", "SNOMEDCT:197354009"], "categories": ["biolink:DiseaseOrPhenotypicFeature"] },
        "n1": { "categories": ["biolink:DiseaseOrPhenotypicFeature"] },
        "n2": { "categories": ["biolink:Gene"] },
        "n3": { "categories": ["biolink:Drug", "biolink:SmallMolecule"] }
      },
      "edges": {
        "e01": { "subject": "n0", "object": "n1", "predicates": ["biolink:has_real_world_evidence_of_association_with"] },
        "e02": { "subject": "n2", "object": "n1", "predicates": ["biolink:gene_associated_with_condition"] },
        "e03": { "subject": "n3", "object": "n2", "predicates": ["biolink:affects", "biolink:interacts_with"] }
      }
    }
  }
}
```

Example 3

This query seems to run fully (it doesn't hit the error). I believe that's correct because of the filtering-down that happens with intersections (Explain-style).

The query:

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": { "ids": ["CHEMBL.COMPOUND:CHEMBL1431"], "categories": ["biolink:SmallMolecule"] },
        "n1": { "categories": ["biolink:Protein"] },
        "n2": { "categories": ["biolink:Protein"] },
        "n3": { "ids": ["UniProtKB:P02794", "UniProtKB:P02792"], "categories": ["biolink:Protein"] }
      },
      "edges": {
        "e0": { "subject": "n0", "object": "n1" },
        "e1": { "subject": "n1", "object": "n2" },
        "e2": { "subject": "n2", "object": "n3" }
      }
    }
  }
}
```
colleenXu commented 2 years ago

Perhaps we could fail earlier in the process. Sometimes, before the failure point, BTE takes a while on ID resolution because there are >60000 IDs to send to the ID resolver; see https://github.com/biothings/BioThings_Explorer_TRAPI/issues/338#issuecomment-954466062.

What do you think, @andrewsu @newgene ?

colleenXu commented 2 years ago

Closing this.

After discussion with Andrew, I'll clarify #338 and we'll see how things progress. If needed, we could add a cap related to ID resolution (BTE would return a failure), set as a multiplier of this entity cap, e.g. 10,000 = 10 * the 1,000 entity cap...
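The multiplier idea above could be sketched like this (the function name, the `RESOLVER_MULTIPLIER` constant, and where the check runs are all hypothetical; only the 10 * 1,000 arithmetic comes from the discussion):

```javascript
const ENTITY_MAX = 1000;        // entity cap discussed in this issue
const RESOLVER_MULTIPLIER = 10; // hypothetical multiplier for ID resolution

// Fail before ID resolution if the batch exceeds 10x the entity cap
// (i.e. 10,000 IDs), instead of spending a long time resolving >60000 IDs
// for a query that will be rejected anyway.
function assertResolvableBatch(ids) {
  const cap = ENTITY_MAX * RESOLVER_MULTIPLIER;
  if (ids.length > cap) {
    throw new Error(
      `Too many IDs to resolve (${ids.length} > ${cap}); aborting query`
    );
  }
}
```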