biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

how to refine a two-hop query that explodes on the first edge? #493

Open andrewsu opened 2 years ago

andrewsu commented 2 years ago

I executed the following two-hop query. The number of entities exceeds our cap after executing the first edge, and BTE returns an essentially empty result (no results, no KG). We should consider returning some partial results so that the user can adjust the query (by adding predicates, for example) and resubmit it successfully. Desired behavior needs some discussion...

(I would submit an ARS link, but I'm having issues running queries at the moment? Could be something unrelated to this specific issue?)

{
    "message": {
        "query_graph": {
            "edges": {
                "e0": {
                    "object": "n1",
                    "subject": "n0"
                },
                "e1": {
                    "object": "n2",
                    "subject": "n1"
                }
            },
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "name": "disease or disorder"
                },
                "n1": {
                    "categories": [
                        "biolink:Protein",
                        "biolink:Gene"
                    ],
                    "name": "Protein"
                },
                "n2": {
                    "categories": ["biolink:Disease"],
                    "ids": ["MONDO:0005083"],
                    "name": "psoriasis"
                }
            }
        }
    }
}
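As a concrete illustration of the "adding predicates" refinement mentioned above, a small helper could constrain the exploding first edge before resubmitting. This is a sketch only; the predicate chosen below is illustrative, not a recommendation:

```python
import copy

def add_predicates(query: dict, edge_id: str, predicates: list) -> dict:
    """Return a copy of a TRAPI query with predicates added to one edge."""
    refined = copy.deepcopy(query)
    refined["message"]["query_graph"]["edges"][edge_id]["predicates"] = predicates
    return refined

query = {
    "message": {
        "query_graph": {
            "edges": {
                "e0": {"subject": "n0", "object": "n1"},
                "e1": {"subject": "n1", "object": "n2"},
            },
            "nodes": {
                "n0": {"categories": ["biolink:Disease"], "name": "disease or disorder"},
                "n1": {"categories": ["biolink:Protein", "biolink:Gene"], "name": "Protein"},
                "n2": {"categories": ["biolink:Disease"], "ids": ["MONDO:0005083"], "name": "psoriasis"},
            },
        }
    }
}

# Constrain the first hop so it returns fewer intermediate entities
# (illustrative predicate choice).
refined = add_predicates(query, "e0", ["biolink:condition_associated_with_gene"])
```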
colleenXu commented 2 years ago

This was the intended behavior regarding #324. The TRAPI logs usually note that there were too many entities after the first hop to continue.

Perhaps we could return a different status code (not 200) to make it clearer that there was an issue?
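Until the status-code question is settled, a client has to detect this case by scanning the TRAPI logs, since the HTTP status is 200 either way. A minimal sketch, with the log wording based on the error message quoted later in this thread:

```python
def hit_entity_cap(trapi_response: dict) -> bool:
    """Heuristic: scan TRAPI logs for the entity-cap message."""
    logs = trapi_response.get("logs", [])
    return any("Max number of entities exceeded" in entry.get("message", "")
               for entry in logs)

# Illustrative response shape: empty results, cap error in the logs
response = {
    "message": {"results": [], "knowledge_graph": {"nodes": {}, "edges": {}}},
    "logs": [{"level": "ERROR",
              "message": "Error: Max number of entities exceeded (1000) in 'e0'"}],
}
```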

colleenXu commented 2 years ago

I'm not sure about returning results, because we wouldn't have completed the query graph: the records available after the first hop wouldn't fully map onto the query graph or provide the answers it asks for...

colleenXu commented 2 years ago

More info:

The API response: response.txt (attached)

Console logs: (screenshot attached, 2022-08-18)

andrewsu commented 2 years ago

@colleenXu What do you think about returning the message.knowledge_graph portion of the response with the results of the first hop? The message.results section would still be empty.
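The shape of such a partial response might look like the following sketch: message.knowledge_graph carries the first-hop subgraph while message.results stays empty, with a warning log explaining why. All node/edge identifiers here are illustrative:

```python
def partial_response(first_hop_kg: dict, warning: str) -> dict:
    """Build a TRAPI response returning the first-hop KG but no results."""
    return {
        "message": {
            "query_graph": {},          # echoed query graph would go here
            "knowledge_graph": first_hop_kg,
            "results": [],              # empty: the query graph was not completed
        },
        "logs": [{"level": "WARNING", "message": warning}],
    }

# Illustrative first-hop subgraph
kg = {
    "nodes": {"MONDO:0005083": {"categories": ["biolink:Disease"]},
              "NCBIGene:3586": {"categories": ["biolink:Gene"]}},
    "edges": {"e0-1": {"subject": "MONDO:0005083", "object": "NCBIGene:3586",
                       "predicate": "biolink:condition_associated_with_gene"}},
}
resp = partial_response(
    kg, "Max number of entities exceeded (1000) in 'e0'; returning first-hop KG only")
```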

colleenXu commented 2 years ago

First step: run the first hop and see which predicates come back. That'll help us tell whether a log listing the predicates in the executed hop(s) would be useful or not.

Remember that this is just a two-hop query, but this can happen in longer linear queries as well...

tokebe commented 1 year ago

In discussion with @colleenXu, two things to change:

andrewsu commented 1 year ago

Revisiting this issue... I ran the two-hop query above through the ARS: https://arax.ci.transltr.io/?r=cbc0e82e-8397-4293-b11c-00e40859169a. (EDIT: this link actually corresponds to the query in the related issue #330 on Fanconi anemia, not the psoriasis query above.) As designed, it returns zero results with the following error message:

Error: Max number of entities exceeded (1000) in 'e02'

The one-hop query for e01 indeed returns 1022 results: https://arax.ci.transltr.io/?r=65737549-f327-4ff7-9006-9d0ab4daf236. The validator (results injected by the ARS) returns some useful stats -- we should consider returning this info directly in the logs (as suggested in the comment above):

  "validation_result": {
    "message": "There were validator errors",
    "n_edges": 2054,
    "n_nodes": 1044,
    "provenance_summary": {
      "n_sources": 26,
      "predicate_counts": {
        "biolink:affected_by": 2,
        "biolink:caused_by": 60,
        "biolink:condition_associated_with_gene": 379,
        "biolink:contribution_from": 928,
        "biolink:occurs_together_in_literature_with": 483,
        "biolink:related_to": 181,
        "biolink:subclass_of": 21
      },
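If BTE were to surface a similar predicate summary in its own logs, the counts could be computed directly from the knowledge graph. A sketch assuming the standard TRAPI knowledge_graph layout, with made-up edges:

```python
from collections import Counter

def predicate_counts(knowledge_graph: dict) -> Counter:
    """Count edges per predicate in a TRAPI knowledge_graph."""
    return Counter(edge["predicate"]
                   for edge in knowledge_graph.get("edges", {}).values())

# Illustrative knowledge graph
kg = {"edges": {
    "e1": {"predicate": "biolink:caused_by"},
    "e2": {"predicate": "biolink:condition_associated_with_gene"},
    "e3": {"predicate": "biolink:condition_associated_with_gene"},
}}
counts = predicate_counts(kg)
# A log line could then summarize, e.g.:
# "Hop e0 predicates: biolink:condition_associated_with_gene (2), biolink:caused_by (1)"
```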

In addition, with the completion of the scoring overhaul in #634, the ranking of the 1022 results actually looks pretty good. So in the case where e01 retrieves more than our limit of entities, should we simply calculate scores for all intermediate answers, trim to the max allowed entities, and then continue with e02? Of course we'd want to return some sort of warning, but my guess is that this must be what the other ARAs are doing in response to the two-hop query above...
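The score-and-trim approach could be sketched as: rank all intermediate entities after the exploding hop, keep the top max_entities, and continue to the next edge. The scoring below is a placeholder for BTE's actual ranking, and the IDs are synthetic:

```python
def trim_to_cap(scored_entities: dict, max_entities: int) -> dict:
    """Keep only the top-scoring intermediate entities, up to the cap."""
    top = sorted(scored_entities.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:max_entities])

# 1022 intermediate entities with placeholder scores (descending in i)
intermediates = {f"NCBIGene:{i}": 1.0 / (i + 1) for i in range(1022)}
trimmed = trim_to_cap(intermediates, max_entities=1000)
# Execute the next hop with only the best-scoring intermediates,
# and attach a warning log noting that trimming occurred.
```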

andrewsu commented 1 year ago

There's also a separate question of prioritization. Does (or can) Sentry track how often we hit this type of limit?

tokebe commented 1 year ago

Sentry unfortunately doesn't seem to provide a solid way of searching for specific errors, so it's hard to track the frequency of specific kinds of failures.

Anecdotally, we see this kind of error relatively frequently in the queues for sync queries to the /v1/team/Service Provider/query endpoint.