biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

how to refine a two-hop query that explodes on the first edge? #493

Open andrewsu opened 2 years ago

andrewsu commented 2 years ago

I executed the following two-hop query. The number of entities exceeds our cap after executing the first edge, and BTE returns an essentially empty result (no results, no KG). We should consider returning some partial results so that the user can adjust the query (by adding predicates, for example) and resubmit it successfully. Desired behavior needs some discussion...

(I would submit an ARS link, but I'm having issues running queries at the moment? Could be something unrelated to this specific issue?)

{
    "message": {
        "query_graph": {
            "edges": {
                "e0": {
                    "object": "n1",
                    "subject": "n0"
                },
                "e1": {
                    "object": "n2",
                    "subject": "n1"
                }
            },
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "name": "disease or disorder"
                },
                "n1": {
                    "categories": [
                        "biolink:Protein",
                        "biolink:Gene"
                    ],
                    "name": "Protein"
                },
                "n2": {
                    "categories": ["biolink:Disease"],
                    "ids": ["MONDO:0005083"],
                    "name": "psoriasis"
                }
            }
        }
    }
}
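As a concrete illustration of the "adding predicates" refinement mentioned above, a small helper could constrain the exploding first edge before resubmitting. This is a sketch only; the predicate chosen below is illustrative, not a recommendation:

```python
import copy

def add_predicates(query: dict, edge_id: str, predicates: list) -> dict:
    """Return a copy of a TRAPI query with predicates added to one edge."""
    refined = copy.deepcopy(query)
    refined["message"]["query_graph"]["edges"][edge_id]["predicates"] = predicates
    return refined

query = {
    "message": {
        "query_graph": {
            "edges": {
                "e0": {"subject": "n0", "object": "n1"},
                "e1": {"subject": "n1", "object": "n2"},
            },
            "nodes": {
                "n0": {"categories": ["biolink:Disease"], "name": "disease or disorder"},
                "n1": {"categories": ["biolink:Protein", "biolink:Gene"], "name": "Protein"},
                "n2": {"categories": ["biolink:Disease"], "ids": ["MONDO:0005083"], "name": "psoriasis"},
            },
        }
    }
}

# Constrain the first hop so it returns fewer intermediate entities
# (illustrative predicate choice).
refined = add_predicates(query, "e0", ["biolink:condition_associated_with_gene"])
```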
colleenXu commented 2 years ago

This was the intended behavior regarding #324. The TRAPI logs usually note that there were too many entities after the first hop to continue.

Perhaps we could return a different status code (not 200) to make it clearer that there was an issue?
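Until the status-code question is settled, a client has to detect this case by scanning the TRAPI logs, since the HTTP status is 200 either way. A minimal sketch, with the log wording based on the error message quoted later in this thread:

```python
def hit_entity_cap(trapi_response: dict) -> bool:
    """Heuristic: scan TRAPI logs for the entity-cap message."""
    logs = trapi_response.get("logs", [])
    return any("Max number of entities exceeded" in entry.get("message", "")
               for entry in logs)

# Illustrative response shape: empty results, cap error in the logs
response = {
    "message": {"results": [], "knowledge_graph": {"nodes": {}, "edges": {}}},
    "logs": [{"level": "ERROR",
              "message": "Error: Max number of entities exceeded (1000) in 'e0'"}],
}
```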

colleenXu commented 2 years ago

I'm not sure about returning results, because we wouldn't have completed the query graph: the records available after the first hop wouldn't fully map onto the query graph or provide the answers it asks for...

colleenXu commented 2 years ago

More info:

The API response: response.txt (attached)

Console logs: (screenshot attached, 2022-08-18)

andrewsu commented 2 years ago

@colleenXu What do you think about returning the message.knowledge_graph portion of the response with the results of the first hop? The message.results section would still be empty.
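The shape of such a partial response might look like the following sketch: message.knowledge_graph carries the first-hop subgraph while message.results stays empty, with a warning log explaining why. All node/edge identifiers here are illustrative:

```python
def partial_response(first_hop_kg: dict, warning: str) -> dict:
    """Build a TRAPI response returning the first-hop KG but no results."""
    return {
        "message": {
            "query_graph": {},          # echoed query graph would go here
            "knowledge_graph": first_hop_kg,
            "results": [],              # empty: the query graph was not completed
        },
        "logs": [{"level": "WARNING", "message": warning}],
    }

# Illustrative first-hop subgraph
kg = {
    "nodes": {"MONDO:0005083": {"categories": ["biolink:Disease"]},
              "NCBIGene:3586": {"categories": ["biolink:Gene"]}},
    "edges": {"e0-1": {"subject": "MONDO:0005083", "object": "NCBIGene:3586",
                       "predicate": "biolink:condition_associated_with_gene"}},
}
resp = partial_response(
    kg, "Max number of entities exceeded (1000) in 'e0'; returning first-hop KG only")
```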

colleenXu commented 2 years ago

First step: run the first hop and see which predicates come back. That'll help us tell whether a log listing the predicates in the executed hop(s) would be useful or not.

Remember that this is just a two-hop query, but this can happen in longer linear queries as well...

tokebe commented 1 year ago

In discussion with @colleenXu, two things to change:

andrewsu commented 1 year ago

Revisiting this issue... I ran the two-hop query above through the ARS: https://arax.ci.transltr.io/?r=cbc0e82e-8397-4293-b11c-00e40859169a. (EDIT: this link actually corresponds to the query in the related issue #330 on Fanconi anemia, not the psoriasis query above.) As designed, it returns zero results with the following error message:

Error: Max number of entities exceeded (1000) in 'e02'

The one-hop query for e01 indeed returns 1022 results: https://arax.ci.transltr.io/?r=65737549-f327-4ff7-9006-9d0ab4daf236. The validator (results injected by the ARS) returns some useful stats -- we should consider returning this info directly in the logs (as suggested in the comment above):

  "validation_result": {
    "message": "There were validator errors",
    "n_edges": 2054,
    "n_nodes": 1044,
    "provenance_summary": {
      "n_sources": 26,
      "predicate_counts": {
        "biolink:affected_by": 2,
        "biolink:caused_by": 60,
        "biolink:condition_associated_with_gene": 379,
        "biolink:contribution_from": 928,
        "biolink:occurs_together_in_literature_with": 483,
        "biolink:related_to": 181,
        "biolink:subclass_of": 21
      },
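If BTE were to surface a similar predicate summary in its own logs, the counts could be computed directly from the knowledge graph. A sketch assuming the standard TRAPI knowledge_graph layout, with made-up edges:

```python
from collections import Counter

def predicate_counts(knowledge_graph: dict) -> Counter:
    """Count edges per predicate in a TRAPI knowledge_graph."""
    return Counter(edge["predicate"]
                   for edge in knowledge_graph.get("edges", {}).values())

# Illustrative knowledge graph
kg = {"edges": {
    "e1": {"predicate": "biolink:caused_by"},
    "e2": {"predicate": "biolink:condition_associated_with_gene"},
    "e3": {"predicate": "biolink:condition_associated_with_gene"},
}}
counts = predicate_counts(kg)
# A log line could then summarize, e.g.:
# "Hop e0 predicates: biolink:condition_associated_with_gene (2), biolink:caused_by (1)"
```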

In addition, with the completion of the scoring overhaul in #634, the ranking of the 1022 results actually looks pretty good. So in the case where e01 retrieves more than our limit of entities, should we simply calculate scores for all intermediate answers, trim to the max allowed entities, and then continue with e02? Of course we'd want to return some sort of warning, but my guess is that this must be what the other ARAs are doing in response to the two-hop query above...
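The score-and-trim approach could be sketched as: rank all intermediate entities after the exploding hop, keep the top max_entities, and continue to the next edge. The scoring below is a placeholder for BTE's actual ranking, and the IDs are synthetic:

```python
def trim_to_cap(scored_entities: dict, max_entities: int) -> dict:
    """Keep only the top-scoring intermediate entities, up to the cap."""
    top = sorted(scored_entities.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:max_entities])

# 1022 intermediate entities with placeholder scores (descending in i)
intermediates = {f"NCBIGene:{i}": 1.0 / (i + 1) for i in range(1022)}
trimmed = trim_to_cap(intermediates, max_entities=1000)
# Execute the next hop with only the best-scoring intermediates,
# and attach a warning log noting that trimming occurred.
```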

andrewsu commented 1 year ago

There's also a separate question of prioritization. Does (or can) Sentry track how often we hit this type of limit?

tokebe commented 1 year ago

Sentry unfortunately doesn't seem to provide a solid way of searching for specific errors, so it's hard to track the frequency of specific kinds of failures.

Anecdotally, we see this kind of error relatively frequently in the queues for sync queries to the /v1/team/Service Provider/query endpoint.