Search API response time breakdown

PhaedrusTheGreek commented 8 years ago

There is currently no easy way to troubleshoot which phase of an API request is taking what amount of time.

It is possible to see a took time of a few milliseconds, but the actual API response takes 60 seconds. This can happen when there are network problems between nodes, and also in the case when integrated components such as AD or LDAP are slow to respond, or are timing out. In some cases there is no logging or other way to identify slow phases.

It would be awesome if we could do:

GET /index/_search?vtook

{
   "vtook": {
       "total_api" : 60000
       "security/ldap_realm" : 59980
       "total_query": 20
       "fetch" : 10
       "collate": 10
       "threadpool_queue" : 0
    }
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1000000,
      "max_score": 1,
      "hits": [
        ...
      ]
   }
}

jpountz commented 8 years ago

If we do this, I think this would belong to the profile API.

evanvolgas commented 7 years ago

The Profile API makes sense but it would also be pretty useful if you could log this and do reporting on it, not just profile one-off cases. For example, maybe there could be a sample threshold in which every Nth query gets profiled with vtook and logged. That sample threshold parameter would be quite handy in general. To accomplish that now, I have to use tools like goreplay and rewrite the requests that I want to sample and profile.

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

polyfractal commented 6 years ago

Chatted about this in Fixit ~Friday~ Thursday. Adding some of the extra search phases (fetch, aggregation reduce, etc) wouldn't be too bad, just needs some work to implement.

Adding non-search phases like security or general network stages would probably need a dedicated framework to hook into phases of a request's lifecycle. The current search profiler is pretty hardwired to just work with Search... extending it would likely be more hassle than just sitting down to work on an overarching framework.

@PhaedrusTheGreek Was the interest more in the non-search phases (security, queuing, etc), adding the additional search phases (fetch, agg reduce, etc) or all the above??

PhaedrusTheGreek commented 6 years ago

@polyfractal this matter has come up again today, in regards to wanting to understand how much time a query spends in the queue. Previously, the matter was raised by a ~1m LDAP authentication delay.

PhaedrusTheGreek commented 5 years ago

@polyfractal - adding a network stage metric could be used to assist/resolve this matter, closed in favour of https://github.com/elastic/elasticsearch/issues/36127

{
   "vtook": {
       "total" : 1024,
       "network" : 1020,
       "security" : 0,
       "query": 4,
       "fetch" : 2,
       "collate": 2,
       "threadpool_queue" : 0
    }
}

I wonder if something like this could be added to indices stats also, if so it would tie into monitoring well

polyfractal commented 5 years ago

We're still unfortunately at where https://github.com/elastic/elasticsearch/issues/21073#issuecomment-406415210 left off... the profiler framework is purely in the Search Phase and would need some fairly substantial overhaul to take into account other things like network. :(

I still think this is an enhancement that the Profiler should get, it'd be very useful (and also related to https://github.com/elastic/elasticsearch/issues/23114 which could provide some of this data too)

That said, the new Adaptive Replica Selection sorta provides these details, albeit in a more roundabout manner.

E.g. ARS tracks the overall latencies between nodes so that it can choose better replicas for search.

"adaptive_selection" : {
        "<nodeID>" : {
          "outgoing_searches" : 0,
          "avg_queue_size" : 0,
          "avg_service_time" : "342.5ms",
          "avg_service_time_ns" : 342511123,
          "avg_response_time" : "1.2ms",
          "avg_response_time_ns" : 1256317,
          "rank" : "1.3"
        },
}

Which essentially represents the node-to-node service and response time from a 10,000ft view. Doesn't help with individual queries, but might provide some insight into the effect of network (and load) on the cluster.

PhaedrusTheGreek commented 5 years ago

Another option to track the problem of unresponsive authentication services could be to track timeouts/errors in some counter instead of having to rely on finding errors in the logs.

kaliljoao commented 4 years ago

Hi guys, can I work on this issue?

PhaedrusTheGreek commented 3 years ago

Surfacing this information could potentially be solved with APM via https://github.com/elastic/apm-agent-java/issues/1561

javanna commented 2 years ago

I believe that what will be implemented as part of #84369 will help solving this issue. We'll have to evaluate whether more search specific info needs to be exposed as a follow-up.

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elastic / elasticsearch

Search API response time breakdown #21073