elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine

https://www.elastic.co/products/elasticsearch

Overall Search Latency Per Query? #42788

Closed atris closed 2 months ago

atris commented 5 years ago

I was looking for a metric that reports the overall latency of a query. By overall, I mean the latency measured from the moment the query starts executing on the coordinating node (not counting queue time) until all QueryPhases have responded.

Is there something that exists today, and/or a plan to add something along these lines?

martijnvg commented 5 years ago

The nodes / indices stats apis and _cat apis do report the total time spent and number of executions for the query phase (query_time / query_total) and fetch phases (fetch_time / fetch_total). Is this what you're referring to?
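As a rough illustration of how those counters can be turned into an average, here is a minimal Python sketch. It assumes the JSON shape returned by the indices stats API, where each index has a total.search section with query_total, query_time_in_millis, fetch_total and fetch_time_in_millis. The function names are hypothetical, and the input is a plain dict so no client library is needed.

```python
def _safe_div(numerator, denominator):
    # Avoid ZeroDivisionError for indices that have not served any searches.
    return numerator / denominator if denominator else None


def average_phase_times(stats):
    """Return {index: {"query_ms": avg, "fetch_ms": avg}}.

    `stats` is assumed to be the parsed JSON body of an indices stats
    response. The counters are per-shard phase executions, so the result
    is an average per shard-level phase, not per search request.
    """
    result = {}
    for name, body in stats.get("indices", {}).items():
        search = body["total"]["search"]
        result[name] = {
            "query_ms": _safe_div(search["query_time_in_millis"], search["query_total"]),
            "fetch_ms": _safe_div(search["fetch_time_in_millis"], search["fetch_total"]),
        }
    return result
```

Note that this yields time per shard-level phase execution, which is not the same thing as end-to-end latency per request.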

elasticmachine commented 5 years ago

Pinging @elastic/es-core-features

atris commented 5 years ago

@martijnvg That would not give me a clear picture of the average time a query is spending in the system (right from when it lands in the coordinating node to when it gets the final response). I was looking for a more e-2-e metric that is calculated per query.

martijnvg commented 5 years ago

(right from when it lands in the coordinating node to when it gets the final response)

The search API does return a took field in the response, which is the time measured from when the coordinating node received the request until the final response is returned. This is per search request.
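Since took is only available per response, one client-side option is to collect it yourself and aggregate. The sketch below is only an illustration: it assumes each response has already been parsed into a dict with a took value in milliseconds, and summarize_took is a hypothetical helper, not part of any Elasticsearch client.

```python
import math
import statistics


def summarize_took(responses):
    """Summarize the `took` values (milliseconds) of parsed search responses."""
    took = sorted(r["took"] for r in responses)
    if not took:
        return None

    def percentile(p):
        # Nearest-rank percentile: the smallest value with at least p% of
        # the samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(took)))
        return took[rank - 1]

    return {
        "count": len(took),
        "mean_ms": statistics.fmean(took),
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }
```

Filtering the collected responses by the indices each request targeted would give a per-index view, with the caveat that a multi-index request counts toward every index it touches.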

atris commented 5 years ago

Is there a way to see that value from an index perspective? I am trying to quantify the impact of a potential server change, hence the question.

martijnvg commented 5 years ago

No, the took field is not tracked. Moreover, tracking it per index is not really possible, because a search request can span multiple indices.

The overall query and fetch phase stats I mentioned in my previous comment are tracked per shard (and thus per index, via the indices stats API), and I think they are worth monitoring to see whether a particular change has an effect. Typically the query phase is the heaviest phase of a search request, so changes in query phase time will have an impact on overall search service time.
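Because these counters are cumulative, comparing a change usually means taking two snapshots and looking at the difference. A minimal sketch, assuming each snapshot is the search section of the stats for one index (the function name is hypothetical):

```python
def interval_average_query_ms(before, after):
    """Average query phase time (ms) between two stats snapshots.

    `before` and `after` are assumed to be dicts holding the cumulative
    `query_total` and `query_time_in_millis` counters for the same index.
    Returns None if no query phases ran in the interval, or if the
    counters went backwards (for example after a shard relocation).
    """
    delta_total = after["query_total"] - before["query_total"]
    delta_time = after["query_time_in_millis"] - before["query_time_in_millis"]
    if delta_total <= 0 or delta_time < 0:
        return None
    return delta_time / delta_total
```

Running this over an interval before the change and an interval after it gives two numbers that can be compared directly, provided the workload is similar in both intervals.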

atris commented 5 years ago

Agreed, the query phase time tracking is useful and does reflect the expected patterns for the change. However, it does not always translate directly into customer-facing latency changes, hence my quest for a metric that can accurately answer that question.

dakrone commented 5 years ago

@atris I think there might be a way you can get this information. We store these stats as an exponentially weighted moving average as part of the adaptive replica selection stats in the nodes stats output, see: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/cluster-nodes-stats.html#adaptive-selection-stats

Since this has both the service and response times, I think it should work for your use case.
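For readers unfamiliar with the term, an exponentially weighted moving average gives recent samples more weight than older ones. The sketch below shows the general technique only; the class name and the alpha value are illustrative assumptions and are not taken from the Elasticsearch source.

```python
class Ewma:
    """Exponentially weighted moving average of a stream of samples."""

    def __init__(self, alpha, initial):
        # alpha is the weight of the newest sample; it must be in (0, 1].
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = float(initial)

    def add(self, sample):
        # new_average = alpha * sample + (1 - alpha) * old_average
        self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value
```

One consequence is that a single slow request shifts the average noticeably and then decays over subsequent samples, so the value tracks recent behaviour rather than a long-run mean.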

wangkhc commented 4 years ago

@dakrone I think collecting overall search latency per query makes sense in many scenarios. For example, a query may involve heavy aggregation work on the coordinating node, which is not recorded in the stats but is useful for performance analysis. I'm already working on these overall coordinating node stats. If there is interest, I can contribute this feature to the ES community.

larytet commented 3 years ago

I am looking for an end-to-end monitor for Elasticsearch as well. My goal is to measure the time from adding a new document to ES until that document is available for search on a different node. I am thinking of writing a dedicated service to monitor ES, but I hope there is something I can reuse.

The idea is:

One pod (producer) wakes up and stores a small object in ES every second or so. Each object contains an exact timestamp and the pod name. The key is the timestamp rounded to the nearest 1s.
The producer removes entries older than 5 minutes from ES.
Another pod (consumer) searches for the keys every 1s.
An alert fires if a key is not available for search for longer than 5s.
The consumer stops checking a key after the 5s timeout.
The consumer is expected to search for more than one key in parallel.
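The consumer side of the idea above could be sketched roughly as follows. This is only an outline under stated assumptions: is_searchable stands in for whatever search call the consumer makes against ES, and make_key and wait_until_searchable are hypothetical names. The clock and sleep functions are injectable so the logic can be exercised without a cluster.

```python
import time


def make_key(timestamp):
    """Build the document key: the timestamp rounded to the nearest second."""
    return str(round(timestamp))


def wait_until_searchable(is_searchable, key, written_at, timeout=5.0,
                          poll=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until `key` is searchable and return the observed latency in seconds.

    `is_searchable(key)` is a placeholder for the consumer's search request.
    `written_at` must come from the same clock as `clock`.
    Returns None if the key is still missing after `timeout` seconds, which
    is the condition that would raise the alert.
    """
    deadline = written_at + timeout
    while True:
        found = is_searchable(key)
        now = clock()
        if found:
            return now - written_at
        if now >= deadline:
            return None
        sleep(poll)
```

With a 1s poll interval the measured latency has roughly 1s resolution, and checking several keys in parallel would mean running this function on separate threads or tasks. Across two pods the producer and consumer clocks also need to be synchronized for the latency to be meaningful.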

dakrone commented 2 months ago

This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.