[Lens] Add EBT-metric that tracks time spent waiting for ES-request

thomasneirynck commented 3 months ago

Describe the feature:

The current time_to_data metric (https://github.com/elastic/kibana/blob/e12c7e37d41788d65b49e152e925636725d16e5e/x-pack/plugins/lens/public/editor_frame_service/editor_frame/workspace_panel/workspace_panel.tsx#L205) is on the dashboard level and does not include a breakdown of time spent in kibana versus elasticsearch.

Should introduce a new EBT-metric that shows how much time spent is waiting for Elasticsearch.

Describe a specific use case for the feature:

With a more fine grained breakdown of time_to_data (for a dashboard with a single chart), we will have a better understanding of the breakdown what happens on client versus what happens on the server.

elasticmachine commented 3 months ago

Pinging @elastic/kibana-visualizations (Team:Visualizations)

thomasneirynck commented 3 months ago

cc for visibility - @ThomThomson

markov00 commented 2 months ago

I'm looking at this and took a look at the latest performance meeting. We seem to not interpret the took property in the ES response. As written here https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#:~:text=The%20took%20time,all%20the%20results.

The took time in the response contains the milliseconds that this request took for processing, beginning quickly after the node received the query, up until all search-related work is done and before the above JSON is returned to the client. This means it includes the time spent waiting in thread pools, executing a distributed search across the whole cluster, and gathering all the results.

So it doesn't contain the ES server response deserialization, and JSON serialization, which could take time. So we can't just use the took and consider it as the time spent by ES but we also need to have their JSON or output format serialization time.

For example, if I use CURL, pushing to /dev/null the response, a took of 35ms results in an average CURL time execution of 100/120 ms. Definitely there is the network latency (even if we are on local machine) but the rest is ES.

curl -o /dev/null -s -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n'  -X POST your query here

Second issue:

the took parameter is returned for each single request, we can have multiple requests per chart. we can't sum up the took time to compute the amount of time spent by ES because its query work is parallelized so one query can be processed within the time frame of another that is taking longer or can also happen after that, if we don't have the starting time of the execution that the took parameter is not enough.
we can also have the "others" request in a subsequent event, usually after receiving the parent's response. In this case, we can sum up the ES took time of the two requests, which is currently non-trivial because I believe we don't propagate already the concept of parent-child requests in our infrastructure.

cc @thomasneirynck @dej611

dej611 commented 2 months ago

I'm looking at this and took a look at the latest performance meeting. We seem to not interpret the took property in the ES response.

We know that, and in the future more will be discerned from the ES client specific events. Here we're talking about removing (some) time spent in ES from the monitored value within Kibana, which would help to understand where the server is actually spending time. I see it like a peeling task, breaking down one layer after another to reach the best precision: the fact is that adding for the moment took to the EBT takes a fraction of what it takes to be more precise.

the took parameter is returned for each single request, we can have multiple requests per chart. we can't sum up the took time to compute the amount of time spent by ES because its query work is parallelized so one query can be processed within the time frame of another that is taking longer or can also happen after that, if we don't have the starting time of the execution that the took parameter is not enough.

I think this is something we can discuss as a follow up: all our Lens telemetry is based on the assumption of a single chart rendering within the editor. Even assuming the same start time of all the queries within a dashboard (approx 1ms) we have the same problem with all the telemetry events about this.

we can also have the "others" request in a subsequent event, usually after receiving the parent's response. In this case, we can sum up the ES took time of the two requests, which is currently non-trivial because I believe we don't propagate already the concept of parent-child requests in our infrastructure.

True, but I think it is possible to enrich our request adapter to make it aware of a relationship between the Other and its parent request.

thomasneirynck commented 2 months ago

I agree with @dej611 overall.

The took-parameter should give us some measure of time spent in Elasticsearch, even if it ignores additional overhead of serialization or network time. I like the peeling analogy here. We're getting deeper at least one more layer, even if there are still others to explore.

wrt:

inability to sum: this is the nature of dashboards and a tough nut to crack (how to interpret time spent in parallel (?)). To be able to do a more conventional "stacking" of metrics, we do have the lens_stress journey, which issues only a single query.
multiple requests (e.g. other bucket)/(and async polling in general for that matter): this is a tough one too, for similar reasons.

I would propose we tackle these separately, since these are difficult dashboard-specific issues.

I like that this PR moves us forward by introducing a well-defined metric. It should give us a reliable indication of how "heavy" the ES-request weigh, even with the understanding we don't have full visibility in all layers. time_to_data still captures the "whole onion", so we should not lose information either.

thomasneirynck commented 1 month ago

Wrt https://github.com/elastic/kibana/pull/192245, I think this only partially captures what we need.

While this new metric is very useful, the context in which it is collected is only the Lens Editor. This is limiting.

We also need to be able to track this metric for all Lens-charts, wherever they are rendered, and not just in the editor.

This is mainly important for Dashboards, as this is where the majority of charts will be created and viewed when in-line editing becomes more prevalent. Most Lens-chart are becoming to be embedded by-value into Dashboards.

All the performance benchmarking journeys are run against Dashboards as well (and not just Lens). We need to be able to collorate this new metric against existing benchmark data.

teresaalvarezsoler commented 1 month ago

Totally agree @thomasneirynck it was never mentioned this was only collecting the metrics in the Lens editor. I think we discussed this was necessary for our CI workflows which track dashboard loading performance. Is this metric going to be able to capture any data at all when integrated in these workflows?

If it's not, can we create another issue where we better describe why we want this and where it's going to be used?

thomasneirynck commented 1 month ago

@teresaalvarezsoler

Is this metric going to be able to capture any data at all when integrated in these workflows?

https://github.com/elastic/kibana/pull/192245 will not do this, but created https://github.com/elastic/kibana/pull/192743. This will track these timestamps anywhere Lens is embedded.

markov00 commented 1 month ago

done in https://github.com/elastic/kibana/pull/192743

elastic / kibana

[Lens] Add EBT-metric that tracks time spent waiting for ES-request #189396