elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Profiling] Can we reduce server-side processing time for flamegraphs? #176208

Open danielmitterdorfer opened 9 months ago

danielmitterdorfer commented 9 months ago

Kibana version: 8.12.0

Elasticsearch version: 8.12.0

Server OS version: Linux (ESS)

Browser version: N/A (issue is browser-independent)

Browser OS version: N/A (issue is browser-independent)

Original install method (e.g. download page, yum, from source, etc.): ESS

Describe the feature:

When we render a flamegraph in Universal Profiling, the Elasticsearch API already sends the response in a format that is suitable for rendering directly in the browser. Looking at APM, we can see that we still spend some time in the Kibana server even though it basically "only" needs to pass the response from Elasticsearch through to the browser as-is. In the example APM trace below, that time is around 2 seconds (the white gap after POST /_profiling/flamegraph ends):

[screenshot: apm-timeline-flamegraph]

If we profile what the Kibana server is doing, it seems that it deserializes the response from Elasticsearch and then serializes it again. However, I believe we can eliminate that step, because the response should already be in the exact format that the browser expects (if it is not, I propose we adapt the ES API). Below is an annotated flamegraph that shows where the Kibana server spends time serializing and deserializing the response from Elasticsearch:

[screenshot: flamegraph-kibana]

elasticmachine commented 9 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

thomasneirynck commented 7 months ago

EDIT: for more up-to-date feedback, see https://github.com/elastic/kibana/issues/176208#issuecomment-2035689024


This is some initial feedback, purely based on a code audit and some offline discussion with @danielmitterdorfer.


The flamegraph server-side code in Kibana uses the ES client:

https://github.com/elastic/kibana/blob/72a377d5b2927d75537838b7077e7bd1e340a20f/x-pack/plugins/observability_solution/profiling_data_access/server/utils/create_profiling_es_client.ts#L126-L149

It does not specify the asStream option:

https://github.com/elastic/elasticsearch-js/blob/4aa00e03e1fac5385387dfa7cfd6e811d5307028/docs/examples/asStream.asciidoc

I am pretty sure that when this is not set to true, the client will unpack the result.

When it is set to true, you can stream the result back through the Kibana server.

To do this, you must ensure the content-* headers match the output from Elasticsearch.
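
For reference, here is a minimal sketch of the asStream behavior (hypothetical index and query; the same option applies to any client call, including transport.request):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function searchAsStream() {
  // With asStream the client skips deserialization and returns the raw HTTP
  // body as a Node.js Readable; meta: true also exposes the response headers.
  const result = await client.search(
    { index: 'my-index', query: { match_all: {} } },
    { asStream: true, meta: true }
  );
  // result.body: a Readable with the raw (possibly still compressed) bytes
  // result.headers: the content-* headers to mirror on the Kibana response
  return result;
}
```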


For an example in the current Kibana codebase, please check the tile routes:

set asStream = true

https://github.com/elastic/kibana/blob/b22dd68d39a1770e5a1989709bd498c7ffb14f1c/x-pack/plugins/maps/server/mvt/mvt_routes.ts#L224

(optionally, you can gzip here)

pass the raw response through the Kibana server response object; ensure the content-* headers match

https://github.com/elastic/kibana/blob/b22dd68d39a1770e5a1989709bd498c7ffb14f1c/x-pack/plugins/maps/server/mvt/mvt_routes.ts#L258-L272

note: tileStream is the raw ES-response body

https://github.com/elastic/kibana/blob/b22dd68d39a1770e5a1989709bd498c7ffb14f1c/x-pack/plugins/maps/server/mvt/mvt_routes.ts#L230
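
Put together, the pattern looks roughly like this (a paraphrased sketch, not the exact mvt_routes code):

```ts
import type { KibanaResponseFactory } from '@kbn/core/server';
import type { Readable } from 'stream';

function streamRawEsBody(
  response: KibanaResponseFactory,
  esBody: Readable,
  esHeaders: Record<string, string | string[] | undefined>
) {
  return response.ok({
    body: esBody,
    headers: {
      'content-type': String(esHeaders['content-type'] ?? 'application/json'),
      // Forward the encoding only if the bytes are passed through untouched.
      ...(esHeaders['content-encoding']
        ? { 'content-encoding': String(esHeaders['content-encoding']) }
        : {}),
    },
  });
}
```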


So, again, just from a code audit, I think we should make two changes:

use asStream in the ES client here

https://github.com/elastic/kibana/blob/72a377d5b2927d75537838b7077e7bd1e340a20f/x-pack/plugins/observability_solution/profiling_data_access/server/utils/create_profiling_es_client.ts#L146-L147

pass the raw resp.body, but match the content-* headers

(presumably, this would be here:)

https://github.com/elastic/kibana/blob/f67afe2866014cfbc21755134950d7abaa8cf8c4/packages/kbn-server-route-repository/src/register_routes.ts#L88

thomasneirynck commented 7 months ago

Following up on my earlier comment.

Problem: the Elasticsearch client re-encodes the response.

The ES client parses the response and re-encodes it. See here:

https://github.com/elastic/elastic-transport-js/blob/main/src/Transport.ts#L525
https://github.com/elastic/elastic-transport-js/blob/main/src/Serializer.ts#L62

Resolution: use asStream

The ideal solution would be to use the asStream option to stream the result back.

Add an asStream option here: https://github.com/elastic/kibana/blob/72a377d5b2927d75537838b7077e7bd1e340a20f/x-pack/plugins/observability_solution/profiling_data_access/server/utils/create_profiling_es_client.ts#L146-L149

In order for this to work, the ES response handling needs to change (see below).
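
Concretely, a sketch of that change, assuming the wrapper keeps using transport.request (the exact request body shape is elided here):

```ts
import type { ElasticsearchClient } from '@kbn/core/server';

async function profilingFlamegraphAsStream(
  esClient: ElasticsearchClient,
  body: Record<string, unknown>
) {
  // asStream skips the parse/re-encode step; meta exposes the headers so the
  // route handler can mirror them on the Kibana response.
  return await esClient.transport.request(
    { method: 'POST', path: encodeURI('/_profiling/flamegraph'), body },
    { asStream: true, meta: true }
  );
}
```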

is totalSeconds really needed?

https://github.com/elastic/kibana/blob/72a377d5b2927d75537838b7077e7bd1e340a20f/x-pack/plugins/observability_solution/profiling_data_access/server/services/fetch_flamechart/index.ts#L75

This adds a new property, totalSeconds, wrapping the ES payload in a new response object. This would prevent streaming the result back.

However, it is only used for tooltips, and it seems the value is optional there (?):

https://github.com/elastic/kibana/blob/b3064ba67c25e02639a81ed8581c4fdda43e02e3/x-pack/plugins/observability_solution/profiling/public/components/flamegraph/flamegraph_tooltip.tsx#L85

Ensure the response object has the correct content-* headers

Ensure content-* encodings are transferred from the ES-response headers.

https://github.com/elastic/kibana/blob/72a377d5b2927d75537838b7077e7bd1e340a20f/x-pack/plugins/observability_solution/profiling/server/routes/flamechart.ts#L66

Adjust the client in the browser.

The client currently expects plain JSON:

https://github.com/elastic/kibana/blob/72a377d5b2927d75537838b7077e7bd1e340a20f/x-pack/plugins/observability_solution/profiling/public/services.ts#L114

This will need to be adjusted to handle a stream.
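
As a hedged sketch of the browser side (hypothetical path and params; assumes the payload stays JSON on the wire): plain fetch would still work, because the browser decodes content-encoding transparently and json() consumes the streamed body once it is complete.

```ts
async function fetchFlamegraph(params: { timeFrom: number; timeTo: number }) {
  const res = await fetch('/internal/profiling/flamechart', {
    method: 'POST',
    headers: { 'content-type': 'application/json', 'kbn-xsrf': 'true' },
    body: JSON.stringify(params),
  });
  if (!res.ok) {
    throw new Error(`flamegraph request failed: ${res.status}`);
  }
  // The browser transparently decodes gzip/br; json() parses the full body.
  return await res.json();
}
```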


^ This is a broad investigation. I may have missed some intricacies, so it would be good to get some additional feedback from the Obs team on this as well.

rockdaboot commented 7 months ago

Thanks for the investigation @thomasneirynck and @danielmitterdorfer!

Ensure content-* encodings are transferred from the ES-response headers.

The ES response content is gzip-compressed, while the Kibana response is br (brotli) compressed. We did this to reduce the content size by ~50% and to reduce the transfer time to the Kibana client side. See https://github.com/elastic/kibana/pull/142334

(From what I remember, in my hotel room with a slow network I had to wait ~50s for a gzip-compressed flamegraph and only ~25s when brotli-compressed. But FYI, the compression ratio also depends on the payload, and the response content format has changed since then.)

Another caveat with simply transferring the headers is that you cannot assume that the browser (or client) negotiates exactly the same content-* fields with the Kibana server as the Kibana server negotiates with ES. Sure enough, most browsers are lenient enough to accept a gzip content-encoding even when brotli has been negotiated. But "most" is not "all", and we possibly don't want to run into support issues here.

So I would vote for re-compression in case the content-encoding isn't the same on both sides. This is a simple decompression and re-compression without any parsing, and it should be fast enough.
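
A minimal sketch of that re-compression, assuming Node.js zlib streams and no parsing of the payload in between:

```ts
import { createGunzip, createBrotliCompress } from 'zlib';
import type { Readable } from 'stream';

// Transcode a gzip-compressed ES body to brotli byte-for-byte; the JSON is
// never deserialized. The route would then set content-encoding: br.
function gzipToBrotli(esBody: Readable): Readable {
  return esBody.pipe(createGunzip()).pipe(createBrotliCompress());
}
```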

A possible future follow-up could be to allow brotli and/or zstd on the ES side (zstd has been included in ES recently, so that's just a matter of time/effort until we can use it).

rockdaboot commented 7 months ago

is totalSeconds really needed?

The totalSeconds value is derived from user settings (UI) only: it is timeTo - timeFrom, both of which are request params. So it can be calculated on the client side as well (alternatively, it could be added to the flamegraph response if that is a better option).
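
The client-side equivalent would then be trivial, e.g. (hypothetical helper; assumes both params are epoch seconds, matching the existing request params):

```ts
function getTotalSeconds(timeFrom: number, timeTo: number): number {
  // Same derivation the server does today: the length of the selected range.
  return timeTo - timeFrom;
}
```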

roshan-elastic commented 3 months ago

Hi all,

Thanks for raising this. Is anyone able to summarise the kind of speeds we're seeing and the benefit we could gain?

Trying to get a sense of the impact on customers.

danielmitterdorfer commented 3 months ago

Thanks for raising this. Is anyone able to summarise the kind of speeds we're seeing and the benefit we could gain?

Hey @roshan-elastic! Sure: Kibana adds around 2 seconds of overhead, so in the scenario that I tested we would reduce the response time from roughly 6 seconds to roughly 4 seconds.

roshan-elastic commented 3 months ago

Thanks @danielmitterdorfer

roshan-elastic commented 3 months ago

FYI, I've added this as a top-20 issue in the backlog.