Enabling compression between Kibana <--> Elasticsearch servers

mshustov commented 2 years ago

Kibana relies on the elasticsearch-js client defaults with compression: false https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/basic-config.html

While enabling compression adds some runtime performance overhead, it might drastically reduce transmission time and increase network bandwidth. We should conduct load testing for Cloud and on-prem instances to decide whether Kibana will set compression: true by default or we expose it as elasticsearch.compression configuration setting.

A side note: it might be interesting to calculate how much it affects the Data transfer reductions initiative on Cloud.

Subtasks

[x] allow to configure compression for the ES client - https://github.com/elastic/kibana/pull/124009
[x] conduct load testing to understand the bandwidth and cpu usage consequences of enabling compression
[ ] decide if we should enable compression by default

elasticmachine commented 2 years ago

Pinging @elastic/kibana-core (Team:Core)

delvedor commented 2 years ago

The v7 client has two compression options:

suggestCompression: if true, asks ES for compressed data via accept-encoding header.
compression: if true, sends compressed data to ES

The v8 client has a single compression option, which does both.

Furthermore, if you enable compression and are using maxResponseSize, remember to configure maxCompressedResponseSize as well.

The client defaults to false unless you are using Elastic Cloud (detected via the cloud id option), in such case, it will enable compression by default as it's recommended with Elastic Cloud.

mshustov commented 2 years ago

The client defaults to false unless you are using Elastic Cloud (detected via the cloud id option), in such case, it will enable compression by default as it's recommended with Elastic Cloud.

Ok, we have to set compression: true explicitly then. Since Kibana doesn't configure the cloud option.

pgayvallet commented 2 years ago

We should conduct load testing for Cloud and on-prem instances to decide whether Kibana will set compression: true by default or we expose it as elasticsearch.compression configuration setting.

We could start by exposing this new elasticsearch.compression config property while preserving the current default value of false, and then, depending on the results of the perf/load testing, decide if we want to switch the default to true later. It would allow customers really wanting/needing this feature to use it asap.

massimobrignoli commented 2 years ago

I agree with this approach. One of our customer with 1000 users in kibana, 3000 dashboards and 4000 queries/minute has tested it and has seen the network traffic dropping by 90% going from 500GB/hour to 50GB/hour.

Of course not all the query will benefit (small queries with a very small result set can be slower) and the CPU usage will increase a bit. But let the users decide.

pgayvallet commented 2 years ago

I opened https://github.com/elastic/kibana/pull/124009 to introduce a new elasticsearch.compression configuration property that will default to false (which is the effective current value) for now.

pgayvallet commented 2 years ago

https://github.com/elastic/kibana/pull/124009 was merged, I updated the issue accordingly and added a subtasks list.

pgayvallet commented 2 years ago

I performed some testing around the performance and bandwidth impact of enabling compression for the Kibana<->ES communications.

Bandwidth

I couldn't find any proper way to monitor the bandwidth usage when running the load tests, so I felt back to using a homebrew proxy between Kibana and ES to monitor the request and response size of some queries while manually navigating within Kibana.

No big surprise here, the gain is what you would expect from enabling compression on any HTTP-based communication. Depending on the size of the request and response, there is a 20% to 90% compression.

I won't list everything here, but for example the /_search response associated with loading the sample data's dashboard from the dashboard listing page get compressed around 85%, and that's only returning 3 dashboard documents.

// without compression
  request length: 263
  response length: 66893

// with compression
  request length: 181
  response length: 10884

Overall, we should be expecting an overall reduction in bandwidth usage when enabling compression of at least 60%, and probably more like 80%, depending on the specific usages of the instance/customer, which is significant.

Performance impact.

I couldn't perform load testing against the whole stack on Cloud, given the elasticsearch.compression is not allow-listed, so I performed two suites of tests:

Local Kib + Local ES
Local Kib + Remote ES (in Cloud)

(all suites were ran 3 times, with very similar results so I'm only showing one of the 3 results for each)

Local Kibana on a local ES - 100 users

Without compression

With compression

Analysis

This was surprising. Given that compression/decompression is done outside of the main event loop in node, I was expecting a negligible impact on performance. Tests show that this is far from negligible when Kibana/ES are under heavy load in that scenario.

Overall mean is +50% when compression is enabled (900 vs 600). When drilling down, it vary between +15% and +100% depending on the requests. Supposedly, the biggest the response, the highest the difference in mean value.
Min value is not affected much for any endpoint
Overall 50th pct and 75th pct are doubled (!) . When drilling down, the results vary a lot, from no difference to twice the response time.
95th and 99th pcts are less impacted than the previous ones, with approx 25% difference between compression and no compression.

Local Kibana on a Cloud ES - 200 users

Without compression

With compression

Analysis

The results are way more acceptable than during local testing, which tend to indicate that either the bottleneck during local testing was more on the ES side than on the Kibana side, or that adding real latency for the communications (that can't be reproduced when hitting the local loopback in a local-to-local scenario) reduces the compression/decompression overhead significantly.

Min value is not affected much for any endpoint
The 50th pct is 25% higher with compression enabled, and mostly consistent for each endpoint
75th, 95th and 99th are almost the same with or without compression

Conclusion

Should we set elasticsearch.compression to true for all Cloud instances?

Good question, and not sure who should be answering it. We should probably reach out to Cloud to decide whether the bandwidth reduction is worth the performance impact?

@stacey-gammon @lukeelmers wdyt?

Should we allow customers to configure elasticsearch.compression on Cloud

If the answer to the previous question is 'no', then ihmo yes. We confirmed that it works correctly and that the performance impact is not significant enough to restrict the usage of this option.

mshustov commented 2 years ago

This was surprising. Given that compression/decompression is done outside of the main event loop in node, I was expecting a negligible impact on performance. Tests show that this is far from negligible when Kibana/ES are under heavy load in that scenario.

Maybe you faced this problem with many parallel requests initiated by kibana-load-testing? https://nodejs.org/docs/latest-v16.x/api/zlib.html#threadpool-usage-and-performance-considerations Have you tried to reduce the number of parallel connections?

Should we set elasticsearch.compression to true for all Cloud instances?

Maybe we can spend some additional time testing different compression settings? I'm wondering if changing setting compression level to zlib.constants.Z_BEST_SPEED can help us a lot. from the nodejs docs https://nodejs.org/docs/latest-v16.x/api/zlib.html#compressor-options

The speed of zlib compression is affected most dramatically by the level setting. A higher level will result in better compression, but will take longer to complete. A lower level will result in less compression, but will be much faster.

@delvedor Does the ES client always uses compression if specified? Does it make sense to add a threshold? Overhead on compression of small objects can be higher than the benefits from sending a small chunk.

pgayvallet commented 2 years ago

Maybe you faced this problem with many parallel requests initiated by kibana-load-testing

Fairly possible. I did reduce the number of concurrent users to 100 on local testing though (as 200 users was just too much for my machine apparently, either with or without compression I was getting lot of errors once the ramp up was over)

Maybe we can spend some additional time testing different compression settings? I'm wondering if changing setting compression level to zlib.constants.Z_BEST_SPEED can help us a lot

Worth a try, but as you already mentioned, I don't think we have control over the compression configuration the elasticsearch client is using (at least atm)? @delvedor could you confirm that?

pgayvallet commented 2 years ago

Maybe we can spend some additional time testing different compression settings?

Also after thinking a bit more about it, this should only impact the compression performance of the requests toward ES, not the decompression of the responses (as we don't have control over the compression settings of the ES server), and given that responses are significantly higher in length than requests, I'm not sure this will impact the benchmark that much.

mshustov commented 2 years ago

Also after thinking a bit more about it, this should only impact the compression performance of the requests toward ES, not the decompression of the responses (as we don't have control over the compression settings of the ES server), and given that responses are significantly higher in length than requests, I'm not sure this will impact the benchmark that much.

I didn't benchmark this, but it seems that decompression is way faster than compression. From http://facebook.github.io/zstd/ 2022-02-15_11-59-50

stacey-gammon commented 2 years ago

I think we should start with allowing customers to turn this setting on, so we can test it first on internal clusters. Then we should have a way to compare the performance and the data transfer rates to give us confidence that turning it on for all clusters won't cause issues.

pgayvallet commented 2 years ago

Config option was added in https://github.com/elastic/kibana/pull/124009 for v8.1.0+

@stacey-gammon should I open a PR to add the config option to cloud's allowlist?

stacey-gammon commented 2 years ago

yes, that'd be great!

pgayvallet commented 2 months ago

Unassigning as no longer actively working on this

elastic / kibana