[Research] Data format improvements for charting (arrow) #175695

Open thomasneirynck opened 6 months ago

thomasneirynck commented 6 months ago

Currently, the majority of charts use the default JSON output from Elasticsearch. By default these responses have a row-like layout (in the case of ES|QL or doc search) or a nested layout (in the case of aggs).

Internally, Kibana reformats these into something more usable, e.g. a format understood by elastic/charts, nested-array tables for easier ergonomics, etc.

These client-side reformattings introduce an overhead.

Is it possible to have a more efficient pipeline, by reducing network traffic, by reducing reconversions, or both?


Investigate the impact of data format on Kibana data visualization (specifically, Lens & Dashboard).

Consider both the context of:

Consider alternatives:

elasticmachine commented 6 months ago

stratoula commented 5 months ago

@nik9000 was working during their ON week on exposing the ES|QL results in Arrow format. I think it is awesome to continue investigations on this front. It can make our visualizations much more performant and dense!

drewdaemon commented 5 months ago

Agree, both in terms of performance and complexity.

markov00 commented 3 months ago

ppisljar commented 2 months ago

adding (basic) arrow support to expressions: https://github.com/elastic/kibana/pull/183909

This showcases that it is not very hard to convert from arrow to datatable and vice versa, which would allow us to gradually migrate our code to the new format.

thomasneirynck commented 1 month ago

Thanks @ppisljar for https://github.com/elastic/kibana/issues/175695#issuecomment-2123932078. This was super useful.

Below is a follow-up from an offline convo with @markov00 and @ppisljar. Apart from this initial look into Arrow, there are a few more open questions. I think it might also be useful to recap some of the underlying reasons for this research for wider visibility.

1. We should build up our knowledge of Arrow because of its strategic value in contemporary tech stacks

Arrow has strategic value because it is one of the main data-interchange formats for interprocess data analytics (e.g. ML with pandas in Python), GPU-based charting (e.g. dense scatterplots), or client-side analytics in a web context (e.g. duckdb-wasm https://duckdb.org/docs/api/wasm/overview.html).

For that reason alone, it is important to gain a better understanding of this format.

From @ppisljar's initial investigation (https://github.com/elastic/kibana/pull/183909), the short-term takeaway seems to be that a "backend swap" of JSON for Arrow may not be hard technically, but it would not be the right choice in the short term, for two reasons: (a) poor client support in the browser (e.g. having to use unsafe-eval); (b) the existing data pipeline in Lens (i.e. "expressions"), which needs to marshal the data into a new table and which does some intermediate data enrichment, requires a full read of the Arrow table, remarshaling everything to JSON anyway. This conversion is slow.

Kibana has a very low investment in GPU technology today (except flamegraph and maps), and introducing a new model of client-side analytics (e.g. one which runs in WASM with duckdb) is not directly on the horizon either. imho it is OK to postpone further investigation into these long-term topics. We can always pick those aspects up once they become more tactically relevant (e.g. when scatterplots are prioritized).

What is not answered though is whether:

2. JSON vs Binary. Is there any low-hanging fruit for saving space in size of data transmitted over the wire?

Arrow is just one example of a binary format. Other examples could be CBOR or Smile, which are supported by Elasticsearch.

Any gains we can make in transfer format can be meaningful, especially since it would get our stack closer to delivering data in a streaming fashion: i.e. an Elasticsearch response stream should just be streamed back as-is to the browser, without further modification, especially if that modification is redundant.

It seems there is some additional processing in Kibana server (specifically for async searches (?)), which would prevent us from doing this.

Whether the Elasticsearch-js client supports formats other than JSON is imo less relevant. Users can always unpack the data manually by using the as-stream option (https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/as_stream_examples.html). If we can demonstrate value, we can always push this support down in the ES-client as well.

So for this, I think we are missing answers to:

3. column vs row layout ("visualization friendly" format)

This is more of an orthogonal issue to binary/JSON. This is about data layout.

This would need to be investigated in the context of expressions and elastic/charts.

I believe this is already the default for ES|QL (?).
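
To make the row vs column distinction concrete, here is a minimal sketch of pivoting row-oriented records into a column-oriented table. The `Row` and `Columnar` types and the `rowsToColumns` helper are hypothetical, made up for illustration; they are not the actual ES|QL response shape or the Kibana datatable type.

```typescript
// Hypothetical shapes for illustration only; not the real ES|QL or Kibana types.
type Cell = string | number | null;
type Row = Record<string, Cell>;
type Columnar = { columns: string[]; values: Cell[][] };

// Pivot row-oriented records into one array per column.
// Missing fields become null so every column has the same length.
function rowsToColumns(rows: Row[], columns: string[]): Columnar {
  const values = columns.map((name) => rows.map((row) => row[name] ?? null));
  return { columns, values };
}

const rows: Row[] = [
  { host: "a", cpu: 0.4 },
  { host: "b", cpu: 0.9 },
];

// table.values[0] holds every "host" value, table.values[1] every "cpu" value.
const table = rowsToColumns(rows, ["host", "cpu"]);
console.log(table);
```

A charting layer that wants one array per axis could read `table.values[i]` directly, which is the kind of reconversion this item asks whether we can avoid doing client-side.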

4. ES|QL versus DSL

Questions (1) and (3) above are only relevant for ES|QL.

(2) applies to both, and imo is therefore important.

5. The "big picture" - thinning kibana server

The big picture for 1, 2, 3, and 4 is that we should aim to remove as much intermediate, redundant processing of data as possible, especially on Kibana server. Processing in the browser is only felt by one particular user, while load on Kibana server affects all users. Does Lens really need enrichment of data in Kibana server?

e.g.: a performant data-viz architecture [image]

6. Which team does this belong to?

@markov00 raised whether @elastic/kibana-visualizations should own this research. imho, yes, but with an asterisk.

Yes, because visualizations are the main consumer of Elasticsearch-agg responses, and we would expect changes to be motivated by reducing the time it takes to render data on screen in a chart.

The asterisk is that if there are any resource constraints, we can always see if we can distribute these investigations more broadly (e.g. @elastic/kibana-data-discovery, @davismcphee @kertal @lukasolson).

So to recap, I see the following open questions:

1) Size comparison of Arrow versus current ES|QL (using https://github.com/elastic/elasticsearch/pull/104877 may be helpful here)
2) What are the size/performance advantages of CBOR/Smile? Are they supported by ES|QL?
3) What are the blockers to adopting CBOR/Smile? Specifically, what is going on in Kibana server that requires enrichment of the Elasticsearch-agg response? Is it necessary?
4) Is there anything more that needs to be done wrt column-based layouts?

ppisljar commented 1 month ago
  1. Arrow is a binary format, so it will generally be more efficient from a size perspective than JSON. In some tests I did, I saw around a 30% reduction in size. However, an important note here is that we are using gzip compression, and after compression the file sizes are mostly the same, or the Arrow format actually becomes bigger.

  2. Haven't tested this yet, but from resources on the internet it looks like it's similar to Arrow: there is a significant reduction if you don't gzip, but after gzipping the reduction is less noticeable.

http://zderadicka.eu/comparison-of-json-like-serializations-json-vs-ubjson-vs-messagepack-vs-cbor/
https://gist.github.com/kajuberdut/0191ec20f14253094792cd3c00f06257
https://medium.com/@ayushguptadtu/gzip-smile-json-gives-a-better-size-reduction-over-smile-uncompressed-for-sure-6c5060a670a5
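
For anyone wanting to repeat this kind of raw vs gzipped measurement, here is a rough sketch of a harness. Note the assumptions: it does not use Arrow, CBOR, or Smile at all (those need extra libraries); it compares two JSON layouts (row-oriented vs column-oriented) of synthetic, made-up data, and the `sizes` helper is hypothetical. It only shows how to measure before and after gzip, using Node's built-in `zlib`; it says nothing about real Elasticsearch responses.

```typescript
import { gzipSync } from "node:zlib";

// Synthetic data, invented for illustration only.
const n = 1000;
const rowLayout = Array.from({ length: n }, (_, i) => ({
  timestamp: 1700000000000 + i * 60000,
  host: `host-${i % 10}`,
  cpu: (i % 100) / 100,
}));

// Same values, one array per field, so keys are not repeated per row.
const columnLayout = {
  timestamp: rowLayout.map((r) => r.timestamp),
  host: rowLayout.map((r) => r.host),
  cpu: rowLayout.map((r) => r.cpu),
};

// Serialize to JSON, then report byte size before and after gzip.
function sizes(value: unknown): { raw: number; gzipped: number } {
  const raw = new TextEncoder().encode(JSON.stringify(value));
  return { raw: raw.length, gzipped: gzipSync(raw).length };
}

const rowSizes = sizes(rowLayout);
const columnSizes = sizes(columnLayout);
console.log({ rowSizes, columnSizes });
```

Swapping the serializer inside `sizes` for an Arrow or CBOR encoder would give the comparison discussed here; the numbers will depend heavily on the shape of the data.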

vadimkibana commented 1 month ago

The most performant way would be to request data from ES in CBOR and pass it through the Kibana server without any parsing (or minimal parsing) straight to the client. So this is the key question:

Specifically, what is going on in Kibana-server that requires enrichment of the Elasticsearch-agg response? Is it necessary?

If we can make it such that ES CBOR response is passed-through directly to the client-side we will save on request/response copying, UTF8 decoding, JSON decoding, JSON encoding, UTF8 encoding; and all the memory savings if we don't need to hold those intermediate representations.
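
A minimal sketch of the difference between the two paths, assuming a JSON payload as a stand-in (CBOR would need an extra library). `passThrough` and `reencode` are hypothetical helpers written for this comment, not Kibana server code; the point is only to make the intermediate representations visible.

```typescript
// Hypothetical helpers for illustration; not actual Kibana server code.

// Pass-through: hand the upstream bytes to the client untouched.
// No decoding, no new allocations.
function passThrough(upstream: Uint8Array): Uint8Array {
  return upstream;
}

// Parse-and-reserialize: each step allocates an intermediate representation.
function reencode(upstream: Uint8Array): Uint8Array {
  const text = new TextDecoder().decode(upstream); // UTF-8 decoding
  const parsed: unknown = JSON.parse(text); // JSON decoding
  const json = JSON.stringify(parsed); // JSON encoding
  return new TextEncoder().encode(json); // UTF-8 encoding
}

// Stand-in for an Elasticsearch response body.
const upstream = new TextEncoder().encode(
  JSON.stringify({ took: 3, hits: [1, 2, 3] })
);

const forwarded = passThrough(upstream); // same buffer, zero copies
const rebuilt = reencode(upstream); // same content, new buffer
console.log(forwarded === upstream, rebuilt === upstream);
```

When no enrichment happens between decode and encode, `reencode` produces the same bytes as `passThrough` at the cost of a string, an object graph, another string, and another buffer.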

lukasolson commented 1 month ago

If we can make it such that ES CBOR response is passed-through directly to the client-side we will save on request/response copying, UTF8 decoding, JSON decoding, JSON encoding, UTF8 encoding; and all the memory savings if we don't need to hold those intermediate representations.

Related: https://github.com/elastic/kibana/issues/170062

thomasneirynck commented 1 month ago

thx @ppisljar - if arrow is larger gzipped, I think it's another argument against arrow being a pathway for a tactical improvement.

@vadimkibana agreed. The key part of these investigations is whether we can slim down the data pipeline from Elasticsearch all the way to the browser. Reducing the size of the data format (faster delivery, cheaper too), cutting wasted cycles of encoding/decoding (faster), and removing redundant enrichment (wasted processing) are all pathways to get there. Any footprint on Kibana server is particularly bad because it is felt by all users, and any impact from processing doesn't scale favorably due to single-threaded execution (e.g. by delaying other requests, and this compounds).