apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Use Theta sketch compression #15731

Closed AlexanderSaydakov closed 4 days ago

AlexanderSaydakov commented 10 months ago

Theta sketch compression has been available in the Apache DataSketches library for quite some time, and I would suggest enabling it in Druid. The simplest way would be to start serializing Theta sketches in the compressed format. Deserialization automatically detects and supports that format starting from datasketches-java-4.0.0 and datasketches-cpp-4.1.0 (May 2023). Converting sketches to bytes incurs some overhead, but in an I/O-bound system this is usually a reasonable CPU vs. I/O tradeoff: compression reduces I/O (and storage cost) at the expense of more CPU, which is likely to yield an overall benefit.

[Figure: Theta sketch compressed size]
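To give a rough feel for why a sorted list of 64-bit hash values can shrink when serialized, here is a toy Java sketch that stores differences between neighbouring values in the smallest fixed bit width that fits them. This is only an illustration under stated assumptions: the class `DeltaPackingSketch`, its methods, and the random test data are hypothetical, and it does not claim to reproduce the actual DataSketches serialization format.

```java
import java.util.Arrays;
import java.util.Random;

public class DeltaPackingSketch {

    // Number of bits needed per entry when each sorted, non-negative value
    // is stored as the difference from its predecessor (first value vs. 0).
    static int bitsPerDelta(long[] sortedHashes) {
        long previous = 0;
        long ored = 0;
        for (long hash : sortedHashes) {
            ored |= hash - previous; // OR of all deltas bounds the widest delta
            previous = hash;
        }
        return 64 - Long.numberOfLeadingZeros(ored);
    }

    // Estimated payload size in bytes if every delta uses the same bit width.
    static long packedSizeBytes(long[] sortedHashes) {
        long totalBits = (long) bitsPerDelta(sortedHashes) * sortedHashes.length;
        return (totalBits + 7) / 8; // round up to whole bytes
    }

    public static void main(String[] args) {
        // Hypothetical data: pseudo-random non-negative values standing in
        // for the hashes retained by a sketch.
        Random random = new Random(42);
        long[] hashes = new long[4096];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = random.nextLong() >>> 1;
        }
        Arrays.sort(hashes);

        System.out.println("raw bytes:    " + (long) hashes.length * Long.BYTES);
        System.out.println("packed bytes: " + packedSizeBytes(hashes));
    }
}
```

The extra pass over the values to compute and pack deltas is the kind of CPU cost being traded against smaller output in the comment above.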
AlexanderSaydakov commented 10 months ago

An alternative to compressing by default would be to make it configurable by the user: per column, per table, per installation, or in some other way. I am not sure this extra complexity is needed.

abhishekagarwal87 commented 10 months ago

How much of a CPU overhead does the compression come with?

AlexanderSaydakov commented 10 months ago

Sorry, I could not find the measurements in Java. I will run them again, but that takes quite a while. In the meantime, here are measurements in C++ to give some idea.

[Figure: Theta sketch compression time, C++]
AlexanderSaydakov commented 10 months ago

This time covers only the conversion of sketches to bytes.

github-actions[bot] commented 1 month ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] commented 4 days ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.