Closed AlexanderSaydakov closed 4 days ago
An alternative to enabling compression by default would be to make it configurable by the user: per column, per table, per installation, or in some other way. I am not sure this extra complexity is needed.
How much CPU overhead does the compression come with?
Sorry, I could not find the measurements in Java. I will run them again, but that takes quite a while. Here are measurements in C++, just to give some idea.
This time covers only converting sketches to bytes.
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Theta sketch compression has been available for quite some time in the Apache DataSketches library. I would suggest enabling it in Druid. The simplest way would be to start serializing Theta sketches in the compressed format. Deserialization automatically detects and supports that format starting from datasketches-java-4.0.0 and datasketches-cpp-4.1.0 (May 2023). There is some overhead in converting sketches to bytes, but in an I/O-bound system this is usually a reasonable CPU vs. I/O tradeoff. In other words, compression reduces I/O (and storage cost) by spending more CPU, which is likely to yield an overall benefit.
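To give a feel for the CPU vs. I/O tradeoff, here is a small self-contained Java sketch. It is only an illustration of the general idea (sorted hash values stored as small deltas), not the DataSketches wire format and not Druid code. The class `DeltaVarintDemo`, its methods, and the choice of varint encoding are all assumptions made up for this example; the library's actual packing scheme may differ.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;

// Hypothetical demo class: NOT the DataSketches format, just the general idea.
class DeltaVarintDemo {

    // Encode sorted, non-negative 64-bit values as varint-encoded deltas.
    public static byte[] compress(long[] sortedHashes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long h : sortedHashes) {
            long delta = h - previous; // non-negative because the input is sorted
            while ((delta & ~0x7FL) != 0) {
                out.write((int) ((delta & 0x7F) | 0x80)); // 7 payload bits + continuation bit
                delta >>>= 7;
            }
            out.write((int) delta);
            previous = h;
        }
        return out.toByteArray();
    }

    // Reverse of compress(): read varint deltas and rebuild the running sum.
    public static long[] decompress(byte[] bytes, int count) {
        long[] result = new long[count];
        int pos = 0;
        long previous = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            while (true) {
                int b = bytes[pos++] & 0xFF;
                delta |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    break;
                }
                shift += 7;
            }
            previous += delta;
            result[i] = previous;
        }
        return result;
    }

    // Simulated retained hashes: uniform values below 2^55, sorted.
    // The 2^55 bound stands in for a sketch whose theta is well below 1.
    public static long[] randomSortedHashes(int count, long seed) {
        Random rnd = new Random(seed);
        long[] hashes = new long[count];
        for (int i = 0; i < count; i++) {
            hashes[i] = rnd.nextLong() >>> 9;
        }
        Arrays.sort(hashes);
        return hashes;
    }

    public static void main(String[] args) {
        long[] hashes = randomSortedHashes(4096, 42L);
        long start = System.nanoTime();
        byte[] packed = compress(hashes);
        long elapsedNanos = System.nanoTime() - start;
        System.out.println("raw bytes:        " + (hashes.length * 8));
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("encode time (ns): " + elapsedNanos);
    }
}
```

The extra loop per entry is the CPU cost; the smaller byte array is the I/O and storage saving. How large the saving is depends on how many entries are retained and how small theta is, so real numbers should come from measurements on actual sketches.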