This is mostly a wrapper on top of the FrequentItems sketch, but we have made the following two modifications to make it more user friendly:
The sketch requires k to be a power of two. Instead of making this a user requirement, we round k up to the next power of two and truncate the results down to k entries.
The underlying map used by the sketch has a load factor of 0.75, which causes the approximation to kick in before there are k elements. We changed this to keep an exact histogram/map and switch over to the sketch only once the number of entries exceeds k.
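As a rough illustration of the two modifications above, here is a minimal Python sketch of the control flow. It is not the code in this PR: the class `ApproxHistogram`, the helper `next_power_of_two`, and the `_shrink` step are hypothetical names, and a simple Misra-Gries style reduction stands in for the DataSketches FrequentItems sketch. Unlike the real sketch, this stand-in only starts losing precision once the key count exceeds the rounded-up capacity.

```python
from collections import Counter


def next_power_of_two(k: int) -> int:
    """Return the smallest power of two that is >= k (k must be positive)."""
    if k < 1:
        raise ValueError("k must be positive")
    return 1 << (k - 1).bit_length()


class ApproxHistogram:
    """Hypothetical wrapper: exact counts up to k keys, bounded summary after."""

    def __init__(self, k: int):
        self.k = k                            # size requested by the user
        self.capacity = next_power_of_two(k)  # size given to the stand-in sketch
        self.counts = Counter()
        self.exact = True                     # True while no key has been dropped

    def update(self, item, count: int = 1) -> None:
        self.counts[item] += count
        # Leave exact mode only once the number of distinct keys exceeds k.
        if self.exact and len(self.counts) > self.k:
            self.exact = False
        # In approximate mode, keep memory bounded by the rounded-up capacity.
        if not self.exact and len(self.counts) > self.capacity:
            self._shrink()

    def _shrink(self) -> None:
        # Stand-in for the sketch's purge: subtract the smallest count from
        # every key and drop the keys that reach zero (Misra-Gries style).
        floor = min(self.counts.values())
        self.counts = Counter(
            {key: c - floor for key, c in self.counts.items() if c > floor}
        )

    def result(self) -> dict:
        # Truncate down to the k entries the user asked for.
        return dict(self.counts.most_common(self.k))
```

With `k = 3`, this stand-in allocates a capacity of 4, returns exact counts while there are at most 3 distinct keys, and never returns more than 3 entries from `result()`.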
Why / Goal
The current histogram operation uses unbounded memory and isn't stable for production use cases.
Test Plan
We've tested everything end to end through backfills and the CLI on the batch side.
Summary
Adds an APPROX_HISTOGRAM_K operation based on the FrequentItems Sketch: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
Checklist
Reviewers