airbnb / chronon

Chronon is a data platform for serving AI/ML applications.
Apache License 2.0
717 stars 44 forks

Add APPROX_HISTOGRAM_K Operation #735

Closed: jbrooks-stripe closed this 3 months ago

jbrooks-stripe commented 6 months ago

Summary

Adds an APPROX_HISTOGRAM_K operation based on the FrequentItems Sketch: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

This is mostly a wrapper around the FrequentItems sketch, but we have made the following two modifications to make it somewhat more user-friendly:

Why / Goal

The current histogram operation uses unbounded memory and isn't stable for production use cases.
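
To illustrate the memory concern, here is a minimal sketch that contrasts an exact histogram with a bounded-memory summary. It is not the code in this PR and not the DataSketches FrequentItems algorithm: it uses the simpler Misra-Gries summary as a stand-in, and the names `exact_histogram`, `BoundedHistogram`, `capacity`, and `top_k` are hypothetical.

```python
from collections import Counter


def exact_histogram(items):
    """Exact counts: one map entry per distinct value, so memory grows with cardinality."""
    return dict(Counter(items))


class BoundedHistogram:
    """Misra-Gries summary that never holds more than `capacity` counters.

    Illustrative stand-in only; the PR wraps the DataSketches FrequentItems sketch.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Map is full: decrement every counter and drop any that reach zero.
            # The incoming item is not stored, which keeps memory bounded.
            for key in list(self.counts):
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]

    def top_k(self, k):
        """Return the k largest estimated counts (estimates never exceed true counts)."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
        return dict(ranked[:k])


# Two heavy hitters followed by a long tail of values that each appear once.
stream = ["a"] * 50 + ["b"] * 30 + [f"rare-{i}" for i in range(20)]

sketch = BoundedHistogram(capacity=4)
for value in stream:
    sketch.update(value)

print(len(exact_histogram(stream)))  # number of entries the exact map must hold
print(len(sketch.counts))            # never more than `capacity`
print(sketch.top_k(2))
```

The exact map needs one entry per distinct value, while the bounded summary keeps at most `capacity` entries and still surfaces the heavy hitters. The trade-off is accuracy: with this Misra-Gries variant, each reported count can undercount the true frequency by at most n / (capacity + 1) for a stream of n items.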

Test Plan

We've tested everything end to end through backfills and the CLI on the batch side on our end.

Checklist

Reviewers

jbrooks-stripe commented 6 months ago

Let's also add a test to spark/test/FetcherTest.scala

Looks good otherwise. Mostly non-blocking comments.

Appreciate you taking a look. Will address your comments and some from our internal repo shortly.