Open shubhanshu02 opened 2 weeks ago
Yeah, that is the correct way to use it when streaming data, and this is what I currently do.
We can add an `update` method to wrap this behaviour if that's useful, or if you would like to contribute then I'm happy to review a PR.
Great, thanks for confirming. Sure, I can make a pull request.
There is another form of this algorithm mentioned in the t-digest paper, where we first buffer the data points for some time and then merge the two t-digests. (Page 4, first paragraph)
One version keeps a buffer of incoming samples. When the buffer fills, the contents are sorted and merged with the centroids computed from previous samples. This merging form of the t-digest algorithm has the virtue of allowing all memory structures to be allocated statically. On an amortized basis, this buffer-and-merge algorithm can be very fast especially if the input buffer is large.
This results in better control over accuracy, speed and memory if we buffer-and-merge with different compression factors (stratified merge on Page 10).
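For illustration, the buffer-and-merge variant quoted above can be sketched roughly as follows. This is a simplified toy in pure Python, not this library's implementation: the class name `BufferedDigest`, its parameters, and the interpolation used in `quantile` are assumptions made for the example. The merge pass follows the greedy scheme from the paper with the arcsine scale function.

```python
import math


class BufferedDigest:
    """Toy buffer-and-merge summary (hypothetical, for illustration only)."""

    def __init__(self, delta=100.0, max_buffer_size=500):
        self.delta = delta                  # compression factor
        self.max_buffer_size = max_buffer_size
        self.centroids = []                 # sorted list of (mean, weight)
        self.buffer = []                    # raw samples not yet merged

    def add(self, x):
        """Buffer one sample; merge when the buffer is full."""
        self.buffer.append(float(x))
        if len(self.buffer) >= self.max_buffer_size:
            self.flush()

    def _k(self, q):
        # Arcsine scale function: maps quantile q to a "k index".
        q = min(1.0, max(0.0, q))
        return self.delta / (2 * math.pi) * math.asin(2 * q - 1)

    def _k_inv(self, k):
        # Inverse of _k, clamped at the upper end of its range.
        if k >= self.delta / 4:
            return 1.0
        return (math.sin(2 * math.pi * k / self.delta) + 1) / 2

    def flush(self):
        """Sort buffered samples together with the centroids and re-merge."""
        if not self.buffer:
            return
        items = sorted(self.centroids + [(x, 1.0) for x in self.buffer])
        self.buffer = []
        total = sum(w for _, w in items)
        merged = []
        mean, weight = items[0]
        q0 = 0.0
        q_limit = self._k_inv(self._k(q0) + 1)
        for m, w in items[1:]:
            if q0 + (weight + w) / total <= q_limit:
                # Still within the size limit: absorb into current centroid.
                mean += (m - mean) * w / (weight + w)
                weight += w
            else:
                # Limit reached: emit current centroid and start a new one.
                merged.append((mean, weight))
                q0 += weight / total
                q_limit = self._k_inv(self._k(q0) + 1)
                mean, weight = m, w
        merged.append((mean, weight))
        self.centroids = merged

    def quantile(self, q):
        """Estimate a quantile by interpolating between centroid means."""
        self.flush()
        if not self.centroids:
            raise ValueError("no data")
        total = sum(w for _, w in self.centroids)
        target = q * total
        cum = 0.0
        prev_mid = prev_mean = None
        for mean, w in self.centroids:
            mid = cum + w / 2
            if target <= mid:
                if prev_mid is None:
                    return mean
                frac = (target - prev_mid) / (mid - prev_mid)
                return prev_mean + frac * (mean - prev_mean)
            prev_mid, prev_mean = mid, mean
            cum += w
        return self.centroids[-1][0]
```

A larger `max_buffer_size` means fewer merge passes at the cost of holding more raw samples in memory, which is the trade-off the quoted paragraph describes.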
Here, we can have the caller specify these values while keeping some default values:
1. `max buffer size`: Maximum data points to collect in buffer before merging.
2. `delta`: compression factor to use for creating t-digest objects
3. `merge delta`: compression factor to merge t-digest objects

1 and 3 will only be required when the caller uses the `update` method. Otherwise, they will not be required. What do you think about this?
I think it would be best to keep the `update` method simple and leave it to the user to buffer the incoming data prior to calling `update`. Having the distinction between the `delta` and `merge_delta` makes sense here though, I think.
Generally we're keen to keep this library small and focused but if there are other useful building blocks needed for this alternative approach we can consider adding them.
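As a rough sketch of what caller-side buffering could look like, the helper below collects values from a stream and hands them over in batches. The name `feed_in_batches` and its parameters are hypothetical, and `update` is just any callable that accepts a list of values; it does not assume a particular method on this library.

```python
def feed_in_batches(stream, update, max_buffer_size=1000):
    """Collect values from `stream` and pass them to `update` in batches.

    `update` is any callable taking a list of values (an assumption made
    for this example, not an existing API).
    """
    buffer = []
    for value in stream:
        buffer.append(value)
        if len(buffer) >= max_buffer_size:
            update(buffer)
            buffer = []  # start a fresh list so earlier batches are untouched
    if buffer:
        # Hand over whatever is left once the stream ends.
        update(buffer)
```

If the eventual `update` method expects a NumPy array, the caller could wrap it, for example with a small lambda that converts each batch using `np.asarray` before passing it on.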
So, if I understood properly, this is the API we are trying to provide here:
function update(arr: numpy array):
    // takes the buffer of numbers and merges them with the current t-digest
    ...
Does this look good to you?
Yeah, I think that's all we need. Then just expose the `delta` and `merge_delta` parameters as you suggested.
def update(self, array: np.ndarray, delta: float, merge_delta: float)
Got it. Thanks.
I am working on a streaming system where I need to calculate the statistics of a metric like percentiles and medians. While the data I am getting is a stream, I want to query the percentiles at certain intervals.
Similar to the Python t-digest library (https://github.com/CamDavidsonPilon/tdigest), which provides an option to update the t-digest with the `digest.update(value)` function, does this library expose any function to add the data as it becomes available?
Below is one way I can think of for achieving this. Is there any better way of doing this?