apache / datasketches-cpp

Core C++ Sketch Library
https://datasketches.apache.org
Apache License 2.0

Implement t-Digest #409

Closed AlexanderSaydakov closed 6 months ago

AlexanderSaydakov commented 9 months ago

We would like to implement t-Digest based on the following paper: Ted Dunning, Otmar Ertl. Extremely Accurate Quantiles Using t-Digests and the following implementation in Java: https://github.com/tdunning/t-digest

AlexanderSaydakov commented 9 months ago

There is an obstacle. The cdf(x) method in the reference Java implementation throws AssertionError in some cases. An example to reproduce the error was submitted as an issue: https://github.com/tdunning/t-digest/issues/215

AlexanderSaydakov commented 9 months ago

Also unclear why stable sort is needed in merge operation

tdunning commented 9 months ago

There are two questions here. One is about the case of updates with weight >1 and the other is about stable sorting.

The first issue is inherent in adding samples with greater than unit weight. Unless the digest can split such a sample into multiple updates, the invariant can easily be violated. In the test in question, a single sample at 1 is given with weight 2. Inserting this as-is results in a single centroid with weight 2, which violates the invariant that every centroid must either have properly scaled weight or have weight 1. If we were allowed to split this into two samples at the same point, the invariant could be preserved. The problem with that is that we don't really know whether this is two samples at the same value or two samples with that mean value. That might be resolved by documenting that the method should not be used to enter mean values.
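For illustration, a minimal sketch of the splitting idea, assuming an update(value) method like the one discussed later in this thread (the helper itself is hypothetical):

```cpp
#include <cstdint>

// Hypothetical helper: split a sample of integer weight w into w
// unit-weight updates so that no single insertion can create an
// oversized centroid that violates the invariant. This treats the
// input as w independent samples at the same value, which is the
// interpretation suggested above.
template <typename TDigest>
void add_weighted(TDigest& td, double x, uint64_t w) {
  for (uint64_t i = 0; i < w; ++i) {
    td.update(x);  // each value enters the buffer with weight 1
  }
}
```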

Regarding the stable sort, this is also required to avoid violating the invariant. Take the case where you have a digest generated from a bunch of samples at the same point. These samples will be gathered into centroids as the invariant allows, but all centroids created that way will have the same mean value. If you add a bunch more samples and (stably) sort the new samples into the old centroids, things will be fine as long as the order of the old centroids is preserved, since you can add unit-weight centroids anywhere you like without violating the invariant. On the other hand, if you use an unstable sort, you can move the big centroids away from the center of the distribution and thus violate the invariant.

These sorts of things are the biggest reason for having so many tests that focus on repeated values.

tdunning commented 9 months ago

I should add that since you appear to be using cpp for this new implementation, you can sort entire structures. That can radically simplify your work since you can use the standard sorting utilities instead of sorting indexes.
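For example, with a simple centroid struct (an assumed layout, not necessarily what the new implementation uses), std::stable_sort does this directly:

```cpp
#include <algorithm>
#include <vector>

// Assumed centroid layout; the actual implementation may store means
// and weights in parallel vectors instead.
struct centroid {
  double mean;
  double weight;
};

// Sorting whole structures avoids the index-sorting gymnastics needed
// in Java. std::stable_sort preserves the relative order of centroids
// with equal means, which is what protects the invariant when many
// samples share the same value.
void sort_by_mean(std::vector<centroid>& centroids) {
  std::stable_sort(centroids.begin(), centroids.end(),
                   [](const centroid& a, const centroid& b) {
                     return a.mean < b.mean;
                   });
}
```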

AlexanderSaydakov commented 9 months ago

As I understand it, we should not even provide a public add() method with a weight. Of course I am going to call std::sort or std::stable_sort. I am afraid I do not quite get the importance of stable sorting, or what exactly the invariant is.

tdunning commented 9 months ago

The need for stability comes from the need to preserve the t-digest invariant which is defined by a scale function.

The scale function maps q (the quantile) to k (a scaled version of the quantile). Scale functions look something like this:

[image: plot of a scale function]

The t-digest invariant says that the k-size of any centroid must be less than one, except if the weight of the centroid is 1.

The k-size of a centroid is k(q_right) - k(q_left), where q_left is the value of q on the left side of the centroid and q_right is the value on the right. Because the scale function is steep near q=0 and q=1, the k-size of a cluster with constant weight is bigger near the ends than near the middle. Defining the bound in terms of the scale function is what constrains the centroid sizes so that interpolation near these extremes gives good results. The invariant also bounds the size of the t-digest for suitable scale functions (see https://arxiv.org/pdf/1903.09921.pdf for details).
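For concreteness, here is a sketch of the invariant check using the K_2 scale function (mentioned later in this thread as the default); the normalization term is taken from the paper linked above, so treat it as an assumption:

```cpp
#include <cmath>

// K_2 scale function from the paper: k_2(q) = delta / (4 ln(n/delta) + 24)
// * ln(q / (1 - q)), where delta is the compression parameter and n is the
// total weight. Note that k_2 diverges at q = 0 and q = 1, which is why the
// extreme centroids rely on the weight-1 exception below.
double k2(double q, double delta, double n) {
  return delta / (4.0 * std::log(n / delta) + 24.0) *
         std::log(q / (1.0 - q));
}

// A centroid covering quantiles [q_left, q_right] satisfies the invariant
// if it holds a single sample or its k-size is less than one.
bool satisfies_invariant(double q_left, double q_right, double weight,
                         double delta, double n) {
  return weight == 1.0 ||
         (k2(q_right, delta, n) - k2(q_left, delta, n)) < 1.0;
}
```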

This is all very simple when all of the samples that we use to create the t-digest are distinct because that means all of the centroids will have different means and sorting will work predictably.

But what if we have 1000 samples all at the same point? The sizes of the centroids in the resulting t-digest near the center will be bigger than the ones at the edge. That's good. But if we sort those centroids purely by their means (which are all the same) with an unstable sort, they could get all jumbled. That could put the biggest centroids near the edge where their k-size would be much larger than it was near the center.

Does this make sense?


AlexanderSaydakov commented 9 months ago

Yes, thank you very much.

AlexanderSaydakov commented 9 months ago

[plot: TDigest(100) rank error, uniform random input]

AlexanderSaydakov commented 9 months ago

[plot: TDigest(100) update time, random input]
tdunning commented 9 months ago

Those look pretty sweet!

Latency shows a nice hard upper bound and accuracy looks identical to the java reference implementation.

I should get you to finish my Julia implementation!

AlexanderSaydakov commented 9 months ago

I checked in the code as a branch here: https://github.com/apache/datasketches-cpp/tree/tdigest

Not done yet: the get_quantile(rank) method and the merge(other) method. This is equivalent to MergingDigest with the default K_2 scale function. I am not sure whether we want other flavors and scale functions. Also, I don't quite understand a couple of expressions in get_rank() (cdf() in Java), marked with "// ?".

AlexanderSaydakov commented 9 months ago

Also, I am not sure that the buffer needs to have weights, since the buffered input values have a weight of 1. Perhaps it is a tradeoff to avoid allocations during compaction (and, perhaps, during merging of t-digests, which I have not implemented yet).

tdunning commented 9 months ago

The buffer has weights to increase the symmetry in the code (everything has weights). It makes every kind of sorting and merging look the same.
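In other words (an assumed layout, sketched for illustration), a buffered raw value is just a weight-1 centroid:

```cpp
#include <vector>

// Assumed layout: the buffer holds (mean, weight) pairs just like the
// compacted centroids, so the same sort-and-compress code path handles
// buffered raw values, compaction, and digest-to-digest merging alike.
struct centroid {
  double mean;
  double weight;
};

void buffer_value(std::vector<centroid>& buffer, double x) {
  buffer.push_back({x, 1.0});  // a raw sample is a weight-1 centroid
}
```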


AlexanderSaydakov commented 9 months ago

characterization code: https://github.com/apache/datasketches-characterization/pull/60

AlexanderSaydakov commented 9 months ago

I am looking at the code to merge t-digests. I don't like that a modifying operation, compress(), is called on the input t-digests, and I wonder whether this is avoidable. I would guess this is done for simplicity, to avoid dealing with the unmerged buffer in the incoming t-digests; however, the compress() method forces compression even if the buffer is empty.

It is common practice in C++ to pass input objects by const reference. Requiring a mutable input is quite ugly, and forcing compression unnecessarily is even worse. I am not sure how to deal with this.

tdunning commented 9 months ago

The compression on the inputs is not necessary. It may have a positive effect on accuracy in a few cases, but I strongly doubt the effect will be important. In practice, most of the digests being merged will previously have been persisted and thus will have already been compressed.

Just allocate a buffer big enough for all the live centroids in all of the merging digests, copy in the centroids and sort/compress. A final copy to a trimmed buffer is a nice touch, but only necessary if the intermediate buffer is massive.

An alternative would be to simply pour the centroids in, one digest at a time. Putting in part of a digest at a time will lead to situations where the invariant may be violated transiently until you get the rest of the data in. I haven't analyzed that scenario in detail so I can't say that it will be completely safe. The case where you add an entire t-digest at a time has been analyzed, however, and is safe. This extends to adding multiple complete digests.

I like and sympathize with keeping arguments const and might well have preferred that approach were I to do it over again.
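A sketch of that strategy with const inputs (the accessors here, get_centroids() and get_buffered_values(), are hypothetical stand-ins for whatever the branch actually exposes):

```cpp
#include <algorithm>
#include <vector>

struct centroid {
  double mean;
  double weight;
};

// Gather everything from const inputs into one big buffer, then run the
// usual sort-and-compress pass over the result; the inputs are never
// mutated, so no forced compress() is needed.
template <typename TDigest>
std::vector<centroid> gather_for_merge(const std::vector<TDigest>& digests) {
  std::vector<centroid> all;
  for (const TDigest& d : digests) {
    for (const centroid& c : d.get_centroids()) all.push_back(c);
    // unmerged buffered values come along as weight-1 centroids
    for (double x : d.get_buffered_values()) all.push_back({x, 1.0});
  }
  std::stable_sort(all.begin(), all.end(),
                   [](const centroid& a, const centroid& b) {
                     return a.mean < b.mean;
                   });
  return all;  // ready for a single compaction pass
}
```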


AlexanderSaydakov commented 8 months ago

I implemented merge without calling compress() on the input, adding one t-digest at a time. Here is what happens to the rank error after a merge. The plot is too crowded; I can try plotting some things separately later. The rank error seems a bit worse compared to a single t-digest, but does not seem to get worse as the number of t-digests increases (it even improves a little). I tried 32, 64 and 128 (the number does not have to be a power of 2, just a habit).

[plot: TDigest(100) rank error after merge, uniform random input]

AlexanderSaydakov commented 8 months ago

And this is around rank 0.95

[plot: TDigest(100) rank error after merge, near rank 0.95, uniform random input]

AlexanderSaydakov commented 8 months ago

Now 32-way merge only

[plot: TDigest(100) rank error after 32-way merge, near rank 0.99, uniform random input]

tdunning commented 8 months ago

I would recommend a higher default compression factor than 100.

But otherwise, your results look good. The reason for the increased accuracy in some cases is that by merging you are effectively keeping a lot of extra centroids in all of the t-digests that you are merging together. This was the motivation behind the way that the merging digest doesn't fully compact the centroids in the buffer until persisting the digest. This has a cost in terms of buffer space (and thus requires a few percent more merges) but wins in accuracy.


AlexanderSaydakov commented 8 months ago

Well, I am not sure I understand this argument about increased accuracy due to keeping more centroids. If so, why does a single digest do better? Perhaps I should measure with just a few t-digests.

tdunning commented 8 months ago

Hmm... I misunderstood your scenario.

When you merge digests do you merge all of their centroids at once into a single buffer?


AlexanderSaydakov commented 8 months ago

I measure like this: for each stream length (16 points per octave), spray the input in round-robin fashion across N sketches, then create a fresh sketch, merge all N sketches into it, and measure the rank error (a sketch of this loop follows the plot below). Here is the result for just 2 sketches:

[plot: TDigest(100) rank error after 2-way merge, near rank 0.99, uniform random input]
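A sketch of that measurement loop (the tdigest API names follow ones mentioned in this thread; the header path and exact signatures are assumptions):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

#include "tdigest.hpp"  // assumed header for the branch's tdigest

// Spray stream_length uniform values across num_sketches sketches in
// round-robin fashion, merge all of them into a fresh sketch, and
// measure the rank error at a fixed point.
double measure_merge_error(std::size_t stream_length,
                           std::size_t num_sketches) {
  std::mt19937_64 gen(1);
  std::uniform_real_distribution<double> dist(0.0, 1.0);

  std::vector<datasketches::tdigest<double>> sketches(
      num_sketches, datasketches::tdigest<double>(100));
  for (std::size_t i = 0; i < stream_length; ++i) {
    sketches[i % num_sketches].update(dist(gen));
  }

  datasketches::tdigest<double> merged(100);
  for (auto& s : sketches) merged.merge(s);

  // for uniform input on [0,1), the true rank of x is x itself
  const double x = 0.99;
  return std::abs(merged.get_rank(x) - x);
}
```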
AlexanderSaydakov commented 8 months ago

Merging 2 digests in Java has this unexpected result (I was even afraid that I had misconfigured something, but I believe I double-checked):

[screenshot: rank error after 2-way merge, Java implementation]
AlexanderSaydakov commented 8 months ago

Added a 4-way merge:

[screenshot: rank error including 4-way merge]
AlexanderSaydakov commented 8 months ago

[screenshot]

AlexanderSaydakov commented 8 months ago

[screenshot]

AlexanderSaydakov commented 8 months ago

[screenshot]
tdunning commented 8 months ago

These are fascinating (and depressing).

I would strongly recommend you shift to using δ=200 instead of 100 for practical use.

But otherwise, you are digging up some interesting and surprising phenomena.


AlexanderSaydakov commented 8 months ago
[screenshot]
AlexanderSaydakov commented 8 months ago

@tdunning could you explain why forced compression is needed (even when the buffer is empty)?

jmalkin commented 7 months ago

I think we can close this issue now that it's merged?

AlexanderSaydakov commented 7 months ago

Perhaps. There might still be a few things to decide. For instance, I did not change the default from 100 to 200 as @tdunning suggested; there was no explanation why. Maybe we should not have any default? tdigest<float>(100) is about 10KB in memory; tdigest<double>(100) is about 20KB.

tdunning commented 7 months ago

There is a typo in your last comment.

But, yes, the footprint with δ = 200 is twice the footprint for smaller values. The issue is that I changed the definition slightly a while ago, and a default of 200 matches people's accuracy expectations better than δ = 100 does. I prefer it if people who use the default aren't negatively surprised by the accuracy, so that's the default I recommend.

Note that the persisted size will be substantially smaller since persistence trims away the incoming buffer and does a final compression.

AlexanderSaydakov commented 7 months ago

That was not a typo, but invisible <float> and <double> because of angle brackets. I have just fixed that.

tdunning commented 7 months ago

Ahhhh!

Makes MUCH more sense. In any case, changing 100 to 200 will roughly double the memory footprint, and also the disk footprint (which will be much smaller).
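As a rough worked estimate (extrapolating from the sizes quoted above, so the exact numbers are assumptions): if tdigest<double>(100) is about 20KB, then doubling δ to 200 gives on the order of 40KB for tdigest<double>(200) and 20KB for tdigest<float>(200), since the number of retained centroids and buffer slots scales roughly linearly with δ.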

AlexanderSaydakov commented 7 months ago

I wonder whether we should have a default at all. What if we just let the user decide?

tdunning commented 7 months ago

It's a reasonable approach. I wanted users to have a reasonable starting point if they didn't want to learn about the knobs, so I provided a default.

YMMV in your own package. I think you will need to at least provide a recommendation.

AlexanderSaydakov commented 7 months ago

OK, let's keep the default. I changed it to 200, and therefore the internal compression is 400. Is this what you had in mind?

tdunning commented 6 months ago

Yes.

Exactly.
