apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0

Testing equality of HllSketch #157

Closed hpx7 closed 7 years ago

hpx7 commented 7 years ago

The following code snippet fails:

        // Imports used by this snippet:
        import java.util.Arrays;
        import java.util.stream.Stream;
        import com.yahoo.memory.Memory;
        import com.yahoo.sketches.hll.Union;

        Union hll1 = new Union(12);   // lgK = 12
        Stream.of("1", "2", "3", "4", "5", "6", "7", "8", "9").forEach(hll1::update);

        // Round-trip through the compact byte-array form
        Union hll2 = Union.heapify(Memory.wrap(hll1.toCompactByteArray()));

        if (!Arrays.equals(hll1.toCompactByteArray(), hll2.toCompactByteArray())) {
            throw new AssertionError("Hlls not equal");
        }

(using com.yahoo.sketches.hll.Union)

Is this a bug in hll serde?

leerho commented 7 years ago

@hpx7

No. This is not a bug.

Rule: Never make any assumptions about the internal structure of the binary images of two different sketches. Period. This rule applies to all the sketches in the DataSketches library, and a byte-wise "equals" is exactly such an assumption: it assumes a particular internal byte ordering.

There are good reasons why we do not provide a .equals(sketch) or .hashCode() for sketches in the library.

The best policy is not to second-guess what is actually going on inside the sketch, and to follow the rule above.


In your code snippet, if you had compared hll1.getEstimate() to hll2.getEstimate(), you would have found that the sketches produce exactly the same estimate. But I don't recommend performing exact comparisons either, as small probabilistic errors can make the estimates slightly different.
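
For example, replacing the byte comparison in your snippet with an estimate comparison would pass (a sketch only; the 1% relative tolerance below is arbitrary, not a library recommendation):

        double est1 = hll1.getEstimate();
        double est2 = hll2.getEstimate();
        // Allow a small relative tolerance rather than demanding exact equality.
        if (Math.abs(est1 - est2) > 0.01 * Math.max(est1, est2)) {
            throw new AssertionError("Estimates differ beyond tolerance");
        }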

In your code you fed a lgK=12 sketch only 9 items, so internally the sketches were still in a "warm-up" coupon hash-SET mode, not yet in HLL mode. When deserialized, the hash-SET of the second sketch had exactly the same coupons, but in a different order. You can see this in the sketch.toString(true, true, true) output below:

### HLL SKETCH SUMMARY: 
  Log Config K   : 12
  Hll Target     : HLL_8
  Current Mode   : SET
  LB             : 9.0
  Estimate       : 9.000000178813938
  UB             : 9.000449542078428
  OutOfOrder Flag: true
### HLL SKETCH HLL DETAIL: 
         0  39157696     1
         9   7621065     3
        10  34580138     1
        12  62715826     5
        14  53139118     3
        15  47787151     3
        18  66406738     1
        19  26341843     1
        24  33360856     2
### HLL SKETCH AUX DETAIL: 

### HLL SKETCH SUMMARY: 
  Log Config K   : 12
  Hll Target     : HLL_8
  Current Mode   : SET
  LB             : 9.0
  Estimate       : 9.000000178813938
  UB             : 9.000449542078428
  OutOfOrder Flag: true
### HLL SKETCH HLL DETAIL: 
         0  39157696     1
         9   7621065     3
        10  34580138     1
        14  53139118     3
        15  47787151     3
        18  62715826     5
        19  26341843     1
        24  33360856     2
        29  66406738     1
### HLL SKETCH AUX DETAIL: 

hpx7 commented 7 years ago

Thanks for the explanation @leerho.

Spark requires us to serialize/deserialize all our objects to/from byte arrays. I'd like to verify the "correctness" of serde for the objects that we use. How would you suggest I do that in this case? Perhaps I can compare the unordered set of coupons?

IMO serde should preserve the complete state of an object; I don't think nondeterminism comes into play here. For example, if an object were using an RNG, you would want to serialize/deserialize the seed.

AlexanderSaydakov commented 7 years ago

I would suggest testing your code with some sort of dummy (mock) sketches if possible.

Regarding what serialization should or should not preserve: it is a design choice. Consider a hash map. It can be serialized exactly, with all its empty slots, so that the deserialized map is identical. However, that is wasteful. You can save a lot of space and time if you are willing to drop the requirement of bit-to-bit equality and preserve only the essential parts. That means the reconstructed map can have entries in different slots than the original, but for all practical purposes it shouldn't matter.
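
As an illustration of that design choice (generic Java, not how this library serializes anything), a compact serializer writes only the live entries, so the rebuilt map is equivalent but need not be bit-identical:

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only -- not this library's serialization format.
    static byte[] serialize(Map<String, Integer> map) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeInt(map.size());              // entry count only...
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            out.writeUTF(e.getKey());          // ...then the live entries;
            out.writeInt(e.getValue());        // empty slots are dropped.
        }
        return baos.toByteArray();
    }

    static Map<String, Integer> deserialize(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        Map<String, Integer> map = new HashMap<>();
        for (int n = in.readInt(); n > 0; n--) {
            map.put(in.readUTF(), in.readInt()); // entries may land in different slots
        }
        return map;
    }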

hpx7 commented 7 years ago

Yes, but there should be a way to verify that the original hash map and the deserialized one are equal (e.g. by comparing the actual sets of entries).

How do you guys test serde correctness for the sketches? What if there was a subtle bug in the HLL serde code that caused the estimate to decrease by 0.1 after each subsequent serialize/deserialize? Certainly there must be a way to verify that serde preserves "equality".

leerho commented 7 years ago

@hpx7

I'd like to verify the "correctness" of serde for the objects that we use. How would you suggest I do that in this case?

I suggest you separate the testing of your framework or system from the validation of the sketches themselves, i.e., by using mocks. If your system can transport byte arrays without corrupting them in any way, you will be fine. (This, it turns out, is nontrivial: corruption can, and has, happened in various system environments.)
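
Such a transport test needs no real sketch at all; a random byte array works as a mock payload. A sketch (roundTripThroughYourSystem is a hypothetical placeholder for your own transport path):

    byte[] payload = new byte[4096];
    new java.util.Random(42).nextBytes(payload);            // mock "sketch image"
    byte[] received = roundTripThroughYourSystem(payload);  // your serde/transport hook
    if (!java.util.Arrays.equals(payload, received)) {
        throw new AssertionError("Transport corrupted the byte array");
    }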

Perhaps I can compare the unordered set of coupons?

Do not go there. Do not make any assumptions about the internals of the byte-array data structures. The definitions of these structures are package-private, and we reserve the right to change them at any time. They do change from time to time, and when they do we go to great lengths to maintain backward compatibility (via conversions if necessary) with earlier versions. If you had made assumptions about these structures, your code would break.

IMO serde should preserve the complete state of an object; I don't think nondeterminism comes into play here. For example, if an object were using an RNG, you would want to serialize/deserialize the seed.

I strongly disagree. Forcing these probabilistic structures to be deterministic would cost unnecessary storage space and processing time, and would be a burden to all users that don't require this level of determinism.

These sketches have been designed for systems where space and time are critically important, and we have optimized for that. For example, we absolutely avoid sorting arrays when it is not necessary. If the algorithm doesn't require a specific seed, why serialize one at extra cost in space and time?

How do you guys test serde correctness for the sketches?

That is our job. There is quite a bit of science and art in the testing, characterization, and validation of these algorithms, which we have spent years developing. We run billions of sketches against many TB of data in characterization tests that can take days to complete. Many customers have used these sketches in production, quite successfully, for years.

The guarantee is that a deserialized sketch will return a result that is within the statistical error bounds of the original sketch at the associated confidence interval.

Period.

hpx7 commented 7 years ago

I'm starting to understand, although I'm not sure I'm fully convinced yet. I suppose I haven't had much experience working in domains where space and time are so critical that things like reproducibility have to be sacrificed.

I'm still not sure I agree with HLL not implementing .equals(), though. It really seems like, with the current implementation, you could write a correct .equals() for the class. I suppose you haven't done so mostly because you don't see the need for it.

Here's my use case: we store versioned blobs which contain serialized summaries about tables. As part of the summaries we serialize some sketches from this library. We have a deserialization test suite which helps us catch instances where we need to bump the blob version (i.e. when the serialization format changes). For example, when updating this library from 0.9.1 to 0.10.0, I would like to know whether the binary format has changed and thus whether I should bump the blob version, which will invalidate the currently serialized versions.

One strategy I could go with is just being conservative and always bumping the blob version when I update this library. But ideally I would have some way to take, e.g., patch-version upgrades with confidence that this won't be necessary. I understand you don't want to make guarantees about the serialization format across updates (you don't want me to depend on internal APIs), but a deserialization suite on our side could ideally catch serialization format changes.

I don't want to argue this point too much though; for now we've been implementing our own .equals() methods for these sketches for our deserialization test suite. For HLL we'll just use getEstimate() (with an error bound) to test for equality, as @leerho alludes to.

leerho commented 7 years ago

@hpx7

I'm still not sure I agree with HLL not implementing .equals(), though. It really seems like, with the current implementation, you could write a correct .equals() for the class. I suppose you haven't done so mostly because you don't see the need for it.

No. It has nothing to do with perceived need. It has to do with the impact on performance and with users misinterpreting what .equals() means.

For the QS sketch, even if you feed the identical stream in the same order to two different QS sketches, the items that end up being retained by the two sketches will be different (here, even sorting won't help). However, the results, which can be represented by a distribution of values, will be statistically the same within defined error bounds. Thus we consider the two sketches "equivalent", but a .equals() method would return false!

Here we are caught between a rock and a hard place. If we make the algorithm deterministic, we risk some users experiencing really weird correlations in their data that would be hard to explain. If we make it random and provide a .equals() method, it would be meaningless.

Here's my use case: we store versioned blobs which contain serialized summaries about tables. As part of the summaries we serialize some sketches from this library. We have a deserialization test suite which helps us catch instances where we need to bump the blob version (i.e. when the serialization format changes). For example, when updating this library from 0.9.1 to 0.10.0, I would like to know whether the binary format has changed and thus whether I should bump the blob version, which will invalidate the currently serialized versions.

I understand your issue here, but we have solved it in a different way. We make a distinction between software version (code) and serialization version (data). Just because the software release version changes does not mean the SerVer changes. The SerVer is encoded in every serialized image of a sketch. If the SerVer is older than what the current code supports by default, the deserializing code automatically uses a conversion method so it can read the data. This has allowed us to continue to perform read and merge operations on sketches that were serialized and stored 5 years ago. In that time the Theta sketch SerVer has changed only twice, and the Quantiles sketch SerVer has changed twice. All the other sketches are still on SerVer 1.

So from a sketch user's point of view, it is our design intent that you will never lose access to the data that the sketch statistically represents, even if the SerVer changes and independent of SW releases. So far we have been able to uphold that. And since we handle SerVer changes internally, the user should not care, and you really don't want to know. We have never guaranteed that a user can perform byte-level comparisons between sketch images.
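
If you still want a guard in your own test suite, one check that stays within the guarantee is a golden-image test: keep a serialized image produced by the older release as a test resource and assert that the new release still reads it to an equivalent result. A sketch (the resource path and the true count of 9, matching your snippet above, are hypothetical):

    byte[] golden = Files.readAllBytes(Paths.get("src/test/resources/hll-golden.bin"));
    Union restored = Union.heapify(Memory.wrap(golden));
    double trueCount = 9.0;  // the cardinality that was fed to the golden sketch
    // The true value must fall within the deserialized sketch's 2-sigma bounds.
    if (restored.getLowerBound(2) > trueCount || restored.getUpperBound(2) < trueCount) {
        throw new AssertionError("Golden image no longer reads to an equivalent sketch");
    }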

for now we've been implementing our own .equals() methods for these sketches for our deserialization test suite.

I have tried to explain, as completely and comprehensively as I can, why you should not be doing this, and I have proposed testing strategies that don't require it (using mocks). This is not something we can or will support; you are on your own.

hpx7 commented 7 years ago

Thank you for the detailed explanation @leerho!

For the HLL case, couldn't you provide an .equals() method which compares the unordered set of coupons when the sketch is in the warm-up stage? This wouldn't affect serialization performance; and since HLL is deterministic, I don't think there is a risk of users misinterpreting its functionality.

I hadn't considered the "accidental correlation" issue for the QS sketch; thank you for explaining that. A simple way to make it reproducible, though, would be to optionally take a seed in the builder (and randomize it when one is not provided).

It's good to know that you maintain backwards compatibility across SerVers! Do you think there is any value in exposing the SerVer to users? One use case I can think of: say you reduce the space footprint of a sketch by 50%. It would be nice to be notified that the SerVer has been bumped, so that we can invalidate the old sketch and have it recomputed (allowing us to safely delete the old version of the sketch).

jmalkin commented 7 years ago

For quantiles, DoublesSketch and ItemsSketch both expose a Random() object in the abstract base class. So users have the option to set the seed, precisely for testing purposes.
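
For example (a sketch; the field name rand and its visibility are assumed from the description above and may differ between releases):

    // Testing only -- never fix the seed in production (see below).
    DoublesSketch.rand.setSeed(123L);                 // assumed exposed static Random
    UpdateDoublesSketch qs = DoublesSketch.builder().build();
    for (int i = 0; i < 1000; i++) { qs.update(i); }  // now reproducible run to run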

Updated: The quantiles sketch does not serialize a seed, nor will it. Inducing correlations between sketches does more than just skew/bias the results. It fundamentally breaks the sketches. The probability guarantees will no longer apply; any results must be treated as wrong.

As with all other aspects of the sketching library, fixing the seed with a quantiles sketch is not guaranteed to produce identical results between different library versions, even minor point releases. Our only guarantees are around API compliance and error bounds.

I hope it goes without saying that fixing the seed in a production environment is a Very Bad Idea.

leerho commented 7 years ago

@hpx7

For the HLL case, couldn't you provide an .equals() method which compares the unordered set of coupons when the sketch is in the warm-up stage? This wouldn't affect serialization performance; and since HLL is deterministic, I don't think there is a risk of users misinterpreting its functionality.

No. That some sketches in certain modes happen to be deterministic is an implementation detail. We don't want users implementing code against these hidden modes and then having to track these low-level implementation details. We must keep the freedom to change these hidden structures in order to extend the sketches in the future.

.equals() would not affect serialization performance, per se. However, implementing .equals() and .hashCode() would tempt users to use sketches as keys in hash tables, where the performance would be horrible! And then we would have to support such operations. We can't afford to do that, and we won't.