apache / datasketches-cpp

Core C++ Sketch Library
https://datasketches.apache.org
Apache License 2.0
223 stars 71 forks source link

the getEstimate result for union of HllSketch is not stable #374

Closed dyf6372 closed 1 year ago

dyf6372 commented 1 year ago

We constructed multiple hllsketches, and then merged multiple hllsketches together using a union object. We found that the result of getEstimate is not stable if we shuffle the order of merging hllsketches, but the result of getCompositeEstimate is always consistent.

This is the demo program


import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;

public class App 
{
    public static void main( String[] args )
    {
        BufferedReader reader;

        ArrayList<HllSketch> arr_of_hll = new ArrayList<>();

        try {
            reader = new BufferedReader(new FileReader("data.txt"));
            String line = reader.readLine();

            while (line != null) {
                byte[] decodedBytes = null;
                try {
                    decodedBytes = Base64.getDecoder().decode(line);
                    HllSketch hll = HllSketch.heapify(decodedBytes);
                    arr_of_hll.add(hll);
                } catch (Exception e) {
                    System.out.println("not supported:" + e);
                }
                line = reader.readLine();
            }

            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        Union union = new Union(12);
        for (HllSketch sketch : arr_of_hll)
            union.update(sketch);
        System.out.println(union.getEstimate());
        //System.out.println(union.getCompositeEstimate());

        for (int j = 0; j <= 10; j++) {
            Union union2 = new Union(12);
            Collections.shuffle(arr_of_hll);
            for (HllSketch hllSketch : arr_of_hll)
                union2.update(hllSketch);
            System.out.println(union2.getEstimate());
            //System.out.println(union2.getCompositeEstimate());
        }
    }
}

And sample data

AwEHDAUIAAEJAAAAArXJFMZJ2g4Hb5AHkdJ4C/OLhhE06EMJODa1BrkbZwy8d+gG
AwEHDAUIAAEJAAAAF0UmCgdtvQzodV4HzepeBs91AQa0s2IGNZ4cEHevtwnc45oL
AwEHDAUIAAEJAAAA4ssACcV9vgUq5v8SLrlNBN8aUgcZOdwSujOQB7nMuQi/1e4E
AwEHDAUIAAEJAAAA5U7uBWgf/QRq7aMQLHy9Bg11Nwa4r8cOd9Z1ERgvMQr/LvcE
AwEHDAUIAAEKAAAABKLEBKojvwVOqGoJ7yRhB3F8gQVzWu0ENM3nBdjSVgVbriMJfKJGBg==
AwEHDAUIAAEKAAAA0PqVH0ql6gdrHQAHDkFZCPDCrQhfYe0MnDzrE70FmQVeBscF38MUCQ==
AwEHDAUIAAELAAAAIHxYBqLphQTCLgwH2j0ICoyFrwXQ+AISNL+ZBNUxEQSZr6sS2r3GCL7+rgk=
AwEHDAUIAAELAAAAQVaMDoLmywbFXK4FBkbpBS2yHQQO5tUEUbPUCJgBwQqz6SAKl9w8BVgPZgc=
AwEHDAUIAAELAAAAoW9mDQUPdgaLSlAOzikMBQ4USAjwFEUFbox4BZIuRwbTk3kMUg0JBl+Nrw8=
AwEHDAUIAAELAAAAwuq7GGfVvQeLSlAO7mhTBXZflRqSLkcGVgcuCVwTOgT816oS3czEBd41uQY=
AwEHDAUIAAEMAAAAALBICKG6kwfE7p8KKazdGm9LtQaMzLQSrZwwCC6vxg3PXg8HTi5sCG0ZlwybP/MI
AwEHDAUIAAEMAAAAQDrXBqEfmAYlIZUIZqPKBYffngnhNLsErZIKC3AgIQ2ShdIZTWnnBh2avQjfyGEF
AwEHDAUIAAEMAAAAwGoVB6ORRwZlTtYGCsClFjKHqwUNFp4EkYAmEHLjQhPzkD4G+L4MBRp65BPbUcQM
AwEHDAUIAAENAAAAIE9FCBBNLQvGqHAG6UIgExH3FyNs218HULxkBXHqFw4z3MkH9Nz6BznLYQUaN5sJntyWBQ==
AwEHDAUIAAENAAAA4+kpDSsbKAcFqpcEKbkyCsuiqQvtPjwKrr05EFF94wlyafEV9zuPEetdygYaeuQTPm+hEA==
AwEHDAUIAAEOAAAAQqcnBITrbAemBSMJyMTaBixiDAuNSzcHEWC7BtNtTg1UVBYIDQ/FBHnOsAn7wCcFHkRTBC1nQwo=
AwEHDAUIAAEPAAAAQoGyCMM9fQRFh7AHBo9bJskFeRGKyM8E5lDQBq7LcQcPojoFsDCeC5ZsPwRPYewTfNmYBR3+PgUfy48I
AwEHDAUIAAEPAAAAgJAABOQKUwVlcVAEiD/NCcvWYAXs838E7kUZBTbbiwt224IKlX6kE5bJvQ/ZRWAIefpLCPyxgw1/UXAK
AwEHDAUIAAEPAAAAl2Z6CIRWbjYE/FAOx4zjCSmGiwdM8q4ETUHvBQ4b1AYPeCQSUm7NBddVAgv4mJMI8gS4C3twMgx8qX4W
AwEHDAUIAAEQAAAAZZXEDadrxRJIYVEH6TtDDksXngosfL0GDrt2DPBABwTR8AkFlK0JClVM3QQVsGMFuK/HDs4lEgQeBycE/y73BA==
AwEHDAUIAAEQAAAAoLuCFQHRrgRi5bYLtiGeBGVW7hKn+nEPSsQ0B22p8w9OQE0KL7TREZMzmQe2uvgZVyQ9BKG/oAY9jhMH3rGtBQ==
AwEHDAUIAAERAAAAQtvqDJ80cRJ8BYMS6ZzjDzlq/AburcAO0PqVH14GxwX0Qp4J9W5DBriDYQnZRSQGOTgdDXvxigdcNa8Kfj4HBV9h7Qw=
AwEHDAUIAAESAAAAYN7UCqOrTwYlblgG6bTjCArZdgTLw4gO7W1zCLZ3Lw/jdacF81ahDpRbWgf2ouwJd8Q1BlhIBgV2tNsIvLe5Cv0MsQT+tMUL
AwEHDAUIAAESAAAAu0hSCmGFMgaCzasJr0g/BQU+zAWm64QGqNAiCgnzXguLFQYEz5MyBK9mywa0rJsEGE9JDbn0GBVIMkENvCI9C92IlQpeVlQL
AwEHDAUIAAEVAAAAQJDTBQK1yRRNC18JZEawBoXinAXnGA8KYgT8CMozbw9CyH0TedbiCG22OAuOaGwIkfKCC1zhBArzl2wHyMo2E39OhAjKWdUJO079CZzZ7QdZZJIO
AwEHDAUIAAEVAAAAgAJuD+FLrwml1qkZpYUzCAZ1cAUn6cQOqvsaCQp0VhDtYxoF+RgfBa+y3gRRML4SkmKHCRWusggW9qwTbRmXDNmuEwWNSSMOVkCiBEb5oAZ+Wi4K
AwEHDAUIAAEVAAAA0pQXBfP6BwxXzd8KxX2+BSbFHgXnr1sJKub/EgtrYQiMy5wFDdTmB0t0DwoHDToKEc5cCVJE8AfspIsEFBTzBsfoQAcW/2wMSwGhCzBZ/w3fGlIH
AwEHDAUIAAEWAAAAIYUqBcUYQQZGK80Tx/q0C0hs+gtpN8gHimCMBKbc9AQtWAkJyOQjBBD5ORFRxBYJM4tGBV9jLAfYofMHN9YGEjii0QUPP7QKWNzxBPyUgCNeC+AHP525DQ==
AwEHDAYIAAEaAAAAA5+QBHpzmAXIvWgLy4lNBMyGYw/PMJIM02moC5j17QjZgL8JGh+mBdxUQQid3q0GI6xxB2Q8jwp/czQFMp+rCNoVywXyGrUXc9XNBzRDIA03LooI+H0xCHkbewu66sMFPCpxBj8xrws=
AwEHDAYIAAEbAAAAgV3iCkVjBgQcjRoEx0rRBInnpwQLZYUGDLxwHdG/FAsTar0FFRNIBlb45gbXuckInOJhBJ243w3n/Q4HoqAmCyN+qQjklSoQ55vYByPyfBPuDzsJ7nGzBe8DohP22t8IOp3hDPvQVwj/hRYH

And output

432.06531374839653
433.144452686294
429.9279343752412
430.99806224307883
432.05635441694574
430.9869972381987
430.983505239331
428.8676074062511
431.0056561273326
429.9337714699233
432.0610544324312
433.14998954166344
jmalkin commented 1 year ago

The example code was Java, but since the behavior will be the same in this repo's C++ or Java we can just address this here.

An HLL sketch moves through up to 3 different modes over its lifetime. We start in what we call a "coupon collection" phase, which itself has 2 steps. The first is to store a list of coupons. That fairly quickly turns into a hash set. When the size of the hash table would exceed the size of the final HLL array, we transition into that final HLL mode.

Additionally, there are several estimators. The basic one is the composite estimator, which itself is a mix of several models. That one is order-independent. But we also have what is known as a HIP (Historic Inverse Probability) estimator, which provides more accurate estimates but at the cost of being order-dependent. That HIP estimator works as long as we have the inputs presented in order -- and we can actually be a little more general and remain in that mode as long as we are processing raw inputs or coupons.

Because the HIP estimator stores a little extra information, we can actually use that even when the sketch has transitioned into HLL mode as long as we have only fed in raw inputs or coupons. The ability to use the HIP estimator remains until we union two sketches already in HLL mode.

There are 3 ways to query a sketch's estimate:

When I looked at the set of input sketches, all of them were in SET mode. As a result, getEstimate() was using the order-dependent HIP estimator, which is why you saw varying results. Calling getCompositeEstimate() returned a stable result, but on average that will have slightly larger error. The choice of estimator here lets you make a trade-off between accuracy and order-independence.

This behavior is ultimately a feature, not a bug. Please feel free to ask if you have additional questions.

dyf6372 commented 1 year ago

@jmalkin Thank you very much for your meticulous reply, the complete program is implemented in C ++, after finding this behavior, I wrote a small demo with Java to reproduce the same behavior.