aggregateknowledge / java-hll

Java library for the HyperLogLog algorithm
http://research.neustar.biz/2013/12/24/open-source-release-java-hll/
Apache License 2.0
311 stars 70 forks source link

Out-of-bounds when a blank HLL is unioned with a smaller blank HLL #11

Open Akninirith opened 9 years ago

Akninirith commented 9 years ago

A union operation done on two HLL objects immediately after they are initialized, with a smaller log2m in the parameter HLL than the one doing the union, throws an ArrayIndexOutOfBoundsException. The below code should be capable of reproducing this error.

HLL shortHLL = new HLL(14, 6, -1, false, HLLType.FULL); HLL longHLL = new HLL(18, 6, -1, false, HLLType.FULL); longHLL.union(shortHLL);

ghost commented 9 years ago

@Akninirith Thanks for pointing this out! This should throw an IllegalArgumentException for now, since I haven't reasoned through all the rules for union-ing mixed parameter HLLs.

If you have an immediate use case involving the union of different log2m HLLs, I'd be happy to commit a "folding" static method that would produce a copy of the provided HLL at smaller log2m.

Akninirith commented 9 years ago

There isn't a use case for this, at the moment at least, but if that changes this thread will be notified ASAP. Regardless, thank you for the timely response!

yerenkow commented 9 years ago

Is this discussed to be case unioning any kind of HLL with EMPTY/EXPLICIT ? Am I correct that for different log2m & regwidth there will be different subsets of bits and different values for counters (both counter # and value to be set), so, basically this would be no-go to correctly merge one counters to another? Or there are cases when this possible?

rcrezende commented 6 years ago

@ghost one scenario is hundreds of thousands of sparse HLL_s (or even millions) with a small regwidth_s is stored in disk. Later, merging happens into a single full HLL_m that can represent bigger cardinality values (regwidth_m>=regwidth_s) just for the purpose of computing the cardinality of the union. For this scenario, HLL_m would not be persisted. That might save disk storage and even reduce read cost.

DanySK commented 4 years ago

This may fail suprisingly, though. I ran into this problem while implementing the algorithms proposed in this paper.

Even though it's a known issue that won't get solved as it is deemed unimportant / too cornery, I think it should at least get accurately documented.