[Discuss] Replacing hyperUnique as 'default' distinct count sketch

gianm commented 5 years ago

Branching off a discussion from #6743. See @leerho's comment https://github.com/apache/incubator-druid/issues/6743#issuecomment-449490568 for a rationale as to why it could make sense to transition to HllSketch. However, it would be a delicate transition, requiring solutions to the following problems.

The on-disk format is not compatible, and cannot be, due to the difference in hash functions used. So the migration would be an extended one, and we should expect that some users might never migrate, due to an inability to reindex data. (Not all users retain copies of their raw data, despite the fact that it is a best practice.)
The fact that the new one is in an extension and the old one is in core presents the opportunity for user confusion. Ideally they'd both be in core or both be in extensions.
Druid SQL's COUNT(DISTINCT x) operator uses hyperUnique currently. Ideally, it would adapt to use whatever sketch aggregator matches your segments, when run on a complex column. And if you run it on strings, it should use the 'best' available.

Alternative approaches?

Patching hyperUnique's implementation to improve its error characteristics. I'm not sure if this is possible while retaining the same on-disk format. If not, it would need to involve an ability to read both the current format and a new format.

b-slim commented 5 years ago

Druid SQL's COUNT(DISTINCT x) operator uses hyperUnique currently.

This is side comment, but i think it is miss leading to expose the approximate count distinct as the default count distinct since it does not adhere to SQL standard.

Now to the issue of moving form Hyper unique to HllSketch I am kind of sure this kind of question will re occur again and again and every-time that a new approximate method outperform a an old one or maybe offers different tradeoffs. This tells me that probably the best way to solve this is to add a built in UDF for every different sketch algorithm with its respective parameter, this will give the user access to all the core supported sketches without issue of compatibilities.

The fact that the new one is in an extension and the old one is in core presents the opportunity for user confusion. Ideally they'd both be in core or both be in extensions.

Having HllSketch as part of core makes perfect sense to me the library has minimal dependency and does very well on what is suppose to do

leerho commented 5 years ago

The on-disk format is not compatible, and cannot be, due to the difference in hash functions used.

Let me clarify the problem with the Druid-HLL sketch with respect to its use of Hash Functions.

The Druid-HLL sketch delegates the responsibility of hashing the input items to the user.

The only requirement is that the resulting hash be at least 10 bytes long. Although this may seem to be advantageous by providing user flexibility, it has very serious drawbacks.

When sketches are serialized and stored, the resulting data is a highly compressed jumble of bits similar to a picture image. Unless you understand how to decode the sequence of bits and know exactly how it was encoded, the sketch image is not very useful. The chosen hashing function (including any seeding of the hash) determines how the bits are encoded. Although it is not required to know what hash function (and seed) was used to encode the sketch in order to decode it and obtain an estimate, any attempt to merge two sketches will result in garbage unless the same exact hash function (and seed) was used to create the two sketches.

Suppose a user in department A creates a history of sketches using one hash function, and a user in department B creates a history of sketches using a slightly different hash function. These two histories can never be merged together. The Druid-HLL sketch therefore must rely on human coordination between the users that created and stored the sketches, and the users that wish to merge these sketches later on in some query process. In fact these two groups of users could be separated not only by department, but by company and/or long periods of time.

The Druid-HLL sketch has no means of detecting that two sketches may have been created using different hash functions (or seed). in other words, this is a silent failure mechanism.

The DataSketches-HLL sketch always uses the same hash function (and seed) and it cannot be changed by the user.

This was an intentional design decision to prevent the above disaster that is almost guaranteed to happen in large corporations.

The DataSketches-Theta (and Tuple and CPC) sketch families always use the same hash function, but do allow users to optionally specify a different seed.

This was a specific design decision to provide certain groups that must sketch sensitive data with a bit more security in their stored data. The chosen seed is never stored in the sketch image for this reason.

However, the Theta sketches have a detection mechanism as a safety measure to detect if different seeds were used in the creation of the sketch images. All merge operations first check if the sketches are, in fact, compatible, and throw an exception if not.

Patching hyperUnique's implementation to improve its error characteristics.

The design flaw that is causing the poor error on merging is fundamental. The Druid-HLL sketch is not keeping sufficient information in its stored image to decode it accurately. Therefore, there is no fancy algorithm that exists that could reconstruct this missing data. And the Druid-HLL sketch provides no warning mechanism that these estimation errors will occur.

Even if it was fixed with a new format that would retain the missing data, merging old sketches with new ones will still have the same error problems.

The flaws in the design of the Druid-HLL sketch don't stop here, it has many other design problems. I see no benefit to attempting to "fix" the Druid-HLL sketch as it would require a complete redesign from scratch. -- But we have already designed a far superior HLL sketch as part of the DataSketches library. I strongly suggest we document and communicate these issues to the user community, and suggest they migrate to the DataSketches library.

gianm commented 5 years ago

The Druid-HLL sketch delegates the responsibility of hashing the input items to the user.

This isn't true in practice. The druid-hll library lets callers use any hash function, but Druid doesn't expose that to end users. It always uses Hashing.murmur3_128() to avoid the problems you mentioned.

The design flaw that is causing the poor error on merging is fundamental. The Druid-HLL sketch is not keeping sufficient information in its stored image to decode it accurately.

Out of curiosity what is the missing information? I didn't see details on that in the link you provided.

I see no benefit to attempting to "fix" the Druid-HLL sketch as it would require a complete redesign from scratch.

The benefit would be giving users a migration path from a possibly large number of already-stored sketches, other than "sorry, you have to reindex all of your historical data". Which, for some users, may be inconvenient or impossible. Better would be "you can migrate to a new format for newly ingested data, and it will give you better behavior for that new data, degrading to the old behavior if you query a time range that covers both the old and new formats." There is value in that if it is possible. (I don't feel like I understand the details well enough to say if it is possible.)

But we have already designed a far superior HLL sketch as part of the DataSketches library. I strongly suggest we document and communicate these issues to the user community, and suggest they migrate to the DataSketches library.

It does sound like you have designed a better one and thank you for that!

gianm commented 5 years ago

Now to the issue of moving form Hyper unique to HllSketch I am kind of sure this kind of question will re occur again and again and every-time that a new approximate method outperform a an old one or maybe offers different tradeoffs. This tells me that probably the best way to solve this is to add a built in UDF for every different sketch algorithm with its respective parameter, this will give the user access to all the core supported sketches without issue of compatibilities.

I think we want to do that too (like, provide Druid SQL functions so users can choose whether to do approx count distinct with druid-hll, datasketches-hll, datasketches-theta, what-have-you). I think it'd also be nice to also have a generic APPROX_COUNT_DISTINCT function that uses the "correct" sketch aggregator based on what format of sketch you have actually stored in your segments. Something that makes life easier for users. And maybe give it a concept of the 'current best' one to use, and have it use that if you don't specify a specific one.

leerho commented 5 years ago

This isn't true in practice. The Druid-HLL library lets callers use any hash function, but Druid doesn't expose that to end users. It always uses Hashing.murmur3_128() to avoid the problems you mentioned.

Well that is a relief! This reveals my lack of understanding of the internals of Druid, as I was just looking at the design of the HLL classes, and I shudder by what I see! :)

Out of curiosity what is the missing information? I didn't see details on that in the link you provided.

It is complicated and requires deep understanding of the HLL Stochastic process and how it evolves as a function of n. Nonetheless, I will attempt to explain it.

The HLL algorithm is usually implemented with at least 2 distinct phases: sparse and dense. The sparse phase is used for very low counts and saves space by not needing to allocate the full HLL dense array. At some point, the sketch will graduate to dense mode and allocate the final full array of k slots. (e.g., if logK=11, k=2048 slots). For practical reasons a slot must be at least 6 bits. Simple implementations of HLL assign a byte (8 bits) per slot so the space required for the sketch in dense mode (logK=11) would be 2048 bytes plus some header bytes. More sophisticated implementations, like the Druid-HLL and the DataSketches-HLL, perform a form of compression so that each slot only requires 4-bits. So now the total space required for the sketch would be 1024 bytes plus some header bytes. However, this compression scheme only works "most of the time" and occasionally 4 bits is not enough. When this happens, a special "exception value" is placed in the relevant slot which defers the lookup of the true value for that slot from small hash-map of exception values keyed off the slot address. How often these "exceptions" occur is a function of the size of the sketch, n, and probabilistic occurrence of certain hash values. So it is more likely to happen when merging lots of relatively large sketches together, which is what we do in our test.

We know from our own extensive characterization studies (that sometimes run for days!) that an HLL 4-bit sketch implementation where logK=11 needs at least 4 exception registers, and could require more, although it would be extremely rare; nevertheless, it can never be ruled out. In theory, for a LogK=11 sketch, each exception register would need only 24 bits. Since the DS-HLL sketch allows up to LogK=21, we use 32 bits per exception register.

The mistake that the developers of the Druid-HLL sketch made was that they only provided space for one exception register of 24 bits. Due to insufficient understanding of the probabilities of exceptions and insufficient testing, they assumed only one register was enough and did not think through what the consequences would be. You can see this at the top of the HyperLogLogCollector class: The Header has only 7 bytes. Byte 4 holds the exception value and bytes 5 and 6 hold the slot address. The total space allocated for the sketch is 1024 + 7 = 1031 bytes.

If more than one slot (of the 2048 slots) needs an exception register, there is no place to record the exception, so the prior exception's record is overwritten. Thus, information is lost. The state of the HLL array is now corrupted and accurate recovery of the estimate is no longer possible. The consequence is severe undercounting, which is revealed in our testing.

I fully understand what the benefits would be, if it were possible to correct for this flaw. But what you are asking for is the ability to merge the flawed HLL sketches with some new fixed HLL sketch. The 4-bit compression algorithm is rather complex and would have to be completely be reviewed, redesigned and tested. I don't have the time or motivation to attempt this type of redesign.

Even though we also use the MurmurHash3_64_128 hash function, we do use a specific static final non-zero seed. Thus our hash functions are incompatible. I'm not even sure that the guava hash implementation would necessarily produce the same bits even if the seed issue didn't exist, but this could be verified.

Even producing a variation of our HLL sketch that could merge with the Druid-HLL sketch images would require considerable work, as we would have to extend our design to allow for variable seeds. Then we would have to design an adaptor that could merge from the Druid-HLL sketch, which is also not trivial.

I will keep thinking about this, but don't get your hopes up :)

leerho commented 5 years ago

@gianm It would help me a great deal if you could list the methods of HyperLogLogCollector (including constructors) that are actually used in Druid.

Thanks!

gianm commented 5 years ago

@leerho "Find usages" on each of those methods in IntelliJ (or the equivalent in Eclipse) should bring it to light relatively quickly. All the relevant usages would be: anything in Druid modules that are not druid-hll. (Since anything that is only used by druid-hll is an "internal" thing.)

leerho commented 5 years ago

@gianm OK, I was hoping to avoid forking the entire incubator-druid repo :) But I bit the bullet and have spent quite a bit of time getting Eclipse set up right. Nonetheless, I have eliminated most of the original problems in Eclipse. Now I am down to 5 errors that I just can't seem to fix in Eclipse without trying to modify the code (which I don't want to do). I would appreciate a little help in getting these few problems resolved.

I forked apache/incubator-druid Jan 11, 2019 commit #6815 <5d2947c> which has a blue check, so somehow the commit compiled ok.

Nonetheless, I show the following compile errors:

druid-processing/org.apache.druid/query.aggregation.post/ExpressionPostAggregator.java L144-L155:

Error: The constructor ExpressionPostAggregator(String, String, String, ExprMacroTable, aggregators.entrySet().stream().collect(Collectors.toMap(( entry) -> entry.getKey(), ( entry) -> entry.getValue()::finalizeComputation))) is undefined.

druid-server/org.apache.druid.server.coordinator/DruidCoordinator.java L691-L692:

Error: Cannot refer to 'this' nor 'super' while explicitly invoking a constructor.

druid-server/org.apache.druid.server/AsyncQueryForwardingServlet.java L535-L540:

Error: The method makeRequestMetrics(GenericQueryMetricsFactory, QueryToolChest<T,Query>, Query, String) in the type DruidMetrics is not applicable for the arguments (GenericQueryMetricsFactory, QueryToolChest, Query, String).

druid-server/org.apache.druid.server.coordination/ServerManager.java L324:

Error: The type QueryMetrics does not define reportSegmentAndCacheTime(Object, long) that is applicable here

druid-server/org.apache.druid.server.coordination/ServerManager.java L325:

Error: The method segment(String) is undefined for the type Object

Also I found several instances where code in the /main/... path depends on code in the /test/ path. This is a bad practice and I don't think would run in production. Specifically druid-benchmarks has dependencies on druid-processing/target/test-classes and druid-sql/target/test-classes. This, of course, begs the question of why benchmark code is in /main/ in the first place?

Any help would be appreciated,

Thanks.

b-slim commented 5 years ago

@leerho are you able to build the project via running mvn clean install -DskipTests that might solve it ?

leerho commented 5 years ago

Strange: When I mvn clean install -DskipTests from the command line it builds OK. But for some reason in Eclipse I get the above 5 errors. So there must be some compiler setting in Eclipse that is more strict than building from the CL. I have reviewed all the compiler error/warning settings and nothing is more severe than a warning. I am compiling with JDK 1.8.0_162.

Attempting Maven/Update Project... doesn't help either.

b-slim commented 5 years ago

A surface level search for error 5 i see the following thread not sure if it is the issue but most likely it is some classpath definition issue.

jon-wei commented 5 years ago

Hm, I am trying with Eclipse and getting the same 5 errors.

Error 2 may be an Eclipse bug: https://bugs.eclipse.org/bugs/show_bug.cgi?id=530748

I'm not sure about the others yet.

jon-wei commented 5 years ago

Maybe related to error 1: https://bugs.eclipse.org/bugs/show_bug.cgi?id=531094

gianm commented 5 years ago

FWIW, I use IntelliJ and it works more or less out of the box. (You do need to run mvn clean package -DskipTests from the command line one time before it works fully, since that runs some code-generation steps that the IDE won't do on its own.)

gianm commented 5 years ago

@leerho You can also check the following files to see how HyperLogLogCollector gets used (all the meaningful usages should be in them):

CardinalityAggregatorFactory
HyperUniquesAggregatorFactory
CardinalityBufferAggregator
HyperUniquesBufferAggregator
HyperUniquesSerde
PreComputedHyperUniquesSerde

Seems to be estimateCardinality, add(byte[]), fold(ByteBuffer), fold(HyperLogLogCollector), compareTo, toByteArray, and toByteBuffer.

leerho commented 5 years ago

@b-slim @jon-wei With respect to the errors above, Alex and I have reduced the number to 3. 2 of the errors are caused by raw generic types not being qualified. I will submit a PR.

As I mentioned earlier, there are backward dependencies from the /main/ tree to the /test/ tree that also need to be corrected. I will submit an issue on these.

leerho commented 5 years ago

Back to the main topic of this issue-thread, we have discussed internally many alternatives to see if there is some way to transition "smoothly" from historically generated Druid-HLL sketches to the DataSketches-HLL (DS-HLL) sketch. There just isn't one. It is very ugly no matter how you try to do it and you still end up with results with errors.

I would recommend that we document that the Druid-HLL sketch has serious problems and encourage users to move to the DS-HLL sketch. Those users that have Druid-HLL history will just have to live with the history and try to either re-sketch their history (If it exists!) or somehow create a clean-line-in-the-sand and move forward with the DS-HLL sketch.

Note that the DS-HLL sketch is extensively used here at Yahoo with history that goes back several years and is actively maintained. Not only do we have strong backward compatibility, but we also have versions of the DS-HLL sketch in C++ (with Python on the way) that are all binary compatible using the same stored images. So you can generate DS-HLL sketches in C++ and interpret and merge them in Java (and soon Python) or visa-versa. There is no other sketch library that offers that!

drcrallen commented 5 years ago

We recently went through some in-depth looks at the HLL implementation in Druid and @leerho is mostly correct on the Druid HLL shortcomings. The reason we haven't switched over to Theta Sketches internally is that for this specific case the devil you know can be better than the devil you don't. We have a lot of experience with the segment size impact and computational impact of the Druid HLL sketch (capacity and resource planning). As such I do not think removing Druid-HLL is a valid path forward.

Making Data Sketches the default recommended HLL (and just retain documentation for the legacy one) would make sense to me. I think the module system is pretty solid in Druid (despite a few hadoop related issues), so I would prefer if HLL were moved to a core module instead of the other way around.

For a few items such as cardinality aggregations against dimensions or dimension value tuples it seems like a good idea to me to move it to the DataSketches library instead. If for no other reason than that is is designed to be resolved at query time instead of ingestion time, and thus the "correctness" of the value hashing is not at all controllable. So the Data Sketches lib would be a natural choice for the query-time HLL sketch operations.

leerho commented 5 years ago

The reason we haven't switched over to Theta Sketches internally is that for this specific case the devil you know can be better than the devil you don't. We have a lot of experience with the segment size impact and computational impact of the Druid HLL sketch (capacity and resource planning).

Clarification: Switching from the Druid-HLL sketch to the DataSketches-Theta sketch family makes no sense. The Theta family of sketches are much larger sketches and provide set intersection functionality that the HLL sketches do not. What I have been proposing is to switch over to the DataSketches-HLL sketch; which has comparable space utilization but with proven accuracy and speed advantages over the Druid-HLL sketch. I think that this is what you meant to say.

gianm commented 5 years ago

As such I do not think removing Druid-HLL is a valid path forward.

100% agree. We do not ever want to break compatibility with existing on-disk segments, so we need to keep it.

I think the module system is pretty solid in Druid (despite a few hadoop related issues), so I would prefer if HLL were moved to a core module instead of the other way around.

Druid core does need some kind of HLL implementation, so they can't all be moved to modules. It's used by DetermineHashedPartitionsJob and IndexTask to determine partitions. We might end up wanting to use it for other things too, like tracking dimension cardinality statistics on the broker to aid in query planning.

Another consideration is that HLL-based counting functionality is a pretty standard thing in databases these days. Offering it as a core thing is nearly universal.

A final consideration is the interaction with Druid SQL. In Druid SQL, by default, COUNT(DISTINCT x) uses HLL. You can set useApproximateCountDistinct = false but by default it's true. There have been objections raised to this (https://github.com/apache/incubator-druid/issues/6814#issuecomment-452056269), but I think it's a good choice for Druid, since it adheres to the Druid philosophy that fast is best, and approximations are how you get there. I feel that doing it in exact mode by default would be un-Druidy. It affects the decision of where to put HLL, since the strategy of count-distinct being approximate by default is only feasible if there's an HLL implementation in core Druid.

drcrallen commented 5 years ago

side note

Dependency wise the extension looks really clean. The only extra it brings in is apache commons math3 and stuff from com.yahoo.datasketches.

What is the likelihood com.yahoo.datasketches or org.apache.commons:commons-math3 will have the same kind of problems Jackson or Guava have with version incompatibilities with other big data handling systems (spark, hive, hadoop, flink, etc)?

$ mvn dependency:tree -pl extensions-core/datasketches
[INFO] Scanning for projects...
[INFO] 
[INFO] -----------< org.apache.druid.extensions:druid-datasketches >-----------
[INFO] Building druid-datasketches 0.13.0-incubating-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ druid-datasketches ---
[INFO] org.apache.druid.extensions:druid-datasketches:jar:0.13.0-incubating-SNAPSHOT
[INFO] +- com.yahoo.datasketches:sketches-core:jar:0.12.0:compile
[INFO] |  \- com.yahoo.datasketches:memory:jar:0.12.0:compile
[INFO] +- org.apache.commons:commons-math3:jar:3.6.1:compile
[INFO] +- org.apache.druid:druid-core:jar:0.13.0-incubating-SNAPSHOT:provided
[INFO] |  +- commons-codec:commons-codec:jar:1.7:provided
[INFO] |  +- commons-io:commons-io:jar:2.5:provided
[INFO] |  +- commons-lang:commons-lang:jar:2.6:provided
[INFO] |  +- org.apache.commons:commons-compress:jar:1.16:provided
[INFO] |  +- org.apache.commons:commons-dbcp2:jar:2.0.1:provided
[INFO] |  |  +- org.apache.commons:commons-pool2:jar:2.2:provided
[INFO] |  |  \- commons-logging:commons-logging:jar:1.1.1:provided
[INFO] |  +- commons-pool:commons-pool:jar:1.6:provided
[INFO] |  +- org.skife.config:config-magic:jar:0.9:provided
[INFO] |  +- org.hibernate:hibernate-validator:jar:5.1.3.Final:provided
[INFO] |  |  +- org.jboss.logging:jboss-logging:jar:3.1.3.GA:provided
[INFO] |  |  \- com.fasterxml:classmate:jar:1.0.0:provided
[INFO] |  +- javax.el:javax.el-api:jar:3.0.0:provided
[INFO] |  +- com.google.guava:guava:jar:16.0.1:provided
[INFO] |  +- com.google.inject:guice:jar:4.1.0:provided
[INFO] |  |  +- javax.inject:javax.inject:jar:1:provided
[INFO] |  |  \- aopalliance:aopalliance:jar:1.0:provided
[INFO] |  +- com.google.inject.extensions:guice-multibindings:jar:4.1.0:provided
[INFO] |  +- org.jdbi:jdbi:jar:2.63.1:provided
[INFO] |  +- joda-time:joda-time:jar:2.9.9:provided
[INFO] |  +- org.apache.logging.log4j:log4j-api:jar:2.5:provided
[INFO] |  +- org.apache.logging.log4j:log4j-core:jar:2.5:provided
[INFO] |  +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.5:provided
[INFO] |  +- org.apache.logging.log4j:log4j-jul:jar:2.5:provided
[INFO] |  +- org.apache.logging.log4j:log4j-1.2-api:jar:2.5:provided
[INFO] |  +- org.slf4j:slf4j-api:jar:1.6.4:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.12:provided
[INFO] |  +- io.airlift:airline:jar:0.7:provided
[INFO] |  +- io.dropwizard.metrics:metrics-core:jar:4.0.0:provided
[INFO] |  +- net.thisptr:jackson-jq:jar:0.0.7:provided
[INFO] |  |  \- org.jruby.joni:joni:jar:2.1.11:provided
[INFO] |  |     \- org.jruby.jcodings:jcodings:jar:1.0.13:provided
[INFO] |  +- it.unimi.dsi:fastutil:jar:8.1.0:provided
[INFO] |  +- com.opencsv:opencsv:jar:4.2:provided
[INFO] |  |  +- org.apache.commons:commons-lang3:jar:3.7:provided
[INFO] |  |  +- org.apache.commons:commons-text:jar:1.3:provided
[INFO] |  |  +- commons-beanutils:commons-beanutils:jar:1.9.3:provided
[INFO] |  |  |  \- commons-collections:commons-collections:jar:3.2.2:provided
[INFO] |  |  \- org.apache.commons:commons-collections4:jar:4.1:provided
[INFO] |  +- org.mozilla:rhino:jar:1.7R5:provided
[INFO] |  +- org.tukaani:xz:jar:1.8:provided
[INFO] |  +- com.github.luben:zstd-jni:jar:1.3.3-1:provided
[INFO] |  +- com.jayway.jsonpath:json-path:jar:2.3.0:provided
[INFO] |  |  \- net.minidev:json-smart:jar:2.3:provided
[INFO] |  |     \- net.minidev:accessors-smart:jar:1.2:provided
[INFO] |  +- org.antlr:antlr4-runtime:jar:4.5.1:provided
[INFO] |  +- com.lmax:disruptor:jar:3.3.6:provided
[INFO] |  +- com.google.code.findbugs:jsr305:jar:2.0.1:provided
[INFO] |  +- net.java.dev.jna:jna:jar:4.5.1:provided
[INFO] |  +- javax.validation:validation-api:jar:1.1.0.Final:provided
[INFO] |  +- org.asynchttpclient:async-http-client:jar:2.5.3:provided
[INFO] |  |  +- org.asynchttpclient:async-http-client-netty-utils:jar:2.5.3:provided
[INFO] |  |  |  \- io.netty:netty-buffer:jar:4.1.29.Final:provided
[INFO] |  |  +- io.netty:netty-codec-http:jar:4.1.29.Final:provided
[INFO] |  |  |  \- io.netty:netty-codec:jar:4.1.29.Final:provided
[INFO] |  |  +- io.netty:netty-handler:jar:4.1.29.Final:provided
[INFO] |  |  |  \- io.netty:netty-transport:jar:4.1.29.Final:provided
[INFO] |  |  +- io.netty:netty-codec-socks:jar:4.1.29.Final:provided
[INFO] |  |  +- io.netty:netty-handler-proxy:jar:4.1.29.Final:provided
[INFO] |  |  +- io.netty:netty-transport-native-epoll:jar:linux-x86_64:4.1.29.Final:provided
[INFO] |  |  |  +- io.netty:netty-common:jar:4.1.29.Final:provided
[INFO] |  |  |  \- io.netty:netty-transport-native-unix-common:jar:4.1.29.Final:provided
[INFO] |  |  +- io.netty:netty-resolver-dns:jar:4.1.29.Final:provided
[INFO] |  |  |  +- io.netty:netty-resolver:jar:4.1.29.Final:provided
[INFO] |  |  |  \- io.netty:netty-codec-dns:jar:4.1.29.Final:provided
[INFO] |  |  +- org.reactivestreams:reactive-streams:jar:1.0.2:provided
[INFO] |  |  +- com.typesafe.netty:netty-reactive-streams:jar:2.0.0:provided
[INFO] |  |  \- com.sun.activation:javax.activation:jar:1.2.0:provided
[INFO] |  +- org.hyperic:sigar:jar:1.6.5.132:provided
[INFO] |  +- org.gridkit.lab:jvm-attach-api:jar:1.2:provided
[INFO] |  \- io.netty:netty:jar:3.10.6.Final:provided
[INFO] +- org.apache.druid:druid-processing:jar:0.13.0-incubating-SNAPSHOT:provided
[INFO] |  +- org.apache.druid:druid-hll:jar:0.13.0-incubating-SNAPSHOT:provided
[INFO] |  +- org.apache.druid:extendedset:jar:0.13.0-incubating-SNAPSHOT:provided
[INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.7.36:provided
[INFO] |  |  \- org.roaringbitmap:shims:jar:0.7.36:provided
[INFO] |  +- com.ning:compress-lzf:jar:1.0.4:provided
[INFO] |  +- com.google.errorprone:error_prone_annotations:jar:2.3.2:provided
[INFO] |  +- com.ibm.icu:icu4j:jar:54.1.1:provided
[INFO] |  +- org.lz4:lz4-java:jar:1.5.0:provided
[INFO] |  +- org.mapdb:mapdb:jar:1.0.8:provided
[INFO] |  +- org.ow2.asm:asm:jar:5.2:provided
[INFO] |  +- org.ow2.asm:asm-commons:jar:5.2:provided
[INFO] |  |  \- org.ow2.asm:asm-tree:jar:5.2:provided
[INFO] |  +- org.checkerframework:checker:jar:2.5.7:provided
[INFO] |  \- org.apache.maven:maven-artifact:jar:3.6.0:provided
[INFO] |     \- org.codehaus.plexus:plexus-utils:jar:3.0.15:provided
[INFO] +- com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7:provided
[INFO] +- com.fasterxml.jackson.core:jackson-core:jar:2.6.7:provided
[INFO] +- com.fasterxml.jackson.core:jackson-databind:jar:2.6.7:provided
[INFO] +- com.fasterxml.jackson.datatype:jackson-datatype-guava:jar:2.6.7:provided
[INFO] +- com.fasterxml.jackson.datatype:jackson-datatype-joda:jar:2.6.7:provided
[INFO] +- com.fasterxml.jackson.dataformat:jackson-dataformat-smile:jar:2.6.7:provided
[INFO] +- com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:jar:2.6.7:provided
[INFO] |  +- com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:jar:2.6.7:provided
[INFO] |  \- com.fasterxml.jackson.module:jackson-module-jaxb-annotations:jar:2.6.7:provided
[INFO] +- com.fasterxml.jackson.jaxrs:jackson-jaxrs-smile-provider:jar:2.6.7:provided
[INFO] +- junit:junit:jar:4.12:test
[INFO] |  \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] +- org.easymock:easymock:jar:3.4:test
[INFO] |  \- org.objenesis:objenesis:jar:2.2:provided
[INFO] +- org.apache.druid:druid-core:test-jar:tests:0.13.0-incubating-SNAPSHOT:test
[INFO] \- org.apache.druid:druid-processing:test-jar:tests:0.13.0-incubating-SNAPSHOT:test
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.167 s
[INFO] Finished at: 2019-01-16T10:29:03-08:00
[INFO] ------------------------------------------------------------------------

leerho commented 5 years ago

HLL based counting functionality is a pretty standard thing in databases these days.

True. And this is why a number of database companies have adopted the DataSketches library into their core code for performing internal query optimization and other internal data analysis tasks. Not only to use HLL or Theta for unique counting, but also to use the Quantiles sketches for understanding data distributions, and the Frequency sketches for capturing “Top N” data items that may require special handling because of their large number of occurrences.

leerho commented 5 years ago

DataSketches and version compatibility going forward ...

We already have intigrations with Hadoop Hive, Pig, and a little Spark, and our #1 commitment is to maintain binary compatibility going forward. The APIs may evolve over time but backward compatibility of the stored images is critical. As our library evolves with implementations in C++ and Python we will have binary compatibility across languages as well.

drcrallen commented 5 years ago

The APIs may evolve over time

That's what breaks guava and jackson for us all the time

dylwylie commented 5 years ago

We've been evaluating the Datasketches implementation of HLL versus Druid's own implementation.

In early testing we've noticed that performance in the datasketches version seem significantly worse with various configurations.

Curious if anyone else has seem similar results? When aggregating from a column of HLL values it looks like proportionally a lot more time is spent on:

org.apache.druid.query.aggregation.datasketches.hll.HllSketchObjectStrategy.fromByteBuffer

Maybe the benchmarks don't consider deserialisation time hence why we see contradictory results? Happy to put together more detailed information.

leerho commented 5 years ago

@Dylan1312 Sorry, I didn't see your comment until just now. I would be very interested in understanding what you are observing:

What is your configuration of the HLL sketch? What is LgK? HLL4,6,8?
What properties are you measuring, merge time? You mention deserialization time, but if you data is off-heap, there shouldn't be much deserialization. So I'm a bit confused.
You say it is "worse"... could you report your measurements? how much slower?
What the size of the data (# of sketch merges)
Do you happen to know anything about the distribution of your data? Long-tail? Lots of small values? or large values?
Are you also comparing the accuracy?

Cheers, Lee

dylwylie commented 5 years ago

Hi Lee,

Thanks for the response!

I've tried various configurations, each of HLL4,6&8 and lgK values of 6&12 for each.
I'm comparing the time to complete timeseries queries over a small number of segments.

One using a hyperUnique (Druid's native hll) aggregator on a column of hyperUnique hll sketches, versus various queries each using a single HLLSketchMerge aggregator against a column ingested with HLLSketchBuild.

I noticed that the aggregator spends a significant portion of its time passing a bytebuffer to HLLSketch::wrap, deserialization may be the wrong term :).

With ~18.5M sketches and an historical with a single core I see:
- An HLL8 sketch queried with lgK 6 takes around 2.5seconds to complete.
- A hyperunique sketch takes around 1.6seconds to complete.
I collected what I'm using fairly ad-hoc from a stream of data so I'm not sure. I expect the distribution to be fairly even but this is something I can investigate more.
I haven't specifically looked at accuracy but I'm seeing both give sketches give answers within 0.4% of each other.

Best regards, Dylan

leerho commented 5 years ago

@Dylan1312 Ok, this is helpful, and thank you for performing these experiments!

1) The Druid HLL sketch has a fixed size of LgK = 11. So in all fairness, we should be using the same size / accuracy parameter for both sketches. This will make the merge time for the DS-HLL sketch a little bit slower still.

2) The HLL8 configuration should be the fastest of the three and HLL4 the slowest. But the HLL4 is the smallest (similar in size to the Druid-HLL) and HLL8 is twice the size.

3) The DS-HLL sketch does not take ByteBuffer directly, so more than likely there is a step that performs a Memory.wrap(ByteBuffer, ByteOrder) and then passes it to the sketch that performs the HllSketch.wrap(Memory). This wrap is just a read-only view into the ByteBuffer. Our friend Leventov wants to eventually replace ByteBuffer with Memory and at that point the extra conversion will be unnecessary and will save time.

4) There is another subtle difference between the DS-HLL and the Druid-HLL in that the DS-HLL was designed to grow gradually in its use of memory. I.e., it starts out in sparse mode and then grows gradually until it converts to dense mode. Meanwhile, the Druid-HLL starts full size. It only uses sparse representation when exporting the sketch. This might give the Druid-HLL a speed advantage on merge, but I have not characterized that to see if it really does. Nonetheless, when I have some time, I would like to do an experiment and see if I added the capability for the DS-HLL sketch to also start full-size, that it might improve the merge performance.

5) The DS-HLL sketch also has the capability to not only operate with different LgKs, but the Union also allows merging between different LgKs and types: HLL4,6,8. So it is potentially doing a lot more work.

Finally, I'm not surprised that it is a little slower than the Druid-HLL sketch, but at least you can be guaranteed that it will produce results within the specified confidence bounds. The Druid-HLL sketch is already known to produce huge errors under certain conditions and with no warning.

After all, what is speed worth if you can't rely on it to produce correct results? :)

Cheers, Lee.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

apache / druid

[Discuss] Replacing hyperUnique as 'default' distinct count sketch #6814