apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Compute] Implement count distinct kernel using HyperLogLog #29746

Open asfimport opened 3 years ago

asfimport commented 3 years ago

It may be useful to have a version of the count distinct aggregation kernel that uses HyperLogLog.

Note: The implementation should support the merge operator.
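To illustrate why merge support matters, here is a minimal standalone sketch of a mergeable HyperLogLog estimator. It is not Arrow code: the names `HllSketch` and `Mix64`, the precision of 12 bits, and the use of 64-bit hashes (rather than the 32-bit hashes in the original paper) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the splitmix64 finalizer, used here only to turn
// integer keys into well-mixed 64-bit hashes. A real kernel would hash
// arbitrary column values.
inline uint64_t Mix64(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Hypothetical sketch type (not an Arrow class).
class HllSketch {
 public:
  static constexpr int kPrecision = 12;  // index bits (assumed value)
  static constexpr std::size_t kNumRegisters = std::size_t{1} << kPrecision;

  // Record one already-hashed value.
  void Add(uint64_t hash) {
    const std::size_t index = static_cast<std::size_t>(hash >> (64 - kPrecision));
    uint64_t rest = hash << kPrecision;  // remaining bits, left-aligned
    // rank = 1-based position of the leftmost set bit in the remaining
    // bits, capped at (64 - kPrecision + 1) when they are all zero.
    uint8_t rank = 1;
    while (rank <= 64 - kPrecision && (rest & (1ULL << 63)) == 0) {
      ++rank;
      rest <<= 1;
    }
    registers_[index] = std::max(registers_[index], rank);
  }

  // Merge another sketch into this one: register-wise maximum.
  void Merge(const HllSketch& other) {
    for (std::size_t i = 0; i < kNumRegisters; ++i) {
      registers_[i] = std::max(registers_[i], other.registers_[i]);
    }
  }

  // Estimate the number of distinct values added so far.
  double Estimate() const {
    const double m = static_cast<double>(kNumRegisters);
    double sum = 0.0;
    std::size_t zeros = 0;
    for (uint8_t r : registers_) {
      sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^(-r)
      if (r == 0) ++zeros;
    }
    const double alpha = 0.7213 / (1.0 + 1.079 / m);
    double estimate = alpha * m * m / sum;
    // Small-range correction: fall back to linear counting.
    if (estimate <= 2.5 * m && zeros > 0) {
      estimate = m * std::log(m / static_cast<double>(zeros));
    }
    return estimate;
  }

 private:
  std::array<uint8_t, kNumRegisters> registers_{};
};
```

Because `Merge` is a register-wise maximum, merging two partial sketches yields the same registers as a single sketch built over the union of their inputs. This is the property that would let per-batch or per-thread aggregation states be combined.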

cc @ianmcook @lidavidm

Some resources/links:
http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/
https://github.com/facebookincubator/velox/tree/main/velox/aggregates/hyperloglog

Reporter: Percy Camilo Triveño Aucahuasi / @aucahuasi

Note: This issue was originally created as ARROW-14158. Please see the migration documentation for further details.

asfimport commented 2 years ago

ZMZ91 / @ZMZ91: Hi there, is this feature scheduled?

asfimport commented 2 years ago

Dhruv Vats / @dhruv9vats: Is there still interest in this? If so, I'd be happy to give this a go.

Also, this will go into hash_aggregate, right? And could it be named something like hash_count_distinct_estimate or hash_count_distinct_hll?

asfimport commented 2 years ago

ZMZ91 / @ZMZ91: Sure. We'd like to have a hash_count_distinct_hll for an approximate result in many real-world cases.

jbapple commented 1 year ago

take