MLnick / hive-udf

Approximate cardinality estimation with HyperLogLog, as a Hive function
Apache License 2.0

To merge two or more HLL sets #2

Closed yashk closed 11 years ago

yashk commented 11 years ago

We know that two or more HLL sets can be merged to get the cardinality of the larger set. For example, if we have 31 HLL sets, each holding the unique visitor count for one day, we can merge them to get the unique visitor count for the month. Can this UDAF support that?

abramsm commented 11 years ago

Yes.

Matt Abrams


MLnick commented 11 years ago

To provide a bit more detail: assuming you've created a column approx as struct {type string; cardinality bigint; binary binary}, and did something like select approx_distinct(col) as approx from ..., then each row will contain the HLL structure for the relevant day. You can then do select approx_distinct(approx).cardinality from APPROX_TABLE ... to get the cardinality of the merged HLL structures.
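To illustrate why merging daily structures gives the cardinality of the combined set, here is a minimal, self-contained Python sketch of a HyperLogLog with a merge step. It is not the code this UDAF or stream-lib uses; the precision P, the SHA-1 based hash, and all function names are assumptions chosen for illustration. The point is that merging is just a register-wise maximum, so the merged sketch is identical to one built from all the values at once.

```python
import hashlib
import math

P = 12          # assumed precision: 2**P registers
M = 1 << P      # number of registers

def _hash64(value):
    """Map a value to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def new_sketch():
    return [0] * M

def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # 1-based position of first 1-bit
    sketch[idx] = max(sketch[idx], rank)

def merge(a, b):
    """Union of two sketches: keep the larger value in each register."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)        # small-range correction
    return raw

# Two hypothetical "days" whose visitors overlap (ids 3000..5999 appear in both).
day1, day2 = new_sketch(), new_sketch()
for user in range(0, 6000):
    add(day1, user)
for user in range(3000, 9000):
    add(day2, user)

month = merge(day1, day2)
print("day1:", round(estimate(day1)))
print("day2:", round(estimate(day2)))
print("merged:", round(estimate(month)))      # close to 9000, not 12000
```

Both sketches must use the same precision and hash function for the merge to be meaningful, which is presumably why the struct stores the HLL type next to the binary data.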

yashk commented 11 years ago

Thanks @abramsm @MLnick, I will try it and update.

moizarafat commented 11 years ago

Do you plan to add functionality for intersection / minus between two sets?

abramsm commented 11 years ago

No. HyperLogLog intersection is a bit tricky and requires that the intersection be performed on two or more instances that have certain characteristics. Even though it is not implemented in the core library, it is trivial to perform the intersection yourself. Here is a thread on the stream-lib mailing list on this topic:

https://groups.google.com/forum/?fromgroups=#!searchin/stream-lib-user/intersection/stream-lib-user/LKagmkKYA14/ffPzy55O6JwJ

Cheers, Matt
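One common way to get intersection and minus from sketches that only support union is inclusion-exclusion on the estimated cardinalities. Whether the linked thread recommends exactly this is an assumption on my part; the function names below are hypothetical, and the inputs are meant to be the cardinalities of A, B, and the merged structure of A and B. A minimal Python sketch:

```python
def intersection_estimate(card_a, card_b, card_union):
    """|A n B| ~= |A| + |B| - |A u B|, clamped at zero."""
    return max(0, card_a + card_b - card_union)

def difference_estimate(card_union, card_b):
    """|A minus B| ~= |A u B| - |B|, clamped at zero."""
    return max(0, card_union - card_b)

# Hypothetical numbers: two days with 6000 visitors each, 9000 in the merge.
print(intersection_estimate(6000, 6000, 9000))
print(difference_estimate(9000, 6000))
```

The errors of the individual estimates add up here, so the result can be far off when the intersection is small compared with the sets themselves. The clamp only hides negative results; it does not make them accurate.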


yashk commented 11 years ago

We tried the merge and it is working. Thanks @MLnick @abramsm