Closed yashk closed 11 years ago
Yes.
Matt Abrams
On Thursday, April 25, 2013 at 4:03 AM, yash wrote:
We know that two or more HLL sets can be merged to get the cardinality of the larger set. For example, if we have 31 HLL sets, each holding the unique visitor count for one day, we can merge them to get the unique visitor count for a month. Can this be supported by this UDAF?
To provide a bit more detail: assuming you've created a column approx as struct {type string; cardinality bigint; binary binary}, and populated it with something like select approx_distinct(col) as approx from ..., then each row will contain the HLL structure for the relevant day. You can then do select approx_distinct(approx).cardinality from APPROX_TABLE ... to get the cardinality of the merged HLL structures.
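For anyone wondering why merging the daily structures works: two HyperLogLog sketches of the same size can be unioned by taking the register-wise maximum, and the result is identical to a sketch built from all the underlying values at once. Below is a minimal Python sketch of that idea. It is a toy written for illustration, not the stream-lib or UDAF code; the class name TinyHLL, the SHA-1 based hash, and the visitor IDs are all assumptions made up for the example.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog for illustration only (hypothetical, not stream-lib)."""

    def __init__(self, p=12):
        self.p = p                    # number of bits used to pick a register
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an assumption).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits: register index
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union of two sketches: register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same number of registers")
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)   # small-range correction
        return int(round(est))


# Three "daily" sketches with overlapping made-up visitor IDs.
days = []
for d in range(3):
    sketch = TinyHLL()
    for v in range(d * 5000, d * 5000 + 10000):
        sketch.add(f"visitor-{v}")
    days.append(sketch)

# Merge the daily sketches into one covering the whole period.
monthly = days[0]
for sketch in days[1:]:
    monthly = monthly.merge(sketch)

# The toy data has 20000 distinct visitors; the estimate should land near that.
print(monthly.cardinality())
```

Because the merge only compares registers, the daily sketches never need to be rebuilt from raw data, which is what makes storing one HLL structure per day and merging later practical.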
Thanks @abramsm @MLnick, will try and update.
Do you plan to add functionality for intersection / minus between two sets?
No. HyperLogLog intersection is a bit tricky and requires that it be performed on two or more instances that have certain characteristics. Even though it is not implemented in the core library, it is trivial to perform the intersection yourself. Here is a thread on the stream-lib mailing list on this topic:
Cheers, Matt
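On the intersection / minus question, one common approach (which may or may not be what the mailing-list thread describes) is inclusion-exclusion on top of the merge: estimate each set, estimate the merged set, and combine the three numbers. A minimal Python sketch follows; the function names and the sample numbers are hypothetical, and in practice the inputs would be the .cardinality values of the individual and merged HLL structures.

```python
def intersection_estimate(card_a, card_b, card_union):
    # Inclusion-exclusion: |A and B| is roughly |A| + |B| - |A or B|.
    # Clamped at zero because estimation error can push the result negative.
    return max(0, card_a + card_b - card_union)


def difference_estimate(card_a, card_b, card_union):
    # Elements in A but not in B: roughly |A or B| - |B|, clamped at zero.
    # card_a is unused here and kept only so both helpers share a signature.
    return max(0, card_union - card_b)


# Made-up estimates: two sets of about 10000 each, merged estimate of 15000.
est_a, est_b, est_union = 10000, 10000, 15000
print(intersection_estimate(est_a, est_b, est_union))
print(difference_estimate(est_a, est_b, est_union))
```

Keep in mind that the error of each estimate is proportional to the size of the set it describes, so when the intersection is small relative to the union the result can be dominated by noise.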
On Tue, Apr 30, 2013 at 8:14 PM, maxwell1985 notifications@github.com wrote:
Do you plan to add functionality for intersection / minus between two sets?
We tried the merge and it is working, thanks @MLnick @abramsm.