Open begriffs opened 8 years ago
A related question came up in our Slack channel yesterday: https://citus-public.slack.com/archives/general/p1466184975000275
I used the HLL workaround about a year ago. Still, I couldn't remember the answer or find the related information in our docs. We document the answer in this section in our docs: https://www.citusdata.com/docs/citus/5.1/dist_tables/querying.html#hyperloglog-column
I think the primary issue with this section is that it only gives the final answer without setting any context. That is, if I'm a relatively advanced user, I can't piece together how to run distributed queries on HLLs. I have a quick note to see if we can improve on this section by providing a complete example.
If we show an example table schema that has HLL as a data type, and then show the query that would compute the final result, this would be extremely helpful.
For example, we provide the following steps as part of our data modeling exercise. First, we need to define a raw events table and explain how one can roll-up from the raw events table into a summary_daily
table using hll_add_agg(hll_hash_integer(click_engagement))
. (This step is currently missing from this comment.)
Next / in conjunction, we provide an example schema for the summary_daily
table.
CREATE TABLE summary_daily ( unique_id BIGINT not null, ... report_date DATE not null, campaign_id BIGINT not null, impression_total HLL not null, click_engagement_count HLL not null, ... );
We then create an aggregate function as defined in our documentation on the master and worker nodes.
CREATE AGGREGATE sum (hll) ( sfunc = hll_union_trans, stype = internal, finalfunc = hll_pack );
After that step, we could show an example query and explain what how this query gets executed on the master and worker nodes.
SELECT campaign_id, hll_cardinality(SUM(impression_total)), ... FROM summary_daily WHERE report_date BETWEEN '2015-07-01' AND '2015-07-31' GROUP BY 1, 2, 3;
We recently rolled out our first technical solution to our documentation. @lithp documented a more complete example of using HLLs in this technical solution. We could copy/paste that as a separate section (or on top of our current section) to our docs.
+2 -- We recently had two customers who asked about these instructions. I wanted to make a note of it here.
@begriffs / @sumedhpathak -- I'm planning to drop topN from this issue and narrow it down to HLLs. This recently came up in an email thread (Re: Responding to your Citus Question -- HyperLogLog).
We currently have three ways in which users can rely on HLLs:
They need to run count(distinct non_partition_column)
and the query errors out. In this case, we don't need to mention HLLs to the user and can just ask them to set the related config key (#78). @byucesoy's first blog post also covers this topic: https://www.citusdata.com/blog/2017/04/04/distributed_count_distinct_with_postgresql/
We have a customer who keeps data in HLLs. They then run a query that spans across multiple shards. The query needs to aggregate HLLs from different shards. This happens infrequently and it's mentioned in my previous comment in this issue.
We have a customer who rolls up data that uses HLL. For example, they roll-up data at one hour, one day, or one month intervals. Each roll-up table has an HLL column. This happens the most frequently. We currently cover these types of roll-ups in two places.
INSERT INTO ... SELECT
: https://www.citusdata.com/blog/2017/06/30/efficient-rollup-with-hyperloglog-on-postgres/We also have an issue to update the Use Case guide to reflect Citus' new roll-up capabilities (#198).
Our Use Case guide has a dated section on HLL roll-ups: https://docs.citusdata.com/en/v6.2/use_case_guide/real_time_dashboards.html#approximate-distinct-counts
This came up from a customer recently (#433)
Reviewing the docs it looks like we have item 1 covered:
To cover item 3 we could modify our example of rollups to use HLL rather than bigint. Or could add a subsection showing that HLL is possible as well.
I could modify our tiny HyperLogLog section to cover item 2 in a non-rollup situation. Would a simple example suffice with running the custom sum
aggregation on a non-distribution column?
hey @begriffs -- I think the answer might be more involved than that.
I'll try to put myself into a new user's shoes.
User profile 1: I don't know anything about HLL. I want to run count(distinct)
. For that, I need to follow the instructions in https://docs.citusdata.com/en/v7.0/dist_tables/querying.html#count-distinct-aggregates. @mtuncer / @metdos are working on making count(distinct)
work. We can therefore move or remove this section with Citus 7.1.
User profile 2: I heard about HLL. I search the Citus documentation and I come across this section: https://docs.citusdata.com/en/v7.0/dist_tables/querying.html#hyperloglog-column.
I'd be confused at this point because this section doesn't provide any context about my use case.
INSERT INTO SELECT
.These two use cases are different, but I don't know which one applies to me. Also, there isn't an example table schema or even an example query that shows me how to run a query. So, I'm lost.
User profile 3: I heard about HLL. I search Citus documentation and I come across our Use Case Guide: https://docs.citusdata.com/en/v7.0/use_case_guide/real_time_dashboards.html#approximate-distinct-counts
In this case, I still have the issue that my use case may not be the one described in the use case. This guide provides more context about the use case and gives examples. The guide however is completely dated and misinforms the user about how to set up HLLs.
If I were working on this feature, I'd approach the problem in one of the following ways:
count(distinct)
into Citus 7.2. If we are, we could also move the count(distinct) / limit approximation advice in https://docs.citusdata.com/en/v7.0/dist_tables/querying.html to the Approximation Algorithms section.