Document different uses of HyperLogLog

begriffs commented 8 years ago

HLL

ozgune commented 8 years ago

A related question came up in our Slack channel yesterday: https://citus-public.slack.com/archives/general/p1466184975000275

I used the HLL workaround about a year ago. Still, I couldn't remember the answer or find the related information in our docs. We document the answer in this section in our docs: https://www.citusdata.com/docs/citus/5.1/dist_tables/querying.html#hyperloglog-column

I think the primary issue with this section is that it only gives the final answer without setting any context. That is, if I'm a relatively advanced user, I can't piece together how to run distributed queries on HLLs. I have a quick note to see if we can improve on this section by providing a complete example.

If we show an example table schema that has HLL as a data type, and then show the query that would compute the final result, this would be extremely helpful.

For example, we provide the following steps as part of our data modeling exercise. First, we need to define a raw events table and explain how one can roll-up from the raw events table into a summary_daily table using hll_add_agg(hll_hash_integer(click_engagement)). (This step is currently missing from this comment.)

Next / in conjunction, we provide an example schema for the summary_daily table.

CREATE TABLE summary_daily (
    unique_id  BIGINT not null,
   ...
    report_date DATE not null,
    campaign_id BIGINT not null,
    impression_total HLL not null,
    click_engagement_count HLL not null,
   ...
);

We then create an aggregate function as defined in our documentation on the master and worker nodes.

CREATE AGGREGATE sum (hll)
(
  sfunc = hll_union_trans,
  stype = internal,
  finalfunc = hll_pack
);

After that step, we could show an example query and explain what how this query gets executed on the master and worker nodes.

SELECT
    campaign_id,
    hll_cardinality(SUM(impression_total)),
   ...
FROM summary_daily
WHERE report_date BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY 1, 2, 3;

ozgune commented 8 years ago

We recently rolled out our first technical solution to our documentation. @lithp documented a more complete example of using HLLs in this technical solution. We could copy/paste that as a separate section (or on top of our current section) to our docs.

ozgune commented 7 years ago

+2 -- We recently had two customers who asked about these instructions. I wanted to make a note of it here.

ozgune commented 7 years ago

@begriffs / @sumedhpathak -- I'm planning to drop topN from this issue and narrow it down to HLLs. This recently came up in an email thread (Re: Responding to your Citus Question -- HyperLogLog).

We currently have three ways in which users can rely on HLLs:

They need to run count(distinct non_partition_column) and the query errors out. In this case, we don't need to mention HLLs to the user and can just ask them to set the related config key (#78). @byucesoy's first blog post also covers this topic: https://www.citusdata.com/blog/2017/04/04/distributed_count_distinct_with_postgresql/
We have a customer who keeps data in HLLs. They then run a query that spans across multiple shards. The query needs to aggregate HLLs from different shards. This happens infrequently and it's mentioned in my previous comment in this issue.
We have a customer who rolls up data that uses HLL. For example, they roll-up data at one hour, one day, or one month intervals. Each roll-up table has an HLL column. This happens the most frequently. We currently cover these types of roll-ups in two places.

Our Use Case guide has a dated section on HLL roll-ups: https://docs.citusdata.com/en/v6.2/use_case_guide/real_time_dashboards.html#approximate-distinct-counts
We have a blog post that talks about rolling up using INSERT INTO ... SELECT: https://www.citusdata.com/blog/2017/06/30/efficient-rollup-with-hyperloglog-on-postgres/

We also have an issue to update the Use Case guide to reflect Citus' new roll-up capabilities (#198).

aamederen commented 7 years ago

Our Use Case guide has a dated section on HLL roll-ups: https://docs.citusdata.com/en/v6.2/use_case_guide/real_time_dashboards.html#approximate-distinct-counts

This came up from a customer recently (#433)

begriffs commented 7 years ago

Reviewing the docs it looks like we have item 1 covered:

https://docs.citusdata.com/en/v7.0/dist_tables/querying.html#count-distinct-aggregates

To cover item 3 we could modify our example of rollups to use HLL rather than bigint. Or could add a subsection showing that HLL is possible as well.

I could modify our tiny HyperLogLog section to cover item 2 in a non-rollup situation. Would a simple example suffice with running the custom sum aggregation on a non-distribution column?

ozgune commented 7 years ago

hey @begriffs -- I think the answer might be more involved than that.

I'll try to put myself into a new user's shoes.

User profile 1: I don't know anything about HLL. I want to run count(distinct). For that, I need to follow the instructions in https://docs.citusdata.com/en/v7.0/dist_tables/querying.html#count-distinct-aggregates. @mtuncer / @metdos are working on making count(distinct) work. We can therefore move or remove this section with Citus 7.1.

User profile 2: I heard about HLL. I search the Citus documentation and I come across this section: https://docs.citusdata.com/en/v7.0/dist_tables/querying.html#hyperloglog-column.

I'd be confused at this point because this section doesn't provide any context about my use case.

I may want to store my data in HLL columns and run queries across HLL columns. OR,
I may want to roll up using INSERT INTO SELECT.

These two use cases are different, but I don't know which one applies to me. Also, there isn't an example table schema or even an example query that shows me how to run a query. So, I'm lost.

User profile 3: I heard about HLL. I search Citus documentation and I come across our Use Case Guide: https://docs.citusdata.com/en/v7.0/use_case_guide/real_time_dashboards.html#approximate-distinct-counts

In this case, I still have the issue that my use case may not be the one described in the use case. This guide provides more context about the use case and gives examples. The guide however is completely dated and misinforms the user about how to set up HLLs.

If I were working on this feature, I'd approach the problem in one of the following ways:

Revise the Use Case guide to be current (it's dated by about a year). This way, 50%+ of people who search for HLL will find what they are looking for. Others will know that this isn't what they want.
Extend the section on Postgres extensions or create a separate section on Approximation Algorithms. Provide examples for the two use cases here.
Confirm with @metdos that we're getting count(distinct) into Citus 7.2. If we are, we could also move the count(distinct) / limit approximation advice in https://docs.citusdata.com/en/v7.0/dist_tables/querying.html to the Approximation Algorithms section.

citusdata / citus_docs

Document different uses of HyperLogLog #41