apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

[Proposal] Fixed-shard table sampling #15140

Open takaaki7 opened 9 months ago

takaaki7 commented 9 months ago

I'd like to introduce a table sampling feature. Specifically, it would be beneficial if we could perform sampling with a fixed shard number. This would allow us to calculate the distinct count of data related to the key of the secondary partition with high accuracy.

For instance, consider a table storing user activity data on a website as follows:

Schema: __time, user_id, event_name, ...
Partition: day
Secondary partition: type=hash, key=user_id, shard_num=100

In normal sampling, functions like COUNT(DISTINCT user_id) don't work well. If we were to randomly select 50% of the rows from the table, the probability that at least one row with a particular user_id is included in the sample is not close to 50%.

However, what if we targeted only shards 0-49 out of the 100 shards? Given that the hash function generally distributes data uniformly, a specific user_id would be part of the sample with roughly 50% probability.

Consider the following query, for example:

SELECT 
COUNT(DISTINCT user_id)
FROM table TABLESAMPLE(percentage=50, type=fixed_shard)
WHERE '2023-10-01T00:00:00Z' <= __time AND __time <= '2023-10-03T00:00:00Z'

Suppose the segment files exist as:

<2023-10-01T00:00:00Z>_0
<2023-10-01T00:00:00Z>_1
<2023-10-01T00:00:00Z>_...
<2023-10-01T00:00:00Z>_99
<2023-10-02T00:00:00Z>_0
<2023-10-02T00:00:00Z>_1
<2023-10-02T00:00:00Z>_...
<2023-10-02T00:00:00Z>_99

The idea is to target only <2023-10-01T00:00:00Z>_<0-49> and <2023-10-02T00:00:00Z>_<0-49>.
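To illustrate why this works, here is a small Python sketch (the `shard_of` helper and the crc32-based partitioning are illustrative assumptions, not Druid's actual hash function). Because all of a user's rows hash to the same shard, each user is either entirely inside or entirely outside the sample, so scaling the sampled distinct count by 2 is a sound estimator:

```python
import zlib

NUM_SHARDS = 100

def shard_of(user_id: str) -> int:
    # Illustrative hash partitioning; Druid's real hash function differs.
    return zlib.crc32(user_id.encode()) % NUM_SHARDS

# 1000 users, 10 rows each
rows = [(f"user{u}", f"event{e}") for u in range(1000) for e in range(10)]

# 50% fixed-shard sample: keep only rows whose user_id falls in shards 0-49
sample = [(uid, ev) for uid, ev in rows if shard_of(uid) < NUM_SHARDS // 2]

true_uu = len({uid for uid, _ in rows})      # 1000
sample_uu = len({uid for uid, _ in sample})
estimate = sample_uu * 2                     # scale by 1 / sampling fraction
```

Note that because sharding keeps each user's rows together, `sample_uu * 10 == len(sample)` holds exactly, which is what makes the scaled estimate reliable, unlike row-level random sampling.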

cryptoe commented 8 months ago

@takaaki7 could you please clarify what "In normal sampling" means in the context of Druid? Count distinct uses sketches and then merges the data across all the segments. Where does sampling come into play?

I do not understand the requirement of this feature as of now. Maybe if you could give some concrete use cases, that would help clear my doubts.

takaaki7 commented 8 months ago

"Normal sampling" referred to the kind of general table sampling found in other databases. For instance, BigQuery offers the following functionality: https://cloud.google.com/bigquery/docs/table-sampling

Druid does not yet have such a feature.

I would like the typical sampling features found in systems like BigQuery, but what I especially want is sampling with specific shards, because it's useful for COUNT(DISTINCT user_id) (user_id is just an example of a sharding key).

Consider a method for calculating the number of unique users (UU) using a table that accumulates user behavior logs.

Assume that the table contains 10 users and each user has 10 rows (total rows = 10 * 10 = 100).

user_id   event
user1     event1
user1     event2
user1     ...
user1     event10
...       ...
user10    event10

And we want to use sampling to reduce the number of target rows and improve performance. We might consider randomly scanning 10% (= 10 rows) of the table's entries, counting the distinct users in the sample, and then multiplying that number by 10 to get an estimate. However, this method doesn't give an accurate estimate; it usually overestimates.
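This overestimation is easy to see in a short Monte Carlo sketch (a hypothetical setup matching the 10-users, 10-rows-each table above):

```python
import random

random.seed(0)  # reproducible run
# 10 users with 10 rows each; true distinct count = 10
rows = [(f"user{u}", f"event{e}") for u in range(1, 11) for e in range(1, 11)]

trials = 2000
total_est = 0
for _ in range(trials):
    sampled = random.sample(rows, 10)        # random 10% of the rows
    uu = len({uid for uid, _ in sampled})    # distinct users in the sample
    total_est += uu * 10                     # scale by 1 / sampling fraction

avg_estimate = total_est / trials  # lands near 67, far above the true 10
```

The average estimate is dominated by samples that touch many different users, so the scaled-up count lands several times above the true value.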

The most favorable scenario is when all 10 randomly chosen rows belong to just one user, in which case multiplying the count by 10 would give the correct result.

But this scenario is very rare: the probability is 10 / 100C10 = 10 / 17,310,309,456,440. (nCm means the number of combinations of m items chosen from n.)

Probabilities for other distinct-user counts can be calculated with inclusion-exclusion, as follows (while I'm not entirely confident...):

P(uu=2) = 10C2 * (20C10 - 2) / 100C10
P(uu=3) = 10C3 * (30C10 - 3 * 20C10 + 3) / 100C10
...

The probability of having 5 or fewer unique users in the sample is around 10%. So random sampling (what I referred to as "normal sampling") is not useful for counting unique users.
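These probabilities can be checked with a short script using `math.comb` and the inclusion-exclusion formula (a verification sketch, not part of the proposal itself):

```python
from math import comb

U, R, S = 10, 10, 10        # users, rows per user, sample size
TOTAL = comb(U * R, S)      # 100C10 = 17,310,309,456,440

def p_uu(k: int) -> float:
    # P(sample has exactly k distinct users): choose k users, then count the
    # size-S subsets of their k*R rows that touch all k users (inclusion-exclusion).
    onto = sum((-1) ** j * comb(k, j) * comb(R * (k - j), S) for j in range(k + 1))
    return comb(U, k) * onto / TOTAL

probs = {k: p_uu(k) for k in range(1, U + 1)}
p_at_most_5 = sum(probs[k] for k in range(1, 6))  # about 0.10
```

The probabilities sum to 1 across uu = 1..10, and the uu <= 5 tail comes out near 10%, consistent with the figure quoted above.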

So my proposal is: if rows are sharded by user_id, then by targeting specific shards for sampling, we can obtain a reliable approximation.