opt: collect heavy hitters in table statistics

mgartner commented 2 years ago

When the optimizer calculates the selectivity of an equality filter (e.g. a = 1) with a value that is contained in the range of a histogram bucket, it assumes that the values in the histogram bucket are evenly distributed. For example, given these histogram buckets for the column c UUID:

{
    "distinct_range": 10_000,
    "num_eq": 5000,
    "num_range": 1_000_000,
    "upper_bound": "10000000-0000-0000-0000-000000000000"
},
{
    "distinct_range": 10_000,
    "num_eq": 5000,
    "num_range": 1_000_000,
    "upper_bound": "20000000-0000-0000-0000-000000000000"
},

The optimizer would calculate the selectivity of the filter c = '15000000-0000-0000-0000-000000000000' to be 1,000,000 / 10,000 = 100 rows.

If the distribution of values within buckets is very uneven, then we can vastly underestimate row counts. For example, imagine that within the second bucket above, there are actually 300,000 rows with a value of '15000000-0000-0000-0000-000000000000', and the other 700,000 rows are evenly distributed among the 9999 other distinct values in the bucket - our estimate of 100 rows is off by 3000x. This has come up a few times recently in real-world deployments.

We should collect heavy hitters. We could either:

Store heavy hitter separately from the histogram and somehow incorporate them during statistics building.
Use the heavy hitters as upper_bound values in histograms. The corresponding num_eq will be a more accurate estimate of the cardinality of heavy hitters.

Epic: CRDB-34173

Jira issue: CRDB-10787

michae2 commented 1 year ago

Allow the number of histogram buckets to be configurable.

I think this is already covered by https://github.com/cockroachdb/cockroach/issues/72418

Collect heavy hitter and their counts separately from histogram values.

I propose we make this issue specifically about collecting heavy hitters.

mgartner commented 1 year ago

I propose we make this issue specifically about collecting heavy hitters.

Done.

DrewKimball commented 3 months ago

Store heavy hitter separately from the histogram and somehow incorporate them during statistics building.

Count-min sketches seem like a good candidate for this.

mgartner commented 2 months ago

A combination of increasing the number of histogram buckets and the improvements to histogram sampling in #125345 should mitigate this. Moving to the backlog until we have evidence that more improvements are needed.

cockroachdb / cockroach

opt: collect heavy hitters in table statistics #71828