Closed: mgartner closed this issue 1 week ago
- Allow the number of histogram buckets to be configurable.
I think this is already covered by https://github.com/cockroachdb/cockroach/issues/72418
- Collect heavy hitters and their counts separately from histogram values.
I propose we make this issue specifically about collecting heavy hitters.
Done.
Store heavy hitters separately from the histogram and somehow incorporate them during statistics building.
Count-min sketches seem like a good candidate for this.
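To make the count-min sketch idea concrete, here is a minimal sketch of one in Go. This is illustrative only, not CockroachDB code: the `cmSketch` type and the seeded-FNV hashing scheme are assumptions for the example. Each key increments one counter per row, and the estimate takes the row-wise minimum, so collisions can overestimate a key's count but never underestimate it.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cmSketch is a minimal count-min sketch: a depth×width grid of counters.
// Each key hashes to one counter per row; Estimate takes the minimum
// across rows, so counts are overestimated on collisions, never under.
type cmSketch struct {
	width, depth uint32
	counts       [][]uint64
}

func newCMSketch(width, depth uint32) *cmSketch {
	c := make([][]uint64, depth)
	for i := range c {
		c[i] = make([]uint64, width)
	}
	return &cmSketch{width: width, depth: depth, counts: c}
}

// hash derives a per-row hash by prefixing the key with the row index
// before feeding it to FNV-1a (a simple stand-in for pairwise-independent
// hash functions).
func (s *cmSketch) hash(key string, row uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(key))
	return h.Sum32() % s.width
}

// Add records n occurrences of key.
func (s *cmSketch) Add(key string, n uint64) {
	for row := uint32(0); row < s.depth; row++ {
		s.counts[row][s.hash(key, row)] += n
	}
}

// Estimate returns an upper bound on the number of times key was added.
func (s *cmSketch) Estimate(key string) uint64 {
	var best uint64
	for row := uint32(0); row < s.depth; row++ {
		c := s.counts[row][s.hash(key, row)]
		if row == 0 || c < best {
			best = c
		}
	}
	return best
}

func main() {
	sk := newCMSketch(1<<12, 4)
	// Simulate a skewed column: one heavy hitter among many light values.
	sk.Add("15000000-0000-0000-0000-000000000000", 300000)
	for i := 0; i < 1000; i++ {
		sk.Add(fmt.Sprintf("val-%d", i), 70)
	}
	fmt.Println(sk.Estimate("15000000-0000-0000-0000-000000000000"))
}
```

A sketch like this could be populated during row sampling and consulted at stats-build time to separate heavy hitters from the evenly distributed remainder of a bucket.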
A combination of increasing the number of histogram buckets and the improvements to histogram sampling in #125345 should mitigate this. Moving to the backlog until we have evidence that more improvements are needed.
When the optimizer calculates the selectivity of an equality filter (e.g. `a = 1`) with a value that is contained in the range of a histogram bucket, it assumes that the values in the bucket are evenly distributed. For example, given these histogram buckets for the `UUID` column `c`:

The optimizer would calculate the selectivity of the filter `c = '15000000-0000-0000-0000-000000000000'` to be `1,000,000 / 10,000 = 100` rows.

If the distribution of values within buckets is very uneven, then we can vastly underestimate row counts. For example, imagine that within the second bucket above there are actually 300,000 rows with the value `'15000000-0000-0000-0000-000000000000'`, and the other 700,000 rows are evenly distributed among the 9,999 other distinct values in the bucket: our estimate of 100 rows is off by 3,000x. This has come up a few times recently in real-world deployments.

We should collect heavy hitters. We could either:
- Store heavy hitters separately from the histogram and somehow incorporate them during statistics building. Count-min sketches seem like a good candidate for this.
- Make heavy hitters `upper_bound` values in histograms. The corresponding `num_eq` will then be a more accurate estimate of the cardinality of each heavy hitter.

Epic: CRDB-34173
Jira issue: CRDB-10787
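The 3,000x figure in the description is just the ratio between the skewed reality and the uniformity assumption. A quick Go sketch of the arithmetic, with all figures taken from the description above:

```go
package main

import "fmt"

func main() {
	// Histogram bucket from the example: 1,000,000 rows spread over
	// 10,000 distinct values.
	bucketRows, distinct := 1000000.0, 10000.0

	// Uniformity assumption: every value in the bucket gets an equal share.
	estimated := bucketRows / distinct

	// Skewed reality: one heavy hitter holds 300,000 of those rows.
	actual := 300000.0

	fmt.Printf("estimated=%v actual=%v error=%vx\n", estimated, actual, actual/estimated)
	// prints: estimated=100 actual=300000 error=3000x
}
```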