Open mgartner opened 1 year ago
I believe the integer case is faster because only 1000 spans are generated in the worst case, not 1 million. This is because the values in the IN filters form a contiguous range, unlike in the STRING case, so there are 1000 spans like:
[/1/1 - /1/1000]
[/2/1 - /2/1000]
[/3/1 - /3/1000]
...
In the string case, the 1 million spans look like (quote omitted);
[/1/1 - /1/1]
[/1/2 - /1/2]
[/1/3 - /1/3]
...
[/2/1 - /2/1]
[/2/2 - /2/2]
[/2/3 - /2/3]
...
I believe the integer case is faster because only 1000 spans are generated in the worst case, not 1 million.
Confirmed.
[triage] Since we end up picking a full table scan anyway, one option is to add logic similar to the costing logic which decides that above a certain number of spans a full scan is better anyway. We could give up halfway through GenerateConstrainedScans after building a certain number of spans.
Or we could just use one of the IN sets for building constraints.
Consider the table and query below:
The filters in this query cause the optimizer to explore a constrained scan with 1 million spans - one for each combination of the 1000 values of
k
and the 1000 values ofa
. Theses 1 million spans stress algorithms within the optimizer that have time complexity ofO(number of spans)
, which significantly slows down planning time.CPU and memory profiles highlight significant inefficiencies when filter histograms. As one example, when calculating the selectivity of each span to be unioned together, we copy the column's histogram for each constraint:
https://github.com/cockroachdb/cockroach/blob/57abe808178987f4acf2c87d549e2eed1909eca5/pkg/sql/opt/memo/statistics_builder.go#L855
Some potential improvements here could be:
Jira issue: CRDB-26709