cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.18k stars 3.82k forks source link

sql: percentile_disc aggregation cannot spill to disk for large bucket #97192

Open michae2 opened 1 year ago

michae2 commented 1 year ago

The percentile_disc aggregation operator keeps an entire bucket in memory at once in order to calculate a percentile value: https://github.com/cockroachdb/cockroach/blob/f91cf78f847590e2c2f9902590ea0edc1906537e/pkg/sql/sem/builtins/aggregate_builtins.go#L4559-L4562

This can use quite a lot of memory (or run into the distsql_workmem limit) if the bucket is very large.

Here's an example:

CREATE TABLE a (a INT, INDEX (a));
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
INSERT INTO a SELECT generate_series(0, 99999);
ANALYZE a;

SET distsql_workmem = '16 MiB';
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY a) AS median FROM a;

This runs into the workmem limit:

root@localhost:26257/defaultdb> SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY a) AS median FROM a;
ERROR: windower-limited: memory budget exceeded: 15989904 bytes requested, 2283520 currently allocated, 16777216 bytes in budget
SQLSTATE: 53200

As a workaround, it's possible to use SELECT ... FROM ... ORDER BY ... LIMIT 1 OFFSET (SELECT count(...) FROM ...) like so, which should be able to spill to disk if necessary:

root@localhost:26257/defaultdb> SELECT a AS median FROM a ORDER BY a LIMIT 1 OFFSET (SELECT count(a) // 2 - 1 FROM a);
  median
----------
   49999
(1 row)

Time: 1.072s total (execution 1.072s / network 0.000s)

Jira issue: CRDB-24566

DrewKimball commented 1 year ago

The cume_dist window operator is already implemented in the vectorized engine (which AFAIK is better about spilling to disk) - maybe we could use that as well?