cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.15k stars 3.81k forks source link

kvstreamer: consider setting MaxSpanRequestKeys on parallel batches issued by the Streamer #67885

Open yuzefovich opened 3 years ago

yuzefovich commented 3 years ago

Once #67040 is implemented, we will have a library that performs parallel scans while adhering to memory limits. In order to simplify the discussion on the RFC, we have consciously put aside thinking about queries with LIMIT. As a follow-up task to improving the implementation/usage of the Streamer library we should revisit the cases when we have a hard or a soft limit and set MaxSpanRequestKeys on the batches whenever appropriate.

Quoting Nathan from the RFC review:

At a minimum, the Streamer can use MaxSpanRequestKeys to place upper bounds on each
individual batch of ScanRequests. Even if we assume that all other concurrent batches will
return 0 rows, this can still be useful to place an upper bound on how far we can overshoot
the limit. Without the use of MaxSpanRequestKeys, there is no limit to how far we can
overshoot. With a large enough TargetBytes and with small keys, we can pull back
thousands of unnecessary keys and scan hundreds of unnecessary ranges (especially with
many MVCC tombstones in the way). With the most conservative use of
MaxSpanRequestKeys, we can bound the amount we can overshoot to (P - 1) * limit keys.

This discussion is applicable to lookup (not index) joins, regardless of whether the lookup columns form a key and whether there is ON expression. Quoting Becca:

When lookup columns form a key, we don't know that all input rows will have matches. We
may select a larger number of input rows, but only want the top k that have matches.

When lookup columns don't form key:
- empty ON expression - we could set a hard limit on each lookup.
- non-empty ON expression - the optimizer can estimate the selectivity of the ON
  expression and determine a soft limit (or "limit hint") based on that.

Jira issue: CRDB-8760

github-actions[bot] commented 10 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!