opt: plan local scans for REGIONAL BY ROW tables with computed region

rytaft commented 3 years ago

As described in #57682, we will enable customers to specify a computed column expression for the region based on other columns in their schema. If the columns that the computed column depends on are constrained by a query, we should be able to constrain the computed column as well and ensure local scans for many queries.

For example, consider the example of TPC-C. The schema might be specified as CREATE TABLE warehouse (w_id INT PRIMARY KEY, ...) REGIONAL BY ROW AS (CASE WHEN w_id < 5 THEN 'us-east1' ELSE 'us-west2' END). As described in #57718, the PRIMARY KEY designation will be syntactic sugar for PRIMARY KEY (region, w_id), UNIQUE WITHOUT INDEX (w_id). If a user submits the query SELECT * FROM warehouse WHERE w_id = 3, the optimizer should be able to infer an additional filter, region = 'us-east1', allowing the primary index to be constrained.

This issue covers the work to ensure that the optimizer infers the additional filter on the computed column when all of its dependent columns are constrained.

Epic CRDB-13964

Jira issue: CRDB-3492

nvanbenschoten commented 3 years ago

Is any work needed for this? I tested out an old suite of tests compiled for hash-sharded indexes yesterday and was pleased to find that we now support all desired optimizations.

Here's an example of the behavior I believe this issue is talking about at work:

demo@127.0.0.1:26257/movr> create table computed (b bool as (i % 2 = 0) stored, i int, primary key (b, i));
CREATE TABLE

demo@127.0.0.1:26257/movr> explain select * from computed where i = 15;
                info
------------------------------------
  distribution: local
  vectorized: true

  • scan
    missing stats
    table: computed@primary
    spans: [/false/15 - /false/15]
(7 rows)

demo@127.0.0.1:26257/movr> explain select * from computed where i = 14;
               info
----------------------------------
  distribution: local
  vectorized: true

  • scan
    missing stats
    table: computed@primary
    spans: [/true/14 - /true/14]
(7 rows)

rytaft commented 3 years ago

Yea, I'm hoping that this will "just work" for the most important cases. At the very least, I want to add test cases for some multi-region-specific examples. One other thing that may be possible is to infer a range/set of values rather than a constant. Using your example from above:

demo@127.0.0.1:26257/movr> create table computed (b bool as (i % 2 = 0) stored, i int, primary key (b, i));
CREATE TABLE

Time: 3ms total (execution 3ms / network 0ms)

demo@127.0.0.1:26257/movr> explain select * from computed where i = 15;
                info
------------------------------------
  distribution: local
  vectorized: true

  • scan
    missing stats
    table: computed@primary
    spans: [/false/15 - /false/15]
(7 rows)

Time: 11ms total (execution 10ms / network 1ms)

demo@127.0.0.1:26257/movr> explain select * from computed where i = 15 OR i = 13;
               info
----------------------------------
  distribution: full
  vectorized: true

  • filter
  │ filter: (i = 15) OR (i = 13)
  │
  └── • scan
        missing stats
        table: computed@primary
        spans: FULL SCAN
(10 rows)

Time: 1ms total (execution 0ms / network 0ms)

The second case is unconstrained, but we should be able to constrain it two two spans. I'm not sure how important of a use case this is, but if we can avoid fanning out to all regions in cases like this I think it would be valuable.

nvanbenschoten commented 3 years ago

The second case is unconstrained, but we should be able to constrain it two two spans. I'm not sure how important of a use case this is, but if we can avoid fanning out to all regions in cases like this I think it would be valuable.

I agree that supporting this would be valuable.

At the very least, I want to add test cases for some multi-region-specific examples.

Adding test cases for all of these different queries and doing so using multi-region schemas sounds like a great idea!

rytaft commented 3 years ago

It should be relatively straightforward to infer a range/set of values rather than a constant, if this is deemed a priority. It does somewhat complicate the code, though, and it's not clear that this is an important use case, so I'm not going to try to get it into 21.1. #61304 shows that the most important use case already "just works".

github-actions[bot] commented 10 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

rytaft commented 7 months ago

still relevant

cockroachdb / cockroach

opt: plan local scans for REGIONAL BY ROW tables with computed region #57722