cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.11k stars 3.81k forks source link

opt: add implicit join with global table using FK constraint to avoid scanning remote rows #91084

Open msirek opened 2 years ago

msirek commented 2 years ago

This issue is originally described towards the end of #69617.

Describe the solution you'd like Given a query on a REGIONAL BY ROW table that has a foreign key constraint on a global table, when the query contains a predicate on the key of the global table and the crdb_region column value is not specified via an equality filter, add an implicit join with the global table to potentially avoid scanning rows in remote regions. Example:

SET experimental_enable_unique_without_index_constraints = true;

CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE WITHOUT INDEX (account_id, crdb_region)
) LOCALITY GLOBAL;

CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    PRIMARY KEY (account_id, tweet_id),
    FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW;

SELECT tweet_id, message
FROM tweets
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';

This query in its current form must access all regions because the primary key of tweets is not fully specified, so locality-optimized search cannot be used. The global table can be accessed locally, and specifies the crdb_region column value of the row in tweets via the FK constraint, so transforming the query into the following avoids the unconditional scan from all regions:

EXPLAIN
SELECT tweet_id, message
FROM tweets
WHERE (crdb_region, account_id) IN
  (SELECT crdb_region, account_id
   FROM accounts
   WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8');
                                                  info
--------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ estimated row count: 1
  │ table: tweets@tweets_pkey
  │ equality: (crdb_region, lookup_join_const_col_@1) = (crdb_region,account_id)
  │
  └── • render
      │
      └── • scan
            estimated row count: 1 (100% of the table; stats collected 8 minutes ago)
            table: accounts@accounts_pkey
            spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'6f781502-4936-43cc-b384-04e5cf292cc8']

It's still possible the crdb_region column value is a remote region, but this plan at least provides the option of skipping scans from remote regions when crdb_region is the local region, or if no row with the specified account_id exists in accounts.

A second issue is that UNIQUE WITHOUT INDEX (account_id, crdb_region) should not be required to create the foreign key constraint as the primary key column, account_id, implies the uniqueness of (account_id, crdb_region). This issue could optionally do away with this unnecessary requirement.

Describe alternatives you've considered None

Jira issue: CRDB-21100

yuzefovich commented 1 year ago

Some context.

mgartner commented 7 months ago

More context.