cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.84k stars 3.77k forks source link

opt: validation queries for unique constraints on virtual columns should use a streaming group-by #68173

Open mgartner opened 3 years ago

mgartner commented 3 years ago

When a unique constraint is added to a REGIONAL BY ROW table, a validation query is run to ensure that the columns are indeed unique. This query normally uses the memory efficient streaming group-by operator rather than a more costly hash-based group-by operator.

For example:

statement ok
CREATE TABLE regional_by_row (
  a INT,
  b INT,
  UNIQUE (b),
  FAMILY (a, b)
) LOCALITY REGIONAL BY ROW

statement ok
ALTER TABLE regional_by_row INJECT STATISTICS '[
  {
    "columns": ["a"],
    "distinct_count": 3,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  },
  {
    "columns": ["b"],
    "distinct_count": 999900,
    "null_count": 5,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  }
]'

# This query mimics the validation query executed when a unique constraint is
# added to a table.
query T
EXPLAIN SELECT b FROM regional_by_row WHERE b IS NOT NULL GROUP BY b HAVING count(*) > 1  LIMIT 1
----
distribution: full
vectorized: true
·
• limit
│ estimated row count: 1
│ count: 1
│
└── • filter
    │ estimated row count: 333,300
    │ filter: count_rows > 1
    │
    └── • group
        │ estimated row count: 999,900
        │ group by: b
        │ ordered: +b
        │
        └── • union all
            │ estimated row count: 1,000,000
            │
            ├── • union all
            │   │ estimated row count: 666,666
            │   │
            │   ├── • scan
            │   │     estimated row count: 333,333 (33% of the table; stats collected <hidden> ago)
            │   │     table: regional_by_row@regional_by_row_b_key
            │   │     spans: (/'ap-southeast-2'/NULL - /'ap-southeast-2']
            │   │
            │   └── • scan
            │         estimated row count: 333,333 (33% of the table; stats collected <hidden> ago)
            │         table: regional_by_row@regional_by_row_b_key
            │         spans: (/'ca-central-1'/NULL - /'ca-central-1']
            │
            └── • scan
                  estimated row count: 333,333 (33% of the table; stats collected <hidden> ago)
                  table: regional_by_row@regional_by_row_b_key
                  spans: (/'us-east-1'/NULL - /'us-east-1']

However, when a virtual computed column is included in the unique constraint the streaming group-by is not planned, making the validation of the constraint costly. Because expression indexes are built using virtual columns, validation queries of unique expression indexes will also be inefficient.

statement ok
CREATE TABLE regional_by_row_virt_col (
  a INT,
  b INT,
  v INT AS (a + b) VIRTUAL,
  UNIQUE INDEX (v),
  FAMILY (a, b)
) LOCALITY REGIONAL BY ROW

statement ok
ALTER TABLE regional_by_row_virt_col INJECT STATISTICS '[
  {
    "columns": ["a"],
    "distinct_count": 3,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  },
  {
    "columns": ["b"],
    "distinct_count": 999900,
    "null_count": 5,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  },
  {
    "columns": ["v"],
    "distinct_count": 999900,
    "null_count": 5,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  }
]'

# This query mimics the validation query executed when a unique constraint is
# added to a table.
query T
EXPLAIN SELECT v FROM regional_by_row_virt_col WHERE v IS NOT NULL GROUP BY v HAVING count(*) > 1  LIMIT 1
----
distribution: full
vectorized: true
·
• limit
│ estimated row count: 1
│ count: 1
│
└── • filter
    │ estimated row count: 111,111
    │ filter: count_rows > 1
    │
    └── • group
        │ estimated row count: 333,333
        │ group by: v
        │
        └── • render
            │ estimated row count: 333,333
            │
            └── • filter
                │ estimated row count: 333,333
                │ filter: (a + b) IS NOT NULL
                │
                └── • scan
                      estimated row count: 1,000,000 (100% of the table; stats collected <hidden> ago)
                      table: regional_by_row_virt_col@primary
                      spans: [/'ap-southeast-2' - /'us-east-1']

The optimizer should plan a streaming group by in this case.

Jira issue: CRDB-8911

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!