opt: validation queries for unique constraints on virtual columns should use a streaming group-by

When a unique constraint is added to a REGIONAL BY ROW table, a validation query is run to ensure that the columns are indeed unique. This query normally uses the memory efficient streaming group-by operator rather than a more costly hash-based group-by operator.

For example:

statement ok
CREATE TABLE regional_by_row (
  a INT,
  b INT,
  UNIQUE (b),
  FAMILY (a, b)
) LOCALITY REGIONAL BY ROW

statement ok
ALTER TABLE regional_by_row INJECT STATISTICS '[
  {
    "columns": ["a"],
    "distinct_count": 3,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  },
  {
    "columns": ["b"],
    "distinct_count": 999900,
    "null_count": 5,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  }
]'

# This query mimics the validation query executed when a unique constraint is
# added to a table.
query T
EXPLAIN SELECT b FROM regional_by_row WHERE b IS NOT NULL GROUP BY b HAVING count(*) > 1  LIMIT 1
----
distribution: full
vectorized: true
·
• limit
│ estimated row count: 1
│ count: 1
│
└── • filter
    │ estimated row count: 333,300
    │ filter: count_rows > 1
    │
    └── • group
        │ estimated row count: 999,900
        │ group by: b
        │ ordered: +b
        │
        └── • union all
            │ estimated row count: 1,000,000
            │
            ├── • union all
            │   │ estimated row count: 666,666
            │   │
            │   ├── • scan
            │   │     estimated row count: 333,333 (33% of the table; stats collected <hidden> ago)
            │   │     table: regional_by_row@regional_by_row_b_key
            │   │     spans: (/'ap-southeast-2'/NULL - /'ap-southeast-2']
            │   │
            │   └── • scan
            │         estimated row count: 333,333 (33% of the table; stats collected <hidden> ago)
            │         table: regional_by_row@regional_by_row_b_key
            │         spans: (/'ca-central-1'/NULL - /'ca-central-1']
            │
            └── • scan
                  estimated row count: 333,333 (33% of the table; stats collected <hidden> ago)
                  table: regional_by_row@regional_by_row_b_key
                  spans: (/'us-east-1'/NULL - /'us-east-1']

However, when a virtual computed column is included in the unique constraint the streaming group-by is not planned, making the validation of the constraint costly. Because expression indexes are built using virtual columns, validation queries of unique expression indexes will also be inefficient.

statement ok
CREATE TABLE regional_by_row_virt_col (
  a INT,
  b INT,
  v INT AS (a + b) VIRTUAL,
  UNIQUE INDEX (v),
  FAMILY (a, b)
) LOCALITY REGIONAL BY ROW

statement ok
ALTER TABLE regional_by_row_virt_col INJECT STATISTICS '[
  {
    "columns": ["a"],
    "distinct_count": 3,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  },
  {
    "columns": ["b"],
    "distinct_count": 999900,
    "null_count": 5,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  },
  {
    "columns": ["v"],
    "distinct_count": 999900,
    "null_count": 5,
    "row_count": 1000000,
    "created_at": "2018-01-01 1:00:00.00000+00:00"
  }
]'

# This query mimics the validation query executed when a unique constraint is
# added to a table.
query T
EXPLAIN SELECT v FROM regional_by_row_virt_col WHERE v IS NOT NULL GROUP BY v HAVING count(*) > 1  LIMIT 1
----
distribution: full
vectorized: true
·
• limit
│ estimated row count: 1
│ count: 1
│
└── • filter
    │ estimated row count: 111,111
    │ filter: count_rows > 1
    │
    └── • group
        │ estimated row count: 333,333
        │ group by: v
        │
        └── • render
            │ estimated row count: 333,333
            │
            └── • filter
                │ estimated row count: 333,333
                │ filter: (a + b) IS NOT NULL
                │
                └── • scan
                      estimated row count: 1,000,000 (100% of the table; stats collected <hidden> ago)
                      table: regional_by_row_virt_col@primary
                      spans: [/'ap-southeast-2' - /'us-east-1']

The optimizer should plan a streaming group by in this case.

Jira issue: CRDB-8911

cockroachdb / cockroach

opt: validation queries for unique constraints on virtual columns should use a streaming group-by #68173