cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.89k stars 3.78k forks source link

opt: perform upsert fast-path if all columns in a family are specified #86907

Open fabiog1901 opened 2 years ago

fabiog1901 commented 2 years ago

When planning an UPSERT specifying all columns in the column family, the opt also plans a scan instead of doing the fast-path.

The goal is to replace an UPDATE with an UPSERT in order to save time on the read part of the Update. By putting all to-be-updated columns in its own column family, and specifying just those columns in the upsert, the expectation is to use the fast-path and avoid the scan.

root@:26257/defaultdb> create table t (
  id string not null,
  name string not null,
  age int null,
  blob string null,
  constraint "primary" primary key (id),
  family "main" (id, name, age),
  family "aux" (blob)
);

root@:26257/defaultdb> insert into t values (1, 'joe', 22, 'lots of blah blah');

There is no scan if upserting ALL columns

root@:26257/defaultdb> explain (verbose) upsert into t values (1, 'gino', 33, 'more blah blah');                                                                                                          
                                       info
----------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • upsert
  │ columns: ()
  │ estimated row count: 0 (missing stats)
  │ into: t(id, name, age, blob)
  │ auto commit
  │
  └── • project
      │ columns: (id_cast, column2, column3, column4, column2, column3, column4)
      │
      └── • values
            columns: (column2, column3, column4, id_cast)
            size: 4 columns, 1 row
            row 0, expr 0: 'gino'
            row 0, expr 1: 33
            row 0, expr 2: 'more blah blah'
            row 0, expr 3: '1'
(19 rows)

Time: 1ms total (execution 1ms / network 0ms)

It will do a scan even though the 3 cols are contained within a family, 'main'.

root@:26257/defaultdb> explain (verbose) upsert into t (id, name, age) values (1, 'joe', 44);
                                              info
-------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • upsert
  │ columns: ()
  │ estimated row count: 0 (missing stats)
  │ into: t(id, name, age, blob)
  │ auto commit
  │ arbiter indexes: primary
  │
  └── • project
      │ columns: (id_cast, column2, column3, blob_default, id, name, age, column2, column3, id)
      │
      └── • cross join (left outer)
          │ columns: (column2, column3, id_cast, blob_default, id, name, age)
          │ estimated row count: 1 (missing stats)
          │
          ├── • values
          │     columns: (column2, column3, id_cast, blob_default)
          │     size: 4 columns, 1 row
          │     row 0, expr 0: 'joe'
          │     row 0, expr 1: 44
          │     row 0, expr 2: '1'
          │     row 0, expr 3: CAST(NULL AS STRING)
          │
          └── • scan
                columns: (id, name, age)
                estimated row count: 1 (missing stats)
                table: t@primary
                spans: /"1"/0
                locking strength: for update
(31 rows)

Time: 2ms total (execution 1ms / network 0ms)

Jira issue: CRDB-18999

gz#14308

mgartner commented 5 months ago

I was initially worried about the insert case of the upsert because I incorrectly assumed we'd have to insert a NULL value for the other column families, which would require a read to determine if it is an insert case. But we don't insert KVs for column families that have all NULL columns, so I think we should be fine:

statement ok
CREATE TABLE t (
  a INT PRIMARY KEY,
  b INT,
  c INT,
  FAMILY (a, b),
  FAMILY (c)
)

query T kvtrace
INSERT INTO t (a, b) VALUES (2, 20)
----
CPut /Table/106/1/2/0 -> /TUPLE/2:2:Int/20
mgartner commented 5 months ago

Some additional requirements:

  1. All columns in the primary family must be specified.
  2. If any columns from non-primary families are specified, all other columns from those families must also be specified.