cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.96k stars 3.79k forks source link

Support for the modification of a column family #68890

Open cjireland opened 3 years ago

cjireland commented 3 years ago

Whilst we would like to think that all design considerations are taken into account when designing a schema sometimes it is necessary to refactor in order to get the best out of a CRDB cluster.

Typically the default column family is accepted during implementation when there are optimisations that can be performed. One such optimisation is to move columns that are not updated (but may also be large e.g. BINARY) to a separate family. This ensures that the version produced by an UPDATE is kept small.

Of course we can scope down the gc.ttlseconds to keep on top of these old versions but sometimes that isn't always the best solution (consider BACKUP frequency). We might also consider putting the large columns in a separate table but this then creates more work for both INSERT and SELECT for a business entity (and anyway we have column families!).

The concept of a column family is very powerful but currently we cannot amend a family or indeed create a new one from existing columns. It is only possible to put a new column into a new family.

Describe the solution you'd like

Over time it may become apparent that a column family needs to be refactored so support should be provided to:

  1. Remove a column from a family (without dropping it!)
  2. Add an existing column to an existing family;
  3. Add a new column to an existing family (noted that we can already add a new column to a new family);
  4. Remove an entire family without removing the columns.

Describe alternatives you've considered In order to refactor the current solution is to create a new table with the new families defined and then copy the data over from the existing table. Whilst this will work it is cumbersome for large tables and adds complexity to CD/CI pipelines.

Jira issue: CRDB-9345

ajwerner commented 3 years ago

Changing column families requires backfilling the entire table. There's no getting around that. Internally you should think of a column as having a column family and that not being able to change. It's very similar to a column having a type. Of course, we can add syntactic sugar to make it easier to refactor these things in terms of statements. We can also improve the ability to make these changes atomically and without downtime.

The biggest problem I see today is cockroach's lack of transactional schema changes. Ideally changing a column family could be expressed as the following. However, this sadly will not work today. We are hard at working realizing that vision but it's going to be a while.

CREATE TABLE foo (i INT PRIMARY KEY, j INT FAMILY "j");
BEGIN;
ALTER TABLE foo RENAME COLUMN j TO old_column;
ALTER TABLE foo ADD COLUMN j INT FAMILY "primary" DEFAULT (old_column);
ALTER TABLE foo ALTER COLUMN j DROP DEFAULT;
ALTER TABLE foo DROP COLUMN old_column;
COMMIT;

What will work is:

CREATE TABLE foo (i INT PRIMARY KEY, j INT FAMILY "j");
ALTER TABLE foo ADD COLUMN new_column INT FAMILY "primary" NOT VISIBLE AS (j) STORED;
ALTER TABLE foo ALTER COLUMN new_column DROP STORED, 
    ALTER COLUMN new_column SET VISIBLE,
    RENAME COLUMN j TO old_column,
    RENAME COLUMN new_column TO j,
    ALTER COLUMN old_column SET NOT VISIBLE;
ALTER TABLE foo DROP COLUMN old_column;

This may or may not be worse than creating a new table depending on the secondary indexes in play.


Anyway, this is a nice placeholder for future syntax related to column families though I caution you not to expect any movement on that front any time soon.