cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

mini-rfc: Non-primary column families #42038

Open bdarnell opened 4 years ago

bdarnell commented 4 years ago

This idea was inspired by brainstorming around partitioning and the effort to close the gap between primary and secondary indexes (#41989). I don't have a concrete use case for this yet so I'm just writing it up briefly to see if there's any interest in pursuing this idea further.

Column families allow the primary index (and currently only the primary index) to be divided into multiple KV pairs. This has two main benefits: fine-grained latching for reduced contention (useful at least for YCSB), and reduction in write amplification (especially if there are infrequently-updated blob columns). However, it's kind of a complex and subtle special case for something so rarely used. I propose replacing this with a generalization of storing indexes.

In the new model, a table with two column families would have two indexes instead of a single primary key. Each of these indexes would have the same key columns, but "store" a different subset of the table's columns. This would change the constructed key from /$TABLE/1/$PK/$FAMILY to /$TABLE/$INDEX/$PK/0, and place columns from different families far apart from each other. This means that single-row operations are no longer guaranteed to be single-range, which is a downside if you often operate on the entire row, but could be a benefit if you usually operate on parts of the row at a time (which is exactly the time when column families make sense). The benefit would be especially useful in the "blob" use case, since the non-blob column family would be denser with real data. A "free" side effect is that column families would become targets for zone configs, so you could store your blobs on cheaper storage (and maybe this could be a step towards column-level security that goes all the way through the KV layer).
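
To make the change in key ordering concrete, here is a rough Go sketch that renders both layouts as readable strings and sorts them. It is only an illustration: the table ID 53, the zero-padded PK, the mapping of family N to index N+1, and the helper names (`currentKey`, `proposedKey`, `layoutKeys`) are all assumptions made for this example, and the strings are not the real binary key encoding.

```go
package main

import (
	"fmt"
	"sort"
)

// currentKey renders the existing layout /$TABLE/1/$PK/$FAMILY.
// Index 1 is the primary index; the family ID is the key suffix.
func currentKey(table, pk, family int) string {
	return fmt.Sprintf("/%d/1/%04d/%d", table, pk, family)
}

// proposedKey renders the proposed layout /$TABLE/$INDEX/$PK/0.
func proposedKey(table, index, pk int) string {
	return fmt.Sprintf("/%d/%d/%04d/0", table, index, pk)
}

// layoutKeys builds the keys for `rows` rows and `families` column
// families under both layouts and returns each set in sorted (KV) order.
// Assumption for this sketch: family N maps to index N+1.
func layoutKeys(table, rows, families int) (cur, prop []string) {
	for pk := 1; pk <= rows; pk++ {
		for fam := 0; fam < families; fam++ {
			cur = append(cur, currentKey(table, pk, fam))
			prop = append(prop, proposedKey(table, fam+1, pk))
		}
	}
	sort.Strings(cur)
	sort.Strings(prop)
	return cur, prop
}

func main() {
	cur, prop := layoutKeys(53, 3, 2) // 53 is a hypothetical table ID
	fmt.Println("current:")
	for _, k := range cur {
		fmt.Println("  " + k)
	}
	fmt.Println("proposed:")
	for _, k := range prop {
		fmt.Println("  " + k)
	}
}
```

In the current layout the two families of a given row sort next to each other, so a whole-row read is one contiguous span. In the proposed layout all of one family's keys come first and the other family's keys follow, which is why a whole-row operation may touch two ranges while a scan over a single family only passes over that family's data.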

This model gets more interesting if we generalize it from "two half-primary keys" to "every column must be stored in at least one index" (and more subtly, there must be paths to look up every column given a PK). This allows for columns in different families to even be partitioned differently (for example to make some columns available for follower reads in other regions while other columns are replica-partitioned to have faster writes in the home region).
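
One way to picture the generalized invariant is as a reachability check over the table's indexes. The Go sketch below is a simplified model and not anything from the CockroachDB codebase: the `Index` type, the `unreachable` function, and the rule "an index is usable once all of its key columns are known" are assumptions made for illustration.

```go
package main

import "fmt"

// Index is a hypothetical, simplified index descriptor.
type Index struct {
	Name   string
	Key    []string // key columns
	Stored []string // columns stored in the index value
}

// allKnown reports whether every column in cols is already known.
func allKnown(cols []string, known map[string]bool) bool {
	for _, c := range cols {
		if !known[c] {
			return false
		}
	}
	return true
}

// unreachable returns the columns that cannot be looked up starting from
// the primary-key columns. An index becomes usable once all of its key
// columns are known; using it makes its stored columns known as well.
func unreachable(columns, pk []string, indexes []Index) []string {
	known := map[string]bool{}
	for _, c := range pk {
		known[c] = true
	}
	used := make([]bool, len(indexes))
	// Iterate to a fixed point: keep going while a new index becomes usable.
	for progress := true; progress; {
		progress = false
		for i, idx := range indexes {
			if used[i] || !allKnown(idx.Key, known) {
				continue
			}
			used[i] = true
			progress = true
			for _, c := range idx.Stored {
				known[c] = true
			}
		}
	}
	var missing []string
	for _, c := range columns {
		if !known[c] {
			missing = append(missing, c)
		}
	}
	return missing
}

func main() {
	columns := []string{"id", "name", "email", "photo"}
	pk := []string{"id"}
	indexes := []Index{
		{Name: "main", Key: []string{"id"}, Stored: []string{"name", "email"}},
		{Name: "by_email", Key: []string{"email"}, Stored: []string{"photo"}},
	}
	fmt.Println(unreachable(columns, pk, indexes))     // nothing missing
	fmt.Println(unreachable(columns, pk, indexes[1:])) // "main" dropped
}
```

In this example `photo` is stored only in `by_email`, so it is reachable from the PK only through `main`, which supplies `email`. Dropping `main` leaves `name`, `email`, and `photo` without a lookup path, which is the kind of cross-index dependency a schema change would have to preserve.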

This model appeals to me on a theoretical level because it replaces the "special case" of column families with a generalization of the relationship between tables and indexes. However, it also introduces a lot of new complexity in the form of complex relationships between indexes and invariants that need to be preserved. I think I've mostly talked myself out of this idea since I haven't been able to come up with use cases that it would help, but I wanted to write it down for posterity and see if anyone else was inspired by it.

Jira issue: CRDB-5398

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!