cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: consider using a unique encoding tag for column families #112702

Open dt opened 11 months ago

dt commented 11 months ago

In almost every release for the past several years, we have shipped, or nearly shipped, at least one bug that caused KV splits to be created between two column families of the same row. Since two column families of the same row are, at the KV level, separate keys, this is easy to do: they are two separate, logically distinct, and, at the KV level, unrelated keys, so when performing KV-level splits there is little or no indication that they need to remain in the same range. This invariant -- that two keys that are part of the same row be in the same range -- is imposed only by higher-level SQL code.

One option might be to eliminate this assumption and make the SQL-level code tolerate reading a row whose column families live in two different ranges. There are, however, performance considerations that have discouraged us from pursuing this: requests that read the last row of a range would potentially need to contact the next range to be sure that there are no additional column families, and requests that mutate a row could become multi-range requests, missing the 1PC optimization. Some of these costs could potentially be partially or wholly mitigated. For example, a response could include information indicating when a request read the last KV of a range, such as the range end-key, allowing the caller to avoid querying the next range; and given our larger range sizes, the number of rows that would actually fall on a range boundary would be too small to meaningfully impact p99s. Whenever this has previously been considered, however, removing the assumption has been deemed less desirable than continuing to try to maintain it, even though maintaining it has proven difficult and bug-prone.

Making this assumption easier to enforce, and to check, is complicated. Column families are currently encoded as up to two varints appended to the end of the row key: one holding the actual family ID and, if the ID was non-zero, a second encoding the byte length of the encoded ID (which is always one byte).
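To make that concrete, here is a minimal sketch of this suffix encoding, modeled on the keys.MakeFamilyKey helper and the varint routines in pkg/util/encoding (details simplified; the real helper differs in signature):

```go
import "github.com/cockroachdb/cockroach/pkg/util/encoding"

// makeFamilyKey sketches the current encoding: family 0 is a single
// zero varint; any other family is its ID varint followed by a varint
// holding the byte length of that encoded ID (always one byte today).
func makeFamilyKey(key []byte, famID uint64) []byte {
	if famID == 0 {
		return encoding.EncodeUvarintAscending(key, 0)
	}
	size := len(key)
	key = encoding.EncodeUvarintAscending(key, famID)
	return encoding.EncodeUvarintAscending(key, uint64(len(key)-size))
}
```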

We have code -- EnsureSafeSplitKey -- that, given a key, tries to return a key that is not a column-family key by checking whether the key ends in two varints and, if so, removing them. However, this is unreliable: a downside of the current column family encoding is that these two varints are indistinguishable from user data, i.e. integer columns. For example, a row key for region='us-east', item_position=3, item_category=1 might be encoded as /Table/101/1/us-east/3/1, but this is also the encoding of the key for column family 3 of the row region='us-east'. This means that one cannot tell, when looking at the byte string /Table/101/1/us-east/3/1, whether it is a row key or the key of a specific column family within a row, which causes problems for EnsureSafeSplitKey.
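A small, self-contained demonstration of the collision, using the real varint helpers from pkg/util/encoding (the table/index prefix is elided here and the column values are made up for illustration):

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/util/encoding"
)

func main() {
	prefix := encoding.EncodeStringAscending(nil, "us-east")

	// Row key for (region='us-east', item_position=3, item_category=1).
	rowKey := append([]byte(nil), prefix...)
	rowKey = encoding.EncodeUvarintAscending(rowKey, 3) // item_position = 3
	rowKey = encoding.EncodeUvarintAscending(rowKey, 1) // item_category = 1

	// Key for column family 3 of the row (region='us-east'): the family
	// ID, then the one-byte length of that encoded ID.
	famKey := append([]byte(nil), prefix...)
	famKey = encoding.EncodeUvarintAscending(famKey, 3) // family ID = 3
	famKey = encoding.EncodeUvarintAscending(famKey, 1) // length of encoded ID

	fmt.Println(bytes.Equal(rowKey, famKey)) // true: identical byte strings
}
```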

Given the unreliability of EnsureSafeSplitKey, it also is not included in the API to split a range -- AdminSplit -- in a way that would act as a failsafe or guardrail against callers sending bad splits. AdminSplit has no way to know whether a proposed split is valid or invalid -- according to the higher-level SQL logic that says a split between two column families is invalid -- since it cannot tell whether a given key is or is not between two families (consider the /Table/101/1/us-east/3/1 example above).

One potential solution to this problem would be to use a new, distinct encoding for column families that is not used by any other data type encoding. This would allow the split-creation facility to verify that a candidate split key does not include a column family by stepping through each piece of the key using PeekType and PeekLength until it reaches the end; if PeekType for any component of the key -- even one followed by more bytes -- reports the column family marker type, the key is known to be a bad split, without needing to know the schema or whether EnsureSafeSplitKey was called.
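Sketched below with the existing PeekType/PeekLength helpers from pkg/util/encoding; encoding.ColumnFamily is a hypothetical new Type value that does not exist in the tree today:

```go
// isSafeSplitKey walks a candidate split key component by component.
// encoding.ColumnFamily is an assumed new type tag; PeekType and
// PeekLength are existing helpers in pkg/util/encoding.
func isSafeSplitKey(key []byte) (bool, error) {
	for len(key) > 0 {
		if encoding.PeekType(key) == encoding.ColumnFamily { // hypothetical tag
			return false, nil // the key addresses a column family: bad split
		}
		n, err := encoding.PeekLength(key)
		if err != nil {
			return false, err
		}
		key = key[n:]
	}
	return true, nil
}
```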

One challenge to using a new encoding is performance: the varint encoding reserves a large number of values of the marker byte so that it can represent several small integer values directly in the choice of marker byte, without needing additional value bytes. Typical column family IDs, which start at 0 and are mostly bounded by the number of columns in a single table, as well as the lengths of such IDs, fit in these small, marker-represented values, and thus most column families are represented by single-byte varints. A new marker byte type, however, would need to use a separate byte in addition to the varint.
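For reference, a quick check with the real helper shows the single-byte property for small values (the value is folded into the marker byte itself):

```go
// Small integers occupy a single byte in the ascending uvarint
// encoding rather than a marker byte plus trailing value bytes.
b := encoding.EncodeUvarintAscending(nil, 3)
fmt.Println(len(b)) // 1
```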

One optimization is to reserve a separate marker byte for column family 0 -- far and away the most common one -- distinct from the marker used for families > 0, so that the zero family continues to be representable by a single marker byte (now CF0 instead of Int0). In this scheme, a column family key in a table with many families might look like /Table/101/1/us-east/david/3/1/CF, where the column family ID is 3, the length of that ID is 1, and the CF marker byte sits immediately after the length. For a common table with a single zero-ID family, the key would look like /Table/101/1/us-east/david/CF0, where the CF0 marker byte -- distinct from the CF byte -- takes the place of both the family ID and the length.
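A sketch of that proposed suffix, with illustrative marker byte values (the concrete values here are made up and would need to come from the encoding package's reserved tag space):

```go
// Hypothetical marker bytes for the proposed encoding; not real tags.
const (
	markerCF  byte = 0xFE // terminates a non-zero family ID + length
	markerCF0 byte = 0xFD // stands alone for the common zero family
)

// appendFamilySuffixProposed sketches the proposal: family 0 becomes a
// single CF0 byte; other families keep today's varint ID and length,
// followed by the CF marker byte.
func appendFamilySuffixProposed(key []byte, famID uint64) []byte {
	if famID == 0 {
		return append(key, markerCF0)
	}
	size := len(key)
	key = encoding.EncodeUvarintAscending(key, famID)
	key = encoding.EncodeUvarintAscending(key, uint64(len(key)-size))
	return append(key, markerCF)
}
```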

Another challenge of this approach is that existing indexes cannot use a new encoding: uniqueness checks depend on there being a single canonical row-key byte string for a given tuple of key values, and that byte string includes any encoded column family information. Thus this new column family encoding would need to be used in, and only in, newly created indexes, which would use it exclusively. This is relatively easily achieved by persisting the choice to use it in the descriptor. As long as some new indexes use the new encoding in testing, they should allow detection of code paths that can result in sending bad splits, allowing those paths to be fixed even for older indexes that cannot use the new encoding.

Along these lines, this new encoding could potentially be enabled only for testing clusters and still deliver the majority of the value in that it would allow loudly and quickly identifying paths that send bad splits.

Jira issue: CRDB-32558

dt commented 8 months ago

I played around a little with this and it might not be that messy to introduce a new encoding tag for col families?

https://github.com/cockroachdb/cockroach/compare/master...dt:cockroach:wip/splits?expand=1

mgartner commented 5 months ago

Moving to 24.2 bucket for now—let's chat about this as a possible quality-related project.