cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: consider using a unique encoding tag for column families #112702

Open dt opened 11 months ago

dt commented 11 months ago

In almost every release for the past several years, we have shipped, or nearly shipped, at least one bug that caused KV splits to be created between two column families of the same row. Since two column families of the same row are, at the KV level, separate keys, this is easy to do: they are two separate, logically distinct, and, at the KV level, unrelated keys, so when performing KV-level splits there is little or no indication that they need to remain in the same range. This invariant -- that two keys that are part of the same row be in the same range -- is imposed only by higher-level SQL code.

One option might be to eliminate this assumption and make the SQL-level code tolerate reading a row whose column families live in two different ranges. There are, however, performance considerations that have discouraged us from pursuing this: requests that read the last row of a range would potentially need to contact the next range to be sure that there are no additional column families, and requests that mutate a row could become multi-range requests, missing the 1PC optimization. Some of these costs could potentially be partially or wholly mitigated. For example, a response could include information indicating when a request read the last KV of a range, such as the range end-key, allowing the caller to avoid querying the next range; and given our larger range sizes, the number of rows that would actually fall on a range boundary would be too small to meaningfully impact p99s. Whenever this has previously been considered, however, removing the assumption has been deemed less desirable than continuing to try to maintain it, even though maintaining it has proven difficult and bug-prone.

Making this assumption easier to enforce, and to check, is complicated. Column families are currently encoded as up to two varints appended to the end of the row key: one holding the actual family ID and, if the ID was non-zero, a second encoding the byte length of the encoded ID (which is always one byte).
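To make that concrete, here is a minimal sketch of this suffix encoding, modeled on the keys.MakeFamilyKey helper and the varint routines in pkg/util/encoding (details simplified; the real helper differs in signature):

```go
import "github.com/cockroachdb/cockroach/pkg/util/encoding"

// makeFamilyKey sketches the current encoding: family 0 is a single
// zero varint; any other family is its ID varint followed by a varint
// holding the byte length of that encoded ID (always one byte today).
func makeFamilyKey(key []byte, famID uint64) []byte {
	if famID == 0 {
		return encoding.EncodeUvarintAscending(key, 0)
	}
	size := len(key)
	key = encoding.EncodeUvarintAscending(key, famID)
	return encoding.EncodeUvarintAscending(key, uint64(len(key)-size))
}
```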

We have code -- EnsureSafeSplitKey -- that, given a key, tries to return a key that is not a column-family key by checking whether the key ends in two varints and, if so, removing them. However, this is unreliable: a downside of the current column family encoding is that these two varints are indistinguishable from user data, i.e. integer columns. For example, a row key for region='us-east', item_position=3, item_category=1 might be encoded as /Table/101/1/us-east/3/1, but this is also the encoding of the key for column family 3 of the row region='us-east'. This means that one cannot tell, when looking at the byte string /Table/101/1/us-east/3/1, whether it is a row key or the key of a specific column family within a row, which causes problems for EnsureSafeSplitKey.
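A small, self-contained demonstration of the collision, using the real varint helpers from pkg/util/encoding (the table/index prefix is elided here and the column values are made up for illustration):

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/util/encoding"
)

func main() {
	prefix := encoding.EncodeStringAscending(nil, "us-east")

	// Row key for (region='us-east', item_position=3, item_category=1).
	rowKey := append([]byte(nil), prefix...)
	rowKey = encoding.EncodeUvarintAscending(rowKey, 3) // item_position = 3
	rowKey = encoding.EncodeUvarintAscending(rowKey, 1) // item_category = 1

	// Key for column family 3 of the row (region='us-east'): the family
	// ID, then the one-byte length of that encoded ID.
	famKey := append([]byte(nil), prefix...)
	famKey = encoding.EncodeUvarintAscending(famKey, 3) // family ID = 3
	famKey = encoding.EncodeUvarintAscending(famKey, 1) // length of encoded ID

	fmt.Println(bytes.Equal(rowKey, famKey)) // true: identical byte strings
}
```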

Given the unreliability of EnsureSafeSplitKey, it also is not included in the API to split a range -- AdminSplit -- in a way that would act as a failsafe or guardrail against callers sending bad splits. AdminSplit has no way to know whether a proposed split is valid or invalid -- according to the higher-level SQL logic that says a split between two column families is invalid -- since it cannot tell whether a given key is or is not between two families (consider the /Table/101/1/us-east/3/1 example above).

One potential solution to this problem would be to use a new, distinct encoding for column families that is not used by any other data type encoding. This would allow the split-creation facility to verify that a candidate split key does not include a column family by stepping through each piece of the key using PeekType and PeekLength until it reaches the end; if PeekType for any component of the key -- even one followed by more bytes -- reports the column family marker type, the key is known to be a bad split, without needing to know the schema or whether EnsureSafeSplitKey was called.
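Sketched below with the existing PeekType/PeekLength helpers from pkg/util/encoding; encoding.ColumnFamily is a hypothetical new Type value that does not exist in the tree today:

```go
// isSafeSplitKey walks a candidate split key component by component.
// encoding.ColumnFamily is an assumed new type tag; PeekType and
// PeekLength are existing helpers in pkg/util/encoding.
func isSafeSplitKey(key []byte) (bool, error) {
	for len(key) > 0 {
		if encoding.PeekType(key) == encoding.ColumnFamily { // hypothetical tag
			return false, nil // the key addresses a column family: bad split
		}
		n, err := encoding.PeekLength(key)
		if err != nil {
			return false, err
		}
		key = key[n:]
	}
	return true, nil
}
```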

One challenge to using a new encoding is performance: the varint encoding reserves a large number of values of the marker byte so that it can represent several small integer values directly in the choice of marker byte, without needing additional value bytes. Typical column family IDs, which start at 0 and are mostly bounded by the number of columns in a single table, as well as the lengths of such IDs, fit in these small, marker-represented values, and thus most column families are represented by single-byte varints. A new marker byte type, however, would need to use a separate byte in addition to the varint.
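For reference, a quick check with the real helper shows the single-byte property for small values (the value is folded into the marker byte itself):

```go
// Small integers occupy a single byte in the ascending uvarint
// encoding rather than a marker byte plus trailing value bytes.
b := encoding.EncodeUvarintAscending(nil, 3)
fmt.Println(len(b)) // 1
```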

One optimization is to reserve a separate marker byte for column family 0 -- far and away the most common one -- distinct from the marker used for families > 0, so that the zero family continues to be representable by a single marker byte (now CF0 instead of Int0). In this scheme, a column family key in a table with many families might look like /Table/101/1/us-east/david/3/1/CF, where the column family ID is 3, the length of that ID is 1, and the CF marker byte sits immediately after the length. For a common table with a single zero-ID family, the key would look like /Table/101/1/us-east/david/CF0, where the CF0 marker byte -- distinct from the CF byte -- takes the place of both the family ID and the length.
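A sketch of that proposed suffix, with illustrative marker byte values (the concrete values here are made up and would need to come from the encoding package's reserved tag space):

```go
// Hypothetical marker bytes for the proposed encoding; not real tags.
const (
	markerCF  byte = 0xFE // terminates a non-zero family ID + length
	markerCF0 byte = 0xFD // stands alone for the common zero family
)

// appendFamilySuffixProposed sketches the proposal: family 0 becomes a
// single CF0 byte; other families keep today's varint ID and length,
// followed by the CF marker byte.
func appendFamilySuffixProposed(key []byte, famID uint64) []byte {
	if famID == 0 {
		return append(key, markerCF0)
	}
	size := len(key)
	key = encoding.EncodeUvarintAscending(key, famID)
	key = encoding.EncodeUvarintAscending(key, uint64(len(key)-size))
	return append(key, markerCF)
}
```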

Another challenge of this approach is that existing indexes cannot use a new encoding: uniqueness checks depend on there being a single canonical row-key byte string for a given tuple of key values, and that byte string includes any encoded column family information. Thus this new column family encoding would need to be used in, and only in, newly created indexes, which would use it exclusively. This is relatively easily achieved by persisting the choice to use it in the descriptor. As long as some new indexes use the new encoding in testing, they should allow detection of code paths that can result in sending bad splits, allowing those paths to be fixed even for older indexes that cannot use the new encoding.

Along these lines, this new encoding could potentially be enabled only for testing clusters and still deliver the majority of the value in that it would allow loudly and quickly identifying paths that send bad splits.

Jira issue: CRDB-32558

dt commented 8 months ago

I played around a little with this and it might not be that messy to introduce a new encoding tag for col families?

https://github.com/cockroachdb/cockroach/compare/master...dt:cockroach:wip/splits?expand=1

mgartner commented 5 months ago

Moving to 24.2 bucket for now—let's chat about this as a possible quality-related project.