
kv,*: non-contiguous ranges #65726

Open ajwerner opened 3 years ago

ajwerner commented 3 years ago

NOTE: this issue is a speculative exploration that I do not expect to be pursued for a long time.

Is your feature request related to a problem? Please describe.

Cockroach was designed to be a scalable database which can handle large volumes of data. However, there was an assumption that the data would be scattered over a number of ranges which was (much) larger than the number of tables (or, say, partitions which carry their own configuration).

The problem at hand is that a client can force the system to carry different configurations for different spans of data. For example, imagine a client creates 1000 tables, each of which is partitioned using REGIONAL BY ROW syntax, each of which contains 1 secondary index (on average), and the database contains 5 regions. In that setting, the schema would require at least 10,000 (1000 × 2 × 5) ranges: each table has two indexes, each index has five partitions, and those partitions carry different constraints. Any adjacent key span with a different configuration implies a need to split. This is true even if these tables contain zero bytes. The overhead of having a range is non-trivial; each range requires background work and memory. See Additional context for why this scenario is actually quite common. In practice, there are only 5 unique configurations in use throughout the entire keyspace.
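To make the arithmetic concrete, here is a back-of-the-envelope sketch (plain Go, not CockroachDB code; the constants just restate the scenario above): every partition whose configuration differs from its key-space neighbor forces a split, so the minimum range count scales with tables × indexes × partitions even though only five distinct configurations exist.

```go
package main

import "fmt"

func main() {
	const (
		tables           = 1000 // REGIONAL BY ROW tables
		indexesPerTable  = 2    // primary index + one secondary index
		partitionsPerIdx = 5    // one partition per region
		uniqueConfigs    = 5    // one replication config per region
	)

	// Adjacent key spans with different configurations must live in
	// different ranges, so each partition needs its own range today.
	minRangesToday := tables * indexesPerTable * partitionsPerIdx

	fmt.Printf("minimum ranges today:    %d\n", minRangesToday) // 10000
	fmt.Printf("distinct configurations: %d\n", uniqueConfigs)  // 5
	// With non-contiguous ranges, spans sharing a configuration could in
	// principle be packed into far fewer ranges (bounded by size, not by
	// configuration boundaries).
}
```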

Describe the solution you'd like

The proposal in this issue is that we extend the range structure to allow a range to consist of multiple, discontiguous spans. This proposal is not without precedent; Spanner allows precisely this (even if motivated differently, see Additional context). Quoting the Spanner paper:

A directory is the unit of data placement. All data in a directory has the same replication configuration. When data is moved between Paxos groups, it is moved directory by directory, as shown in Figure 3. Spanner might move a directory to shed load from a Paxos group; to put directories that are frequently accessed together into the same group; or to move a directory into a group that is closer to its accessors.

The fact that a Paxos group may contain multiple directories implies that a Spanner tablet is different from a Bigtable tablet: the former is not necessarily a single lexicographically contiguous partition of the row space. Instead, a Spanner tablet is a container that may encapsulate multiple partitions of the row space. We made this decision so that it would be possible to colocate multiple directories that are frequently accessed together.

Meta layout and KV changes

The changes required to support non-contiguous ranges are surprisingly small:

1) Change the RangeDescriptor to have repeated start_key and end_key (or change it to have a repeated spans field) as parallel arrays.

2) Change the meta encoding.

Currently range descriptors are stored in two places: meta2 and a range-local replicated key. These descriptors are addressed by the start key in the range-local space and (for horrible reasons) by the end key in meta2. The value in both places is updated transactionally. If we made ranges non-contiguous, we would not want to store another copy of the descriptor to keep in sync (it would also bloat the space). Instead, we'd store in meta, at the end key of each subsequent span (i.e. every span but the lowest), a pointer to the start key of the lowest span.

This would mean that resolving range addressing may require one more hop, but that's fine.

I'm not sure there's much more to it. There's some bounds checking which would need to be dealt with for certain data structures (think rditer).
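To make (1) and (2) a bit more concrete, here is a minimal sketch of the shapes involved. All names are hypothetical: the real RangeDescriptor is a protobuf message in roachpb and the actual meta2 encoding differs; this only illustrates the "one descriptor plus pointer entries" idea and the extra addressing hop.

```go
package kvsketch

// Span is a [Start, End) key interval.
type Span struct {
	Start, End []byte
}

// MultiSpanRangeDescriptor is a hypothetical descriptor in which one range
// owns several non-overlapping, possibly non-adjacent spans. Spans is kept
// sorted; Spans[0] is the "lowest" span, and the single authoritative copy
// of the descriptor is the one addressed via that lowest span.
type MultiSpanRangeDescriptor struct {
	RangeID  int64
	Spans    []Span
	Replicas []int32 // store IDs standing in for the real ReplicaDescriptors
}

// MetaEntry models what a meta lookup could return: the full descriptor for
// the lowest span, or, for every other span, just a pointer to the lowest
// span's start key so that only one copy needs transactional updates.
type MetaEntry struct {
	Desc            *MultiSpanRangeDescriptor // set for the lowest span
	LowestSpanStart []byte                    // set for all other spans
}

// Resolve models range addressing with the extra hop: if the entry covering
// key is only a pointer, follow it to the descriptor stored for the lowest
// span. lookup stands in for a meta scan returning the entry covering a key.
func Resolve(lookup func(key []byte) MetaEntry, key []byte) *MultiSpanRangeDescriptor {
	entry := lookup(key)
	if entry.Desc != nil {
		return entry.Desc
	}
	return lookup(entry.LowestSpanStart).Desc // at most one extra hop
}
```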

Allocation Decisions

The biggest problem with this whole scheme is deciding when to merge non-contiguous spans. The merge queue today only ever considers the immediately adjacent range for a merge. With this change, the merge queue could consider all ranges whose left-most span is earlier than the current range's and which carry the same configuration.

Importantly, there would be zero involvement of the SQL layer in dictating or controlling this merging; it would be constrained to KV entirely.

There's a lot of sophistication that could be used to make very good decisions here and to achieve the goal laid out by Spanner. Fortunately, I believe there's some hope that dumb solutions will lead to good results. Namely, if all of the tables in a given region would fit into a single range, then a greedy merge solution would co-locate them. That would mean that the greedy solution might co-locate secondary index partitions with their primary index partitions. That'd be wonderful.
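As one illustration of how dumb the first cut could be, here is a sketch of a greedy packing pass. Everything here is hypothetical (the types, the size threshold, planning in one batch); a real merge queue would also have to weigh load, hotness, and per-range span limits.

```go
package kvsketch

import "sort"

// ConfigID identifies a replication configuration (e.g. one per region).
type ConfigID string

// candidate is a hypothetical view of a range for merge planning.
type candidate struct {
	rangeID   int64
	configID  ConfigID
	sizeBytes int64
	lowestKey string // start key of the range's lowest span
}

// planGreedyMerges groups ranges by configuration and, within each group,
// greedily packs key-ordered ranges together while they fit under
// maxRangeBytes. It returns groups of range IDs that could be merged into a
// single (possibly non-contiguous) range.
func planGreedyMerges(cands []candidate, maxRangeBytes int64) [][]int64 {
	byConfig := map[ConfigID][]candidate{}
	for _, c := range cands {
		byConfig[c.configID] = append(byConfig[c.configID], c)
	}

	var merges [][]int64
	for _, group := range byConfig {
		sort.Slice(group, func(i, j int) bool { return group[i].lowestKey < group[j].lowestKey })
		var cur []int64
		var curSize int64
		for _, c := range group {
			if len(cur) > 0 && curSize+c.sizeBytes > maxRangeBytes {
				if len(cur) > 1 {
					merges = append(merges, cur) // flush a full group
				}
				cur, curSize = nil, 0
			}
			cur = append(cur, c.rangeID)
			curSize += c.sizeBytes
		}
		if len(cur) > 1 {
			merges = append(merges, cur)
		}
	}
	return merges
}
```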

One risk is that we create a fragmented keyspace with lots of spans packed into very few ranges which, if shuffled around, could instead have been a similarly small number of contiguous ranges. I don't have intuition for the ways in which this fragmentation is likely to arise. Perhaps some limits on the number of spans in a range would help.

Another problem is the split queue and split decision-making. Today we split far more eagerly than we need to: where there are separate partitions or separate tables which are contiguous, we split them even when they carry an identical config. We hope to address this as part of the project described in https://github.com/cockroachdb/cockroach/pull/62128.

Describe alternatives you've considered

Another way to reduce the number of ranges is to page out their raft groups entirely to some sort of external storage. This can work if the data is mostly cold and not being written to.

Additional context

Schema-per-customer (sometimes called multi-tenancy)

A common paradigm for building SaaS products is to isolate individual customers with their own set of tables. We've seen this come up a number of times. Today this approach is completely stymied by the bottlenecks involved in creating such a setup (https://github.com/cockroachdb/cockroach/issues/63206). Soon we'll hopefully fix these bottlenecks and unlock the possibility of clients actually telling the system to create these very large numbers of ranges. Once that happens, I anticipate we'll be exposed to a new class of problems due to having large numbers of ranges.

Interleaved tables and colocated data

In https://github.com/cockroachdb/cockroach/issues/52009 we decided to remove interleaved tables, a feature which allowed users to colocate data and potentially achieve serious access-locality benefits from that colocation. In practice, these optimizations didn't work out well for SQL and carried quite a bit of complexity. Nevertheless, access locality is an important property. If we had a smart (likely global) placement driver, we could place segments of indexes which are accessed together in the same range.

Jira issue: CRDB-7732

bdarnell commented 3 years ago

This also reminds me of copysets: #25194.

jbowens commented 3 years ago

It seems like this would also be useful for implementing 'instant cloning' of tables or schemas. If you want to quickly make a copy of a table, adjust all of its ranges to include the same spans with the new table's prefix. Within Pebble, update metadata so that all sstables overlapping the old table's keyspace are valid for the new table's keyspace, and adjust the keys on the fly when reading from the sstable.
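A rough sketch of the on-the-fly key adjustment being described, assuming prefixes are opaque byte strings (real CockroachDB keys use the table/index key encoding, and the actual change would live in Pebble's sstable read path):

```go
package kvsketch

import "bytes"

// rewritePrefix models surfacing a key read from an sstable under the old
// table's prefix to readers of the cloned table under the new prefix. It
// returns false if the key does not belong to the old table's keyspace.
func rewritePrefix(key, oldPrefix, newPrefix []byte) ([]byte, bool) {
	if !bytes.HasPrefix(key, oldPrefix) {
		return nil, false
	}
	out := make([]byte, 0, len(newPrefix)+len(key)-len(oldPrefix))
	out = append(out, newPrefix...)
	out = append(out, key[len(oldPrefix):]...)
	return out, true
}
```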

knz commented 3 years ago

cc @mwang1026 for tracking - we're not thinking about prioritizing this right away, but it should guide our thinking for various design discussions that are going to happen in the coming months.

ajwerner commented 3 years ago

cc @vy-ton this is ultimately going to be the next bottleneck for "schema scalability" after we clear up the zone config one.

nvanbenschoten commented 3 years ago

This also reminds me of copysets: #25194.

Do you mind expanding on this? I don't think I see the parallels.

An alternative to this idea is to re-organize the SQL table encoding to make currently non-contiguous data contiguous. This is similar to @bdarnell's proposal about inverted storage of column families in https://github.com/cockroachdb/cockroach/issues/42038. The benefit of this approach is that it avoids pushing complexity throughout KV (the Allocation Decisions complexity seems like the tip of the iceberg) and then needing to expose a richer API through KV to manipulate this new flexibility. But thinking on this more, it's also analogous to interleaved tables. So the downside with this alternative is that we re-introduce the complexity we sought to remove when we removed interleaved tables.

I understand most of the motivation here, but it hasn't quite clicked yet for me how introducing non-contiguous ranges would eliminate complexity vs. interleaved tables, instead of just shifting that same complexity from one layer to another.

ajwerner commented 3 years ago

I understand most of the motivation here, but it hasn't quite clicked yet for me how introducing non-contiguous ranges would eliminate complexity vs. interleaved tables, instead of just shifting that same complexity from one layer to another.

The data locality possibility is secondary for me. Like, with a smart placement driver, it's possible to get colocation. But that's not my primary motivation; I raise it primarily because Google was raising it. However, that points at something really important: nothing breaks if the lower level does a bad job placing data accessed together in the same range. There is a downside that you might lose some ability to scan contiguous data. However, with some sanity checks, some stats, and some bookkeeping about the size of each individual span, it doesn't seem so crazy to add heuristics about when to split off a span.

The driving motivation for my typing this issue is the number of ranges, not the co-location of data.

The thing with interleaving is that it was tightly coupled to the implementation; everything needed to know about it for anything to work. It's the loose coupling here that makes this proposal so much better. Consider distsql and its affinity for data placement: it's nice because if we mess it up, ~nothing breaks. We never got all that sophisticated with how we do that placement, and yet, I think, it has served us quite well. Here too I think we can open the door to a loosely coupled performance optimization that might provide great value to the design of cockroach long before we build a sophisticated placement driver.

I want to highlight again that REGIONAL BY ROW tables in the same database are likely to have non-contiguous partitions that share a configuration. Even if we did absolutely nothing to encourage co-location of data based on the properties of the data itself, we'd potentially see a huge benefit.

ajwerner commented 3 years ago

Just to avoid confusion, I added more concrete language to say that the proposal here is not for SQL to dictate any merging of these non-contiguous ranges, or even to provide any mechanism by which it might propose such merging. This proposal would be entirely invisible to SQL. The idea is that the merge queue would be able to merge ranges that are not adjacent but do have the same config.

bdarnell commented 3 years ago

This also reminds me of copysets: #25194.

Do you mind expanding on this? I don't think I see the parallels.

A copyset is a set of ranges that have the same members. We've been thinking of this mainly through a lens of fault tolerance and correlated failures, but once you have defined groups of ranges in this way you could potentially do more radical things like decoupling the raft log from the materialized KV state and letting all the ranges in a copyset share one raft log (and one heartbeat). That starts to approach "non-contiguous ranges" from the other direction: it's multiple ranges, but we start to consolidate things that contribute to the per-range overhead.

For any variation of these ideas, I think there are a lot of empirical questions about what exactly is providing overhead in the current range implementation. A non-contiguous range is going to be more expensive in some ways than current ranges (for a trivial example, its descriptor will be larger because it has more keys). We need to make sure there is enough net savings to justify the complexity.

ajwerner commented 3 years ago

For any variation of these ideas, I think there are a lot of empirical questions about what exactly is providing overhead in the current range implementation. A non-contiguous range is going to be more expensive in some ways than current ranges (for a trivial example, its descriptor will be larger because it has more keys). We need to make sure there is enough net savings to justify the complexity.

I totally agree. One consideration is memory. This one matters, to be sure. We'll be increasing the memory footprint by at least the size of the new spans and likely MVCC stats per span. Then again, if all of those replicas were on the same host, you'd have the same size problems; I cannot think of a case where non-contiguous replicas would have a larger memory footprint than the same number of contiguous ones. Most other data structures will be shared and are not span-oriented. The biggest savings, as far as I'm concerned, are the opportunities for improved batching, a reduced need for goroutines for the foreground replication work, and the reduction in background work for queues.

Today our worst problems have to do with O(ranges) synchronous operations. The one that most comes to mind is zone config updates via gossip events (https://github.com/cockroachdb/cockroach/issues/63206). Hopefully as we replace that system, we'll come up with something algorithmically better.

This issue should be a wait-and-see sort of thing. Currently our bottlenecks have more to do with the # of bytes and the zone config badness. While I anticipate that the mass proliferation of REGIONAL BY ROW tables and their automatic partitioning of secondary indexes, combined with tenants, is going to lead to an explosion of ranges, we'll know soon enough how pressing a problem that ends up being.

moonsphere commented 3 years ago

Maybe supporting non-contiguous ranges can also help TPCC? For example, it could put one warehouse's order records and order_line records in the same range and use one-phase commit even though they live in different tables' partitions?

bdarnell commented 3 years ago

Another way to look at this could be to reify the raft logs themselves. Right now, there is a 1:1:1 relationship between key spans, ranges, and raft logs. This issue proposes to give each range multiple key spans, which preserves the 1:1 relationship between ranges and raft logs. What if instead we preserved the 1:1 relationship between key spans and ranges and made the relationship between ranges and raft logs many-to-one? This feels conceptually cleaner to me than non-contiguous ranges. We could remove replica membership from the range descriptors - a range descriptor would instead point to a raft log descriptor, and the raft log descriptor would contain the members. (OTOH, there's a lot of peril in that change given the privileged position of the range descriptor in a lot of our "special" transactions).
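To illustrate the decomposition being floated here, a hypothetical pair of descriptors (names invented for this sketch; nothing like this exists today): the range keeps a single contiguous span but no longer carries membership, and instead points at a shared raft log descriptor that owns the members.

```go
package kvsketch

// RaftLogDescriptor describes one replication group shared by many ranges.
type RaftLogDescriptor struct {
	LogID    int64
	Replicas []int32 // store IDs standing in for the real ReplicaDescriptors
}

// RangeDescriptorV2 keeps the 1:1 relationship between key spans and
// ranges, and makes the ranges-to-raft-logs relationship many-to-one via
// LogID rather than embedding membership in the range itself.
type RangeDescriptorV2 struct {
	RangeID          int64
	StartKey, EndKey []byte
	LogID            int64 // which shared raft log replicates this range
}
```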

ajwerner commented 3 years ago

OTOH, there's a lot of peril in that change given the privileged position of the range descriptor in a lot of our "special" transactions.

These range descriptors are also the mechanism by which we determine the locations for pieces of data using the meta ranges. I'm having a hard time envisioning how resolving keys to nodes/stores would work in the world where the concept of a range were to become decoupled both from membership and from that indexing structure. Would the idea be that we'd put raft log descriptors in the meta tables?

Making the range conceptually something other than a state-machine replicated object driven from a single log feels like a big departure in terms of mental model.

This feels conceptually cleaner to me than non-contiguous ranges. We could remove replica membership from the range descriptors - a range descriptor would instead point to a raft log descriptor, and the raft log descriptor would contain the members.

One thing that this seems to give up on is the ability to perform 1PC on these writes. The point of the transaction protocol is to coordinate atomicity across independent state machines. If we don't have atomicity of writes, then I have to question what we're doing. I suppose we could tightly couple it all to try to make claims about atomicity between these logs due to some shared lease. I'm not convinced that that ends up being conceptually cleaner.

Stepping back, can you unpack what is conceptually unclean about re-defining a range from being a state-machine replicated set of data for a contiguous span to being a state-machine replicated set of data for a set of spans? Is it that the word range now feels wrong? I buy that, but it feels like that one can be rectified with a terminology shift; we could call them tablets instead and everything would go back to making sense.

bdarnell commented 3 years ago

Would the idea be that we'd put raft log descriptors in the meta tables?

Yes, or at least that's my initial half-baked (make that quarter-baked) idea.

If we don't have atomicity of writes, then I have to question what we're doing

Agreed. I wasn't really thinking this through and assumed 1PC writes would kind of just work but now I see that's definitely not true. The problems might be solvable but not easily.

Is it that the word range now feels wrong?

I think that's a big part of it. It also further overloads the "range" construct - the set of spans in a range and the set of replicas of the range will now all be managed by one object and interact in complex ways. I'm not sure how much splitting the object up would help with that as opposed to just moving the complexity around, though.

Anyway, this was just a half-baked idea that occurred to me when this thread resurfaced in my inbox, so no need to spend more time on it.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!