cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: remove interleaved tables #52009

Closed jordanlewis closed 3 years ago

jordanlewis commented 4 years ago

Interleaved tables are a mixed bag. Their theoretical pros are:

In reality, they have a large cons list:

We should remove interleaved tables altogether from CockroachDB, and that's what this issue tracks.

Primary Key Changes gives us an offramp for people who use interleaved tables. The procedure is that you should alter the primary key of all of the child interleaved tables to no longer be interleaved, one by one, until there are no interleaved children left.
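As a rough sketch of what that offramp looks like in practice (the customers / orders schema here is hypothetical), each interleaved child gets its primary key re-declared without the interleave clause:

```sql
-- A pre-existing interleaved child, roughly as it would have been declared:
--   CREATE TABLE orders (
--       customer_id INT,
--       order_id    INT,
--       total       DECIMAL,
--       PRIMARY KEY (customer_id, order_id)
--   ) INTERLEAVE IN PARENT customers (customer_id);

-- Re-declaring the same primary key without an INTERLEAVE clause rewrites
-- the table as a regular, non-interleaved table.
ALTER TABLE orders ALTER PRIMARY KEY USING COLUMNS (customer_id, order_id);

-- Repeat for each interleaved child until no interleaved children remain.
```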

Here's a proposal, using input from the schema team and others:

Epic: CRDB-1582

nvanbenschoten commented 4 years ago

the code today doesn't actually coalesce the writes of these bulk inserts into a single batch anyway, so you get no actual performance gain

This isn't quite true. Even if writes aren't coalesced into a single consensus round or even a 1PC transaction, there is still a large benefit to having all of a transaction's intents on the same range. In that case, intent resolution is heavily optimized. The KV layer is able to resolve all of a transaction's intents and clean up its transaction record in a single consensus round. This reduces the amount of load on the system and reduces the transaction's contention footprint.

ajwerner commented 4 years ago

the code today doesn't actually coalesce the writes of these bulk inserts into a single batch anyway, so you get no actual performance gain

This isn't quite true. Even if writes aren't coalesced into a single consensus round or even a 1PC transaction, there is still a large benefit to having all of a transaction's intents on the same range. In that case, intent resolution is heavily optimized. The KV layer is able to resolve all of a transaction's intents and clean up its transaction record in a single consensus round. This reduces the amount of load on the system and reduces the transaction's contention footprint.

This is a great point. My counterpoint pushes more towards the future and how changes on the horizon may mitigate the importance of this optimization.

Today it is quite important that we issue intent resolution promptly after transactions commit in order to reduce their contention footprint. One proposal for reducing that footprint is the idea that we could separate the in-memory process of intent resolution from the durable process of intent resolution. In that world we could imagine waiting much longer to build up batches of intents for resolution, mitigating the overhead of individual command evaluation and replication.

The above point still does depend on RPCs on the order of the number of ranges involved but we might, in that world, under relatively high load (the case where this matters), be able to issue even fewer intent resolution batches.

andreimatei commented 4 years ago

I'd be curious to see some join speed numbers for different join operators for interleaved tables vs non-interleaved and non-collocated leaseholders. One area that I think we need to make progress on is control over leaseholder collocation for correlated data. An interleaved table is one option for collocating, and I feel like it'd be a pity to lose it and not replace it with anything. I think that the latency of all sorts of things jumps a lot when one goes from 1 node to 3 nodes because the collocation is lost, and it kinda sucks. Yugabyte, for example, lets you collocate different tables (although, in their case, it's ostensibly for reducing the number of ranges, not for join performance).

ajwerner commented 4 years ago

Yugabyte, for example, lets you collocate different tables (although, in their case, it's ostensibly for reducing the number of ranges, not for join performance).

My reading here is that this only works if you don't split.

I'd be curious to see some join speed numbers for different join operators for interleaved tables vs non-interleaved and non-collocated leaseholders.

This seems like a requirement before moving forward here.

ajwerner commented 4 years ago

@awoods187 have you gotten any fresh insights on user perspectives here?

awoods187 commented 4 years ago

Two users are fine with removing this provided we give them enough time, we show them a migration path, and performance doesn't suffer. I have one big remaining user to talk to; I'm waiting on time to be scheduled.

awoods187 commented 4 years ago

Updating the issue--I'd like to see a series of performance investigations that seek to disprove our hypothesis that the same performance can be achieved without interleaving before we remove them. We need to take into account active workloads, inserts, deletes, etc.

jordanlewis commented 3 years ago

Closing the loop, we've benchmarked interleaved tables and found that they indeed provide only negligible performance improvements. We'll be moving forward with the plan to remove interleaved tables in the coming releases.

abernard commented 3 years ago

Respectfully, I am very disappointed this feature was removed. It's probably going to be a deal-breaker for me using cockroachdb for my use case. I suppose it's good this happened before implementation.

The actual problems listed are not inherent design flaws, they're implementation issues. Meanwhile, the benchmarks in issue #53455 are far too small to test anything useful. 10M rows is, to put it mildly, not a large number. 1B rows, on a normal Postgres instance or what have you, is still not a large number.

Where this issue is going to be a serious problem is when you have FK joins across large tables. Interleaving effectively removed the FK join, because consistency could be guaranteed within a key range like a NoSQL database. It could also be guaranteed with the niceties of SQL typing.

Removal of this feature makes CockroachDB an entirely different database from what was advertised. Interleaving allowed data integrity and schemas to be colocated in data space. That this supposedly provided no real benefit (of which I'm still not convinced, because the benchmark is 3 orders of magnitude too small to be relevant) is unfortunate.

Features like this one are precisely the reasons one would use a reliable distributed database like CockroachDB. If we are being honest, Postgres is far more performant than CockroachDB, but it has issues scaling once you go > 1TB. Even there, solutions like Aurora will let you scale out and they still have the FK join problem because the low-level storage model can't colocate tuples across tables. On large node counts, the FK join becomes even a larger problem, and I don't see that addressed. The Google Doc is locked so I am unable to see the results if that was tested.

To summarize:

1) The benchmark tests something 2-3 orders of magnitude smaller than what would be feasible on a single Postgres node.
2) The benchmark does not test quantities of data that would require a scalable database.
3) The benchmark does not test the impact of this feature under high node counts.
4) This makes the use case of CockroachDB as a typed, SQL-aware yet NoSQL storage model intractable. (It can be done with multiple hops now, but at the expense of data integrity... once again, a core marketed feature of CockroachDB. Multi-round hops to simulate FK joins are probably the largest source of data problems I've seen in my career.)

ajwerner commented 3 years ago

Let's work together to digest your concerns. There are a few things here that I don't really understand. I hope I can convince you that the thing which was removed doesn't prevent cockroach from providing the wonderful things you seem to be looking for, and also that it wasn't really adding much.

Since this issue was filed, we've worked with a number of customers which had used this feature to de-interleave. None have encountered any performance or scalability problems. Most saw improvements.

All that being said, of course access locality is a good thing and of course improving it would be nice. The feature was removed largely because its implementation was very heavy for both customers and implementers while providing not a lot of benefit. There were multiple problems with the interleaving implementation:

So, my pitch is that if we're going to do something magical to co-locate data accessed together, then we should do it in a way that transparently benefits lots of use cases and doesn't infect tons of layers with its concerns. The design I have in mind presently for that is https://github.com/cockroachdb/cockroach/issues/65726.

abernard commented 3 years ago

I appreciate the response. I will explain, and I hope I am wrong.

I find myself confused by language related to "multi-round hops". Any time cockroach could previously have performed the "interleaved join" it today can (and does) parallelize the scans to separate indexes (and these scans happen at a consistent snapshot).

Specifically, let's say you have a few tables. products, options, listings. This is an ecommerce example where products have lots of different parameters, and you sell them on multiple markets that have different properties. It is not a hypothetical example.

Let's say each of those tables is 1TB. For my purposes, this would be at least an order of magnitude larger. But let's start there. It would also be more than a dozen typed tables with joins, not just 3.

From my understanding, if I want to have FK joins on those 3 tables, it has to take a lock, go fetch results from 3 different ranges, and then combine them. Notably, these 3 different 256 MB ranges (or index ranges) do not need to be on the same server. With interleaving, they clearly are (absent the exception where the interleaved rows span a range boundary). With the proposed changes, if I have 12 joins, I have 12 hops. With interleaving, at least as I understand it, if I wanted to hydrate all the data under the "ancestor" PK that the foreign keys are all joined on, which is product in this example, it's just a seq-scan on disk on one server. This is the Dynamo model, where I deliberately am willing to give up performance on filtering over many indexes; instead I'd like scans or bulk writes under a PK to be optimized.
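To make that concrete, here is a rough sketch of how the old interleaved layout would have expressed this schema (the column definitions are invented for illustration):

```sql
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name       STRING
);

CREATE TABLE options (
    product_id INT,
    option_id  INT,
    value      STRING,
    PRIMARY KEY (product_id, option_id)
) INTERLEAVE IN PARENT products (product_id);

CREATE TABLE listings (
    product_id INT,
    listing_id INT,
    market     STRING,
    PRIMARY KEY (product_id, listing_id)
) INTERLEAVE IN PARENT products (product_id);

-- With this layout, a product row and all of its options and listings rows
-- sort next to each other in the keyspace, so hydrating one product is
-- (barring range splits) a contiguous scan on a single range.
```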

This gets into my comment about "data integrity" being compromised. It's not that cockroachdb is compromised, it's that if I want to do anything sophisticated to speed up that 12-hop scenario of this tree of data, I have to do my own error prone fetching. I can't have FKs in practice because they're not colocated. If I want it to be consistent, I have to use my own locks and block on write. Maybe here, just doing a SELECT FOR UPDATE type of lock on the parent is fine, and I can get away with the performance hit for consistency. But in general, I think that Dynamo type sweep of the "tree", not rows, is useful.
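A minimal sketch of that SELECT FOR UPDATE approach, reusing the hypothetical schema above:

```sql
BEGIN;

-- Lock the parent row so concurrent writers to this product's "tree"
-- serialize behind this transaction.
SELECT product_id FROM products WHERE product_id = 42 FOR UPDATE;

-- Read and/or modify the child rows under that product.
SELECT * FROM options  WHERE product_id = 42;
SELECT * FROM listings WHERE product_id = 42;

COMMIT;
```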

This is the first time I've heard anybody talk about Aurora as having a join problem due to its storage model. I've heard complaints about big joins there because you can't scale the compute on the master node and you can't scale the read replicas all that wide (16, right?). Do you have anything you can cite about storage access locality being a problem in Aurora?

I have my own experience but I can explain why this is problematic and why almost no one uses FKs at scale. Which is why it's so neat that CockroachDB does!

Aurora works at the physical storage level of disk. It is not a fully logically sharded database like something like Citus or Vitess. If I have a table A and I have a table B, Aurora has no way of colocating rows in common from A and B. It simply cannot know because the way it shards tables does not use the logical data model. It then has to use locks across the data shards on different servers to maintain consistency. In a storage analogy, it's the equivalent of a SAN, whereas a sharded database would be like a NAS. This is one of many use cases where Aurora is actually slower than MySQL/Postgres proper. Most organizations that end up on Aurora loosen restrictions on their queries and use it for point lookups, simple range scans, or slower ETL jobs. Anything in between a point lookup and an ETL can have interesting performance problems.

A lot of the benefit of interleaving comes from the ability to write to a single location. Reads matter, but it's possible writes matter more. The problem is, in SQL, it's painful to make a single statement that issues writes to more than one table. You generally use separate statements in the same transaction.

I agree with this. For my use case, when I tested interleaved columns with cockroachdb and a real-world schema/problem over a year ago, it was very nice to do with reads though. It was much nicer than what would be required with something like Dynamo.

I'm going to explore what the performance around this is, but I think my intuition on FK scalability here is still correct. The more tables you have in FKs without interleaved storage, the more locks you're going to need around ranges. Without telling the storage to colocate tuples, there's no guarantee that some of those locks won't be terribly inefficient. If I have varying cardinalities of some of these ranges sharing PK prefixes, my intuition is the low cardinality items' ranges and indexes are going to hold up queries for everyone else.

I will look at your proposal. 👍🏻

ajwerner commented 3 years ago

I mostly don't see how scale is all that relevant. I see two ways it might come up.

Both of these are legitimate concerns. However, the old interleaved implementation would also be issuing the same 12 requests. It does have the advantage that they'll all be going to the same node and reading off the same disks, so their slowness is likely to be correlated.

With the proposed changes, if I have 12 joins, I have 12 hops.

Just to re-iterate, it's not 12 sequential hops, it's 12 concurrent requests.

From my understanding, if I want to have FK joins on those 3 tables, it has to do a lock, go fetch results from 3 different ranges, and then combine them.

Reads in cockroach are fully optimistic (no locking required). In the case where you're equi-joining these three tables on a parameter, the database will scan all three in parallel. If it was not an equi-join on a query parameter, then the interleave would not have helped you.
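Concretely, using the hypothetical products / options / listings tables sketched earlier, the kind of query being described is:

```sql
-- Each of the three scans is issued concurrently and reads at a single
-- consistent snapshot; no read locks are taken.
SELECT *
FROM products p
JOIN options  o ON o.product_id = p.product_id
JOIN listings l ON l.product_id = p.product_id
WHERE p.product_id = $1;
```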

abernard commented 3 years ago

Both of your assumptions are true for my problem. In my experience, this tail latency is highly non-linear. I am aware it's 12 concurrent requests, but the fanout element is important.

I'll break down some real numbers on this problem, which is ecommerce.

In ecommerce, it is common for merchants to have 10s of thousands to millions of products. For these products, it is common to have 100-500 "variants", and for them to sell it in 3-10 stores or channels.

This means on a single product, you could have 500 × 10 = 5K rows. A single customer in run-of-the-mill ecommerce software can have 5K × 1M = 5 billion rows. SaaS companies serve tens of thousands to millions of customers. And on disk, which generally doesn't affect the indexes, row width for fields can be absurd (Amazon's product catalog has over 25K optional fields for a single product and >10K categories).

However, the old interleaved implementation would also be issuing the same 12 requests.

So this I suppose is where my confusion lies. I do not understand the CockroachDB data model well enough, but it seems to me that the join part of interleaved tables should almost always be on the same server, because it's almost always within the same range. This would require only a lock on a single server, concurrent index scans on that server, and returning the results. Without interleaving, it seems highly unlikely that the ranges for the joins are on the same server, which would require a lock that's open for ~3 orders of magnitude longer over the network. Furthermore, it doesn't seem likely that the lock efficiency (edit: or fanout efficiency, even with optimistic locking) is linear w.r.t. either (1) table join count or (2) server count. Consider the following scenarios:

(a) 3 table joins on 3 nodes
(b) 12 table joins on 3 nodes
(c) 3 table joins on 12 nodes
(d) 12 table joins on 12 nodes

Scenario (d) has ~16x as much concurrent fanout during the period of the locks, and 4x as much time spent in network locks. This is why I am skeptical that the data size of the benchmark tests is representative enough to show the contention and blocking of ranges in a real production system. (I have seen a real production system, costing millions of dollars, fail on this problem at only about ~3x fanout and only about 1TB of data.)

assuming the slowness is independent.

This is the most precise way of describing why I believe that the benchmark is flawed. Query time can't be independent of data size or join count, or even linear. If I have 8 cores, I cannot dispatch 12 concurrent requests. Whether it's just because of cardinality differences on ranges, or the number of requests that need to be concurrent with limited core count, or the way increasing node count increases fragmentation of (definitionally non-independent!!!) joins across servers, this is at least a quadratic problem.

ajwerner commented 3 years ago

Can you teach me more about what you mean when you talk about "network locks"?

this is at least a quadratic problem.

Can you lay out for me a bit more concretely what becomes quadratic?


Anyway, on the whole, I do agree with you that if we collocated the data for the overlapping portions of the indexes then we could schedule the join processor next to the data, and that would be good. Collocation is good. We would like to do something about it, eventually. The more data we can produce about the lack of locality being a problem, the more inclined we'd be to do something about it.

abernard commented 3 years ago

The more data we can produce about the lack of locality being a problem, the more inclined we'd be to do something about it.

Let me just start by thanking you for that. What follows is not me trying to be disagreeable, but describing a problem I see with this and why interleaved tables seemed like Santa Claus to me. Today I found out 🎄 was not happening.

Can you teach me more about what you mean when you talk about "network locks"?

"locks" the way I'm using it is abused and imprecise. I understand the optimistic locking of cockroachdb to a degree. Most of these problems don't have to do with worrying about something like pessimistic 2PC over the network.

The easiest way to summarize it though would be that under modest load, you're (1) going to get serialized behavior, (2) that serialized behavior becomes much worse with join count, and (3) it becomes far worse even than that with node count because of both network latency and needing to ultimately ensure write consistency.

If I have 2 user requests a second, 4 joins per query, and 4 cores on a server, during the duration of the request I already have 8 concurrent reads (whether dispatched sync or async) and we're down to 50% or less speed of what's theoretically possible. Namely, it seems like on an interleaved table, for range scans or something like it, a query planner could decide to forgo concurrency entirely and simply scan under the parent PK on the local range. When you have 10K request/s and 32-128 nodes or something, this is material. Amdahl's law will make parallelization assumptions wrong in the real world.
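For reference, the usual statement of Amdahl's law, where p is the fraction of the work that parallelizes and n is the number of workers:

```math
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
```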

Worse, with non-interleaved tables distributed across the network, even optimistic locking is going to have issues with modest write loads. To completely make up a term, the "intersection" of possible write conflicts is O(d * j) where d is the number of data ranges and j is the number of joins. Interleaving has the virtue of making sure this intersection size is almost always O(1). Scaling up data size and nodes will make the intersection approach the limit of O(d * j_max).

The way I normally get around this problem with relational databases is partitioning, but most of them are still not good with partition-wise joins. Citus was nice because you get shard locality and partition-wise joins, so your "intersection" size is divided by shard count and then divided by partition count. My understanding of what CockroachDB's interleaving was is that it's even more local than that: a join within a single record. It seemed to be Dynamo but with schemas and no locking downsides. Dynamo, having been developed for the use case of ecommerce I mentioned, was the pattern I thought this emulated the best.

ajwerner commented 3 years ago

The easiest way to summarize it though would be that under modest load, you're (1) going to get serialized behavior, (2) that serialized behavior becomes much worse with join count, and (3) it becomes far worse even than that with node count because of both network latency and needing to ultimately ensure write consistency.

I just don't think this is true unless you're operating under overload. There are some assumptions here about lack of parallelism that I don't buy. I agree with you that if you have more load incident on the system than CPU resources available in the system, the lack of locality will lead to outsized effects on latency because of the likelihood that you encounter these resource bottlenecks. However, that's not a common operating mode of cockroach, and, to deal with it, we're investing in admission control. As you back away from perfect CPU saturation, the picture improves dramatically. Nobody should be running cockroach at sustained, full CPU saturation.

If I have 2 user requests a second, 4 joins per query, and 4 cores on a server, during the duration of the request I already have 8 concurrent reads (whether dispatched sync or async) and we're down to 50% or less speed of what's theoretically possible.

This would be, more or less, true, if it were the case that all of the work to be done to service these requests involved on-core CPU time. However, that's not at all the case. Some of the work involved relates to network IO and some relates to disk IO. In both of those cases, the way cockroach is written, we'll achieve parallelism.

Worse, with non-interleaved tables distributed across the network, even optimistic locking is going to have issues with modest write loads. To completely make up a term, the "intersection" of possible write conflicts is O(d * j) where d is the number of data ranges and j is the number of joins. Interleaving has the virtue of making sure this intersection size is almost always O(1). Scaling up data size and nodes will make the intersection approach the limit of O(d * j_max).

I think this one is the point you raise that I'm most interested in. Read-write conflicts are not a probabilistic thing, they happen because of workload behavior. The likelihood of a given statement hitting a write lock is roughly exactly the same whether the table is interleaved or not. There's a good point in here but I'm not sure it's as dire as you're making it out to be. I think what you're saying is that if there's a write-read conflict, when the reader discovers the conflict, work needs to be done to resolve it (generally by waiting for the writer). What you're pointing at is that if all of the writer's locks are in one place and the reader is reading in just one place, then there's just one request to deal with that conflict. Now, imagine that our writer has laid down j locks and our j readers encounter all of these locks. Now we'll need to send j requests to deal with these locks. That's true.

However, if every reader is encountering locks (the implication of O(d * j)) you're already liable to be having a bad time even if that work was O(1). If it's relatively rare that a reader encounters a lock, then this is still a constant factor and seems fine. The especially good news is the lightweight nature of these requests and the ability of cockroach to exploit parallelism. These waiting requests don't cost much and we batch them up as well as single-flight them. By single-flight I mean that no node will have more than one outstanding request to resolve locks for a given writing transaction at a time. That optimization matters a lot.

Also, this problem is solved by collocating overlapping index data into a single replicated state machine, as proposed in #65726.


Let me pose my response differently. What benchmark most worries you? A 12-way equi-join with what data sizes for each table and what sort of read-write mix? Maybe we can assuage your fears best with a benchmark.

abernard commented 3 years ago

Let me pose my response differently. What benchmark most worries you?

My most recent comments were to give my experiences and a use case in the event it is helpful to you. I was a software engineer who had to maintain code, and I feel your pain on needing to remove something if it isn't working. I wish you the best with efforts on the data colocation front, and I will still be using cockroachdb for my project. The use of colocated tables with something like Dynamo query patterns + schema is one I am not aware of being done with any quality. If it could be done, I would be very excited.

With respect to benchmarks, though, I just don't think I've seen a benchmark yet (not just here) that tests joins in a way that would give me confidence. CockroachDB is a beast on the TPC-C benchmark with point lookups/inserts, and it does have some joins, but it's not the complexity that would convince me naive joins are theoretically close to what is achievable with an interleaved table.

Read-write conflicts are not a probabilistic thing, they happen because of workload behavior. The likelihood of a given statement hitting a write lock is roughly exactly the same whether the table is interleaved or not.

This is the part that confuses me as well, so perhaps it's worth me explaining in more detail. Whether or not the probability is the same today, I don't believe it should be the case if interleaving is done as I have seen it done before. The interleaving docs show something that hints at a logical storage model very similar to leveldb. If we're looking at the interleaved tables, and we assume all the data for a "customer" (let's just use an example) is 100MB, it almost certainly is all in that one range on one server. Assuming a worst case "lock" of the whole key range, it's still theoretically O(1) of sorts. I can theoretically lock the range and modify the data in it without any contention outside that range or network coordination.

I don't see how this is possible at all if there's joins. In my real 12-way-join scenario, I'm going to need to consult up to 12 ranges, and do coordination. If I have an orders table, and I update a record in that range, and I'm doing a select join over customers, orders, products, etc., every in-flight query/mutation touching the relevant tuples needs to coordinate as well. While my language of "locks" is imprecise because it implies pessimism and I know cockroachdb has some nifty and innovative optimistic strategies, we still have multiple servers involved in a distributed systems problem. These costs, once again, are orders of magnitude larger over the network than a theoretical in-memory lock. With 10 million records that were tested, is the benchmark even using more than a few ranges?

To put it in a more fundamental language, joins are objectively "complex" in the Rich Hickey connected vs unconnected sort of way. They cannot be "independent" mathematically. When you interleave, they can be, because you make the data under the PK prefix "simple" and completely owned. Instead of set intersection (<-- complex), you have what amount to fields or properties.

If CockroachDB has join costs w.r.t. data size, number of servers, number of ranges that are the same as a range scan over a single key prefix, it would have accomplished a feat no other database has ever managed, even ones with no network coordination.

Postgres is the best I am aware of by far, and here are benchmarks for it. MySQL is quadratic and quite bad, and here are some examples for it. As an anecdote, last year I had to work on a single, relatively simple 3 way join query on MySQL where each table was roughly 1B rows. That query went from 1 hour to over 12 hours with around a 50% increase in table size, and it was executed on a beastly 128-core machine.

I just don't think this is true unless you're operating under overload. There are some assumptions here about lack of parallelism that I don't buy.

On this we'll just have to agree to disagree minus a benchmark. I'm aware of how IO-bound vs CPU-bound workloads work, but there are only so many threads that can be executing at once, and only so many IO streams as well. My example was just to say that it's literally impossible to have 12-way parallelism on an 8-core machine. The CPU cannot do it. I agree you can send a lot of work to I/O to come back later, but even there naive parallelism is going to saturate you quickly. Back to the interleaved example though: an interleaved range scan requires no parallelism at all. The reason it requires no parallelism is because it has no objective complexity and it is not a join (or should not be in a sufficient implementation). This was the same reason for Dynamo's flat keyspace and single-table query design, although they made different tradeoffs w.r.t. consistency.

ajwerner commented 3 years ago

If I have an orders table, and I update a record in that range, and I'm doing a select join over customers, orders, products, etc., every in-flight query/mutation touching the relevant tuples needs to coordinate as well.

Do you agree or disagree that we would not want to interleave products underneath customers? Like we could interleave orders under customers to efficiently get the set of orders for a customer, but we wouldn't want to interleave products under customers because product data doesn't have a customer as part of its primary key.
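Schematically, with hypothetical columns, the distinction is whether the child's primary key is prefixed by the parent's:

```sql
-- orders qualifies for interleaving: its primary key starts with customers' key.
CREATE TABLE customers (customer_id INT PRIMARY KEY, name STRING);

CREATE TABLE orders (
    customer_id INT,
    order_id    INT,
    total       DECIMAL,
    PRIMARY KEY (customer_id, order_id)
) INTERLEAVE IN PARENT customers (customer_id);

-- products has no customer_id in its primary key, so there is no parent
-- prefix to interleave on.
CREATE TABLE products (product_id INT PRIMARY KEY, name STRING);
```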

abernard commented 3 years ago

Do you agree or disagree that we would not want to interleave products underneath customers?

I agree with this statement. My above statement where I put products in the list of affected tables is erroneous.

I'll give a different example where we could theoretically interleave products, even if we chose not to: accounts. products does have the account or tenant as part of its primary key. This is not an anomalous type of data model either. In my example products are not shared or related between accounts; most SaaS companies use this model for how they shard databases.

I don't want to do that. I like relational databases and want to use relational modeling wherever it makes sense. But this is where interleaving provides real power. It creates branch cuts in your relational graph to simplify it. For me, all of products, variants, listings and a grab bag of metadata tables do belong directly under products, specifically through an (account_id, product_id) composite key. Large chunks of these ~5K rows associated with each (account_id, product_id) row potentially need to be modified together any time anything changes on that product. It is also the common case that most of these 5K rows need to be returned for any read operations that are user-facing. It's a very bulky, unwieldy bit of data that has no business polluting other tables, especially since certain tenants have orders of magnitude more product data than others.
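In the old syntax, and with invented column names, the hierarchy being described would have looked roughly like this:

```sql
CREATE TABLE accounts (account_id INT PRIMARY KEY);

CREATE TABLE products (
    account_id INT,
    product_id INT,
    name       STRING,
    PRIMARY KEY (account_id, product_id)
) INTERLEAVE IN PARENT accounts (account_id);

CREATE TABLE variants (
    account_id INT,
    product_id INT,
    variant_id INT,
    PRIMARY KEY (account_id, product_id, variant_id)
) INTERLEAVE IN PARENT products (account_id, product_id);

CREATE TABLE listings (
    account_id INT,
    product_id INT,
    listing_id INT,
    market     STRING,
    PRIMARY KEY (account_id, product_id, listing_id)
) INTERLEAVE IN PARENT products (account_id, product_id);
```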

Much like I would like to isolate accounts/tenants via traditional sharding to a certain database via account_id, I'd like key spaces involving products to remain isolated between accounts for resource utilization purposes as well. Interleaving allows this. Yes, cockroachdb will transparently route and do a scatter-gather in parallel fashion across nodes to fuse this data together, but it doesn't have to. I know my data model, and I don't want what I consider a type of ~1MB-5MB uber-row under product to be related to anything else but that single id and query pattern.

saurik commented 2 years ago

What follows is not me trying to be disagreeable, but describing a problem I see with this and why interleaved tables seemed like Santa Claus to me. Today I found out 🎄 was not happening.

FWIW, I awkwardly had this exact same experience, despite the feature having already been long removed by the time I even learned of its existence: I was trying to evaluate whether I should bother with CockroachDB, and in the process read an article (one about online index updates) that mentioned this feature in passing.

As I had been spending the day carefully planning out a manual data layout for my product in a more traditional key-value store, I immediately "got it" and I was like "OMG I AM SOLD: THESE PEOPLE THOUGHT OF EVERYTHING". And then I put it on my todo list to figure out how to use the feature, and thankfully noticed it wasn't there anymore a mere few (excited) hours later.

One is an assumption that every query will need to go to disk; the data sizes are large and the distributions are uniform rather than power-law. In that case, the lack of locality means that there's a multiplicative factor in the join depth on the IO cost per join.

For me, it comes down to this, FWIW: I have a large amount of data and I don't want to pay to keep almost any of it in RAM constantly. I thereby want to be able to ensure that the data "inside" of my objects (ones which I don't need to do range queries on, of course) is stored in the same SST as the parent, allowing me to do a single disk I/O to read everything I need for my display.

This was the kind of careful balancing act I was doing with my manual data layout, why I was concerned that CockroachDB's SQL approach was going to abstract me away from being able to solve it in an efficient manner, why I was so excited to see that the product had considered this use case and provided a syntax for it, and why I am so disappointed to see the feature has been removed :(.

I thereby agree that unless your test is being done under an actual load situation--which you seem to be calling "overload" but I am going to merely call "cost effective" (as, if you make the whole thing more expensive to operate by requiring more RAM, or slower to use with more I/O, then whatever level of over-capacity one needs will scale up along with it)--it isn't a valid test of the performance.

If CockroachDB has join costs w.r.t. data size, number of servers, number of ranges that are the same as a range scan over a single key prefix, it would have accomplished a feat no other database has ever managed, even ones with no network coordination.

I was really feeling this also <3. :(

(FWIW, my memory is that Cassandra actually has something similar to this with their "super columns" feature. I know they have been downplaying it a lot since the move to CQL, but I learned them back in the early days and if you know the join is going to happen and the data you are interleaving is small they served a purpose.)

(I added some thoughts in #53455, btw, as I am curious as to the discrepancy with prior benchmarks that showed performance benefits.)