cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: block less on intent resolutions, reduce transaction contention footprint #22349

Open tbg opened 6 years ago

tbg commented 6 years ago

This is a brain dump on improving command queue congestion caused by intent resolutions. I have no concrete data pinpointing this as a problem, though I suspect it can be one. In many workloads, the number of intent resolutions scales with the number of writes, so there can be a lot of them.

While an intent resolution is in flight, we block concurrent reads and writes to the affected keys. This seems overly strict. For example, consider a read that today conflicts with an in-flight resolution that attempts to COMMIT the value. Today, the read patiently waits for the value to be resolved (eating a few WAN latencies) and then reads it. But if we didn't enforce that, the read could sneak in under the resolution. It would see the intent, which does not cause an anomaly but is somewhat pointless, because the read would then try to resolve the intent again. But imagine opportunistically keeping a map (in-flight intent resolution's transaction ID -> eventual resolution status) around. Upon seeing the intent, the read could query that map and determine what the outcome will be once the intent is resolved (note that the existence of the in-flight intent resolution alone is proof that this outcome will eventually be achieved).
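The opportunistic map could look something like the sketch below. All names here (`inflightResolutions`, `begin`, `done`, `lookup`) are hypothetical, not CockroachDB APIs, and transaction IDs are modeled as plain strings for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// TxnStatus mirrors the idea of roachpb.TransactionStatus; these are
// illustrative stand-ins, not the real types.
type TxnStatus int

const (
	Committed TxnStatus = iota
	Aborted
)

// inflightResolutions is the opportunistic map: for each transaction with an
// in-flight intent resolution, it records the status that resolution will
// install once it applies.
type inflightResolutions struct {
	mu sync.Mutex
	m  map[string]TxnStatus // txn ID -> eventual status
}

func newInflightResolutions() *inflightResolutions {
	return &inflightResolutions{m: make(map[string]TxnStatus)}
}

// begin registers an in-flight resolution; done removes it once it applies.
func (r *inflightResolutions) begin(txnID string, s TxnStatus) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.m[txnID] = s
}

func (r *inflightResolutions) done(txnID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.m, txnID)
}

// lookup is what a read would call upon seeing an intent: if the intent's
// transaction has an in-flight resolution, the read already knows the outcome
// and need not block on (or re-trigger) the resolution.
func (r *inflightResolutions) lookup(txnID string) (TxnStatus, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.m[txnID]
	return s, ok
}

func main() {
	r := newInflightResolutions()
	r.begin("txn-1", Committed)
	if s, ok := r.lookup("txn-1"); ok && s == Committed {
		fmt.Println("read can treat the intent as committed")
	}
	r.done("txn-1")
	if _, ok := r.lookup("txn-1"); !ok {
		fmt.Println("resolution applied; map entry gone")
	}
}
```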

This is even more relevant with large ranged resolutions, such as those introduced in https://github.com/cockroachdb/cockroach/pull/21078, where we put large spans in the command queue that block many reads that aren't even going to see any intents.

Something similar is potentially possible for write requests running into intents, but it's tricky (and I'm much less optimistic that it's a win) because there is write skew in emitting the proper stats update: naively, you would think the write could just do the intent resolution itself and then do its write, but the real intent resolution may have evaluated as well, and the two need to synchronize and decide who will emit the MVCCStats update. The simple way to deal with this is to prepare the write under the assumption that the intent resolution applies, and then submit the write with a bit of below-raft logic that checks that there isn't an intent. But this verges into the territory of allowing general pipelining of writes (i.e. not blocking in the command queue), and that would be a much bigger win.

Perhaps it's useful here to have two command queues, where one is only used for intent resolutions and is queried on demand (i.e., when a write runs into an intent, check the intent command queue. When it's a read, don't.)
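A minimal sketch of the two-queue idea, with all names hypothetical and spans modeled as half-open string intervals (the real command queue is of course far more involved):

```go
package main

import "fmt"

// span is a half-open key interval [start, end); illustrative only.
type span struct{ start, end string }

type cmdQueue struct{ spans []span }

func (q *cmdQueue) add(sp span) { q.spans = append(q.spans, sp) }

func (q *cmdQueue) overlaps(key string) bool {
	for _, sp := range q.spans {
		if key >= sp.start && key < sp.end {
			return true
		}
	}
	return false
}

// replica holds the two queues from the comment above: ordinary requests
// synchronize on writeQueue, while intent resolutions live only in
// intentQueue, which is queried on demand.
type replica struct {
	writeQueue  cmdQueue
	intentQueue cmdQueue
}

// A read never consults intentQueue, so in-flight resolutions don't block it.
func (r *replica) readBlocked(key string) bool { return r.writeQueue.overlaps(key) }

// A write consults intentQueue only after actually running into an intent.
func (r *replica) writeBlockedOnIntent(key string) bool { return r.intentQueue.overlaps(key) }

func main() {
	var rep replica
	rep.intentQueue.add(span{"a", "z"})
	fmt.Println(rep.readBlocked("k"), rep.writeBlockedOnIntent("k")) // false true
}
```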

cc @nvanbenschoten for thoughts.

Jira issue: CRDB-5860

nvanbenschoten commented 6 years ago

I like the idea of maintaining a map from inflight intent resolution txn id to resolution status. Whenever we see an intent we can simply check this to determine whether we can safely ignore the intent or not.

The way I've pictured this in the past is that we'd actually merge this logic into the command queue itself because this gives us a convenient synchronization point and allows us to make certain guarantees that would be impossible with an opportunistic synchronized map. On each command queue cmd, we would add an intentResolutionStatus roachpb.TransactionStatus field along with a txnID for intent resolution requests. These fields would only be populated for intent resolution requests. Reads would no longer mark commands with a status of COMMITTED or ABORTED as dependencies, but would instead build up an interval tree of overlapping intent resolutions when scanning over overlapping requests (IntervalTree<(interval, txnID), status>). Let's call this the inflightIntentResolutionTree for now.
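As a rough illustration of building up the `inflightIntentResolutionTree` while scanning overlapping command queue entries, the sketch below uses a flat slice in place of the proposed `IntervalTree<(interval, txnID), status>`; every name here is made up for the sketch, not actual CockroachDB code:

```go
package main

import "fmt"

// Illustrative stand-ins for the real types.
type TxnStatus int

const (
	Committed TxnStatus = iota
	Aborted
)

type span struct{ start, end string } // half-open [start, end)

type resolutionEntry struct {
	sp     span
	txnID  string
	status TxnStatus
}

// inflightIntentResolutionTree collects the overlapping intent resolutions a
// read encounters while scanning the command queue, instead of treating them
// as dependencies. A flat slice stands in for the interval tree.
type inflightIntentResolutionTree []resolutionEntry

func (t *inflightIntentResolutionTree) add(sp span, txnID string, s TxnStatus) {
	*t = append(*t, resolutionEntry{sp, txnID, s})
}

// lookup returns the eventual status for an intent at key written by txnID,
// if a matching in-flight resolution overlaps that key.
func (t inflightIntentResolutionTree) lookup(key, txnID string) (TxnStatus, bool) {
	for _, e := range t {
		if e.txnID == txnID && key >= e.sp.start && key < e.sp.end {
			return e.status, true
		}
	}
	return 0, false
}

func main() {
	var tree inflightIntentResolutionTree
	// While scanning overlapping command queue entries, a read would skip
	// COMMITTED/ABORTED intent resolutions as dependencies and record them.
	tree.add(span{"a", "m"}, "txn-1", Committed)
	if s, ok := tree.lookup("c", "txn-1"); ok {
		fmt.Println("intent outcome already known, committed:", s == Committed)
	}
}
```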

This tree would be passed all the way down with the read request to the MVCC layer, where it would be consulted in buildScanResults for each intent observed. The question would be what to do when we see an intent that overlaps this tree. First, we'd make sure we have a matching transaction in the inflightIntentResolutionTree. If we don't, we handle the intent as usual, returning a WriteIntentError. If we do, we check whether the status is COMMITTED or ABORTED. On COMMITTED we'd return the first version of that key. On ABORTED, we'd return the second. This would require us to change MVCCScan to return the first two versions of a key when it runs into an intent, instead of no versions like it does currently.
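The version-selection rule could be sketched as follows. This assumes the hypothetical tree lookup described above, and none of these types are real CockroachDB code; `versions` models the first two versions the modified `MVCCScan` would surface, newest first, with the intent's provisional value at index 0:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the real types.
type TxnStatus int

const (
	Committed TxnStatus = iota
	Aborted
)

// keyState models what the modified MVCCScan would surface on running into an
// intent: the intent's transaction ID plus the first two versions of the key,
// newest first (versions[0] is the intent's provisional value, versions[1]
// the committed version beneath it, if any).
type keyState struct {
	intentTxnID string
	versions    []string
}

// valueForRead applies the rule from the comment: a COMMITTED in-flight
// resolution means the intent's value wins, so return the first version; an
// ABORTED one means the intent will vanish, so return the second. With no
// matching entry in the tree, fall back to the usual WriteIntentError path.
func valueForRead(ks keyState, lookup func(txnID string) (TxnStatus, bool)) (string, error) {
	status, ok := lookup(ks.intentTxnID)
	if !ok {
		return "", errors.New("WriteIntentError: handle the intent as usual")
	}
	if status == Committed {
		return ks.versions[0], nil
	}
	// ABORTED: the key may have no older version at all.
	if len(ks.versions) < 2 {
		return "", nil
	}
	return ks.versions[1], nil
}

func main() {
	ks := keyState{intentTxnID: "txn-1", versions: []string{"v2", "v1"}}
	lookup := func(string) (TxnStatus, bool) { return Committed, true }
	v, _ := valueForRead(ks, lookup)
	fmt.Println(v) // v2: the intent's provisional value becomes visible
}
```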

With respect to writes, I agree with you that the proposal "verges into the territory of allowing general pipelining of writes". For now, let's just focus on reads.

nvanbenschoten commented 6 years ago

Nothing is going to happen here for 2.1. Moving to 2.2.

andreimatei commented 3 years ago

Is this discussion still relevant, now that we have a separate lock table? We now discover more locks at once, and we also resolve more intents in batches. But I'm not entirely sure how that changes the calculus here. cc @sumeerbhola

sumeerbhola commented 3 years ago

I think what is discussed above is trying to reduce the contention footprint, analogous to what is discussed in https://github.com/cockroachdb/cockroach/issues/41720#issuecomment-549911840 once we are fully converted to separated intents. It isn't implemented yet, nor have we fully implemented discovering more locks at once (we discover multiple locks only if they are in the in-memory lock table data structure), but the plan is to do both once interleaved intents are a thing of the past.