ajwerner opened this issue 4 years ago
@madelineliao worked on her prototype of this before she wrapped up her internship (#52415, #52705). I'll take it over the finish line.
Is your feature request related to a problem? Please describe.
Currently the `txnHeartbeater` periodically heartbeats the transaction record of a transaction. It performs this heartbeating once a second (see here) with an expiration at 5s (see here). This heartbeating acts as a performance cliff: as soon as transactions take longer than 1s, each of them performs an extra write, leading to more load, which increases latency, which in turn leads to more transactions sending heartbeats.
Another thing to note is that only long-lived transactions heartbeat, but every transaction kicks off a goroutine to attempt to heartbeat (#35009). The reason for this heartbeating is so that when another transaction encounters an intent, it can determine whether the intent's transaction has expired and can be aborted.
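To make the mechanics concrete, here is a minimal, hypothetical sketch of that loop; the real `txnHeartbeater` lives in the KV client and differs in detail, and all names below are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const (
	heartbeatInterval = 1 * time.Second // one heartbeat write per second
	expirationBuffer  = 5 * time.Second // record expires 5s after its last heartbeat
)

// heartbeatLoop periodically rewrites a transaction's record so that
// other transactions encountering its intents can tell it is alive.
// If the loop stops, the record's timestamp goes stale and the
// transaction becomes eligible to be aborted.
func heartbeatLoop(ctx context.Context, sendHeartbeat func(expiresAt time.Time) error) {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			// Each tick is an extra KV write; this is the load cliff once
			// transactions outlive the heartbeat interval.
			if err := sendHeartbeat(now.Add(expirationBuffer)); err != nil {
				return
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	heartbeatLoop(ctx, func(expiresAt time.Time) error {
		fmt.Println("heartbeat; record now expires at", expiresAt)
		return nil
	})
}
```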
A key point is that nowhere in the system do we encourage or rely on holding open long-running transactions intentionally. There are good reasons why we might want to use long-running transactions (discussed below in additional context).
@nvanbenschoten I've come around to your perspective on this issue.
Describe the solution you'd like
One approach to making this heartbeating cheaper is to batch heartbeat requests from one node to another. This would likely fix the performance cliff. Say that each node has `T` heartbeating transactions whose transaction records are held on `R` ranges. Currently we'll issue `T` writes per second. With coalescing we'd move to `O(R)` writes per second (the exact number will depend on the batching settings). `R` is probably okay in the cases we care about.
The biggest work item in making this change would be to extend the `HeartbeatTxnRequest` to include the `Txn` rather than using the `Txn` off of the `BatchRequest.Header`. This would allow multiple heartbeats to be coalesced; at that point the requestbatcher (https://godoc.org/github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher) could do the coalescing.
Another, somewhat simpler, change would be to make transaction heartbeats configurable; this complements the above suggestion nicely. The heartbeat interval would be controlled at the cluster level with a cluster setting, and transactions which are intentionally held open for a long time, and for which longer timeouts can be tolerated, could set their expiration even further into the future.
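For illustration, a hedged sketch of the coalescing idea, assuming a `HeartbeatTxnRequest` that carries its own `Txn` as proposed above; the types and grouping here are simplified stand-ins, not the real requestbatcher API:

```go
package main

import (
	"fmt"
	"time"
)

// TxnMeta and HeartbeatTxnRequest are simplified stand-ins for the
// real protos. The key change: the request embeds the transaction it
// heartbeats instead of relying on BatchRequest.Header, so requests
// for different transactions can share one batch.
type TxnMeta struct {
	ID      string
	RangeID int64 // range holding this txn's transaction record
}

type HeartbeatTxnRequest struct {
	Txn TxnMeta
	Now time.Time
}

// coalesceByRange groups pending heartbeats by destination range:
// T outstanding heartbeats collapse into O(R) batched writes, one
// batch per range (modulo batching settings).
func coalesceByRange(reqs []HeartbeatTxnRequest) map[int64][]HeartbeatTxnRequest {
	byRange := make(map[int64][]HeartbeatTxnRequest)
	for _, r := range reqs {
		byRange[r.Txn.RangeID] = append(byRange[r.Txn.RangeID], r)
	}
	return byRange
}

func main() {
	now := time.Now()
	reqs := []HeartbeatTxnRequest{
		{Txn: TxnMeta{ID: "a", RangeID: 1}, Now: now},
		{Txn: TxnMeta{ID: "b", RangeID: 1}, Now: now},
		{Txn: TxnMeta{ID: "c", RangeID: 2}, Now: now},
	}
	for rangeID, batch := range coalesceByRange(reqs) {
		fmt.Printf("range %d: %d heartbeats in one batch\n", rangeID, len(batch))
	}
}
```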
A further optimization would be to centralize the scheduling of the heartbeats as opposed to spinning off a goroutine per transaction (#35009).
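A minimal sketch of what centralized scheduling could look like, assuming hypothetical names throughout: a single goroutine walks the set of registered transactions on each tick instead of each transaction running its own heartbeat goroutine.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// heartbeatScheduler is a hypothetical centralized scheduler.
type heartbeatScheduler struct {
	mu   sync.Mutex
	txns map[string]func(time.Time) error // txn ID -> heartbeat fn
}

func newHeartbeatScheduler() *heartbeatScheduler {
	return &heartbeatScheduler{txns: make(map[string]func(time.Time) error)}
}

func (s *heartbeatScheduler) register(id string, hb func(time.Time) error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.txns[id] = hb
}

// run replaces N per-transaction goroutines with a single loop;
// transactions whose heartbeat fails are dropped from the schedule.
func (s *heartbeatScheduler) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			s.mu.Lock()
			for id, hb := range s.txns {
				if err := hb(now); err != nil {
					delete(s.txns, id)
				}
			}
			s.mu.Unlock()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2500*time.Millisecond)
	defer cancel()
	s := newHeartbeatScheduler()
	s.register("txn-1", func(now time.Time) error {
		fmt.Println("heartbeat txn-1 at", now)
		return nil
	})
	s.run(ctx, time.Second)
}
```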
Describe alternatives you've considered
Another option is to add a layer of indirection: instead of heartbeating transaction records, each gateway could heartbeat some other record, working precisely the way node liveness works at the KV layer. The problems with such an approach are twofold: (1) the intent resolution protocol becomes more complicated, and (2) transactions would not learn that they had been aborted, so we'd need to add another RPC to inform them, further compounding (1).
There is something nice about this approach: it eliminates the poorly handled asynchrony of the heartbeat loop and replaces it with explicit coordination. A further downside, however, is that it moves the reads required to resolve an intent to a third location, away from the transaction record.
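To illustrate the indirection and its cost, a hedged sketch under entirely hypothetical names; none of this is an existing API:

```go
package main

import (
	"fmt"
	"time"
)

// livenessRecord is a hypothetical per-gateway record, analogous to
// node liveness: one heartbeat per gateway instead of one per txn.
type livenessRecord struct {
	GatewayID string
	Expires   time.Time // refreshed by the gateway's single heartbeat
}

// txnAbortable illustrates the extra indirection in intent
// resolution: deciding whether an intent's transaction can be
// aborted now requires reading the gateway's liveness record, a
// third location away from the transaction record. The transaction
// itself still never learns it was aborted without an additional
// RPC (the downsides discussed above).
func txnAbortable(gateway livenessRecord, now time.Time) bool {
	return now.After(gateway.Expires)
}

func main() {
	rec := livenessRecord{GatewayID: "n1", Expires: time.Now().Add(-time.Second)}
	fmt.Println("abortable:", txnAbortable(rec, time.Now()))
}
```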
Lastly, it is probably the case that when there are a large number of long-running transactions, they occur on the same range. If they don't occur on the same range and there are a lot of them, then the user is probably holding it wrong and should fix something.
Additional context
Currently we don't ever really run long-running transactions as part of the normal course of business. We do, however, have a variety of different and bespoke leasing systems above the KV layer. These leases take two approaches: explicit heartbeating (SQL schema leases, table descriptor schema change leases) and epoch-based leasing (job leases). Each of these has its own problems and is super bespoke, storing records and determining status based on logic that interprets fields in protocol buffers. The purpose of these leases is mutual exclusion (locking). It turns out we've invested quite a bit in building distributed locking into our KV layer. We should make the locking in the KV layer work for the higher-level abstractions rather than continuously reinventing the wheel above it. In this vision of the world, holding a lock is likely to map to holding open a long-running transaction; hence my desire to make long-running transactions cheaper.
Also of import is the desire to prevent TPC-C runs from falling off a cliff as soon as transaction latency approaches the current 1s heartbeat interval.
This issue would be a really nice starter issue for a new member to the KV team.
Jira issue: CRDB-5182