improve latch documentation

taroface commented 3 years ago

Ryan Kuo (taroface) commented:

The documentation at https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer.html#latch-manager could more clearly differentiate between latches and locks (which together provide isolation between transactions).

Per @andreimatei here:

I think that text doesn’t make it very clear what the difference between latching and locking is. I’d try to make it clearer that latches are held only for the duration of evaluation and replication of a request, not for the duration of the txn (in contrast to locks). A way to think about it is that latches provide mutual exclusion for accesses to the lock table - they’re like locks for locks :)

Also, from a Slack convo re: locks and latches in transaction contention:

I now think that we should write something about it and explain what the deal is. In a nutshell, latching and locking together provide isolation between transactions. Latching is about isolation between individual requests: if you’ve got a read and an overlapping write evaluating at the same time, one of them needs to evaluate before the other. Locking is about isolation at the level of a transaction: say the write evaluated first, so its txn holds a lock (in the form of an intent); the lock persists until the txn finishes.
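To make the two lifetimes concrete, here is a rough toy model of the distinction described above. It is only a sketch: `ToyConcurrencyManager` and its method names are hypothetical and are not CockroachDB code or APIs. It assumes a single key, treats every request as conflicting, and models "waiting" as a returned status string rather than real blocking.

```python
class ToyConcurrencyManager:
    """Toy model: latches cover one request, locks cover a whole txn."""

    def __init__(self):
        self.latches = set()  # keys with a request evaluating or replicating
        self.locks = {}       # key -> id of the txn holding a lock (an intent)

    def begin_request(self, txn, key):
        # Latches serialize overlapping requests on the same key.
        if key in self.latches:
            return "wait-on-latch"
        # With no latch in the way, the lock table is consulted.
        holder = self.locks.get(key)
        if holder is not None and holder != txn:
            return "wait-on-lock"
        self.latches.add(key)
        self.locks[key] = txn  # the write leaves a lock behind
        return "evaluating"

    def finish_request(self, key):
        # Evaluation and replication are done: drop the latch, keep the lock.
        self.latches.discard(key)

    def finish_txn(self, txn):
        # Commit or abort: only now are the txn's locks released.
        for key in [k for k, t in self.locks.items() if t == txn]:
            del self.locks[key]


def demo():
    m = ToyConcurrencyManager()
    steps = [m.begin_request("txn1", "k")]      # txn1 starts writing k
    steps.append(m.begin_request("txn2", "k"))  # txn1's request is in flight
    m.finish_request("k")                       # txn1's request is replicated
    steps.append(m.begin_request("txn2", "k"))  # latch gone, lock remains
    m.finish_txn("txn1")                        # txn1 commits
    steps.append(m.begin_request("txn2", "k"))  # nothing left to wait on
    return steps


if __name__ == "__main__":
    print(demo())
```

In this toy, the second transaction first waits on a latch (a request is in flight) and later waits on a lock (the request is done but the transaction is not), which is the distinction the docs could spell out.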

And so you can see latching in the context of contention. A lot of times when you see latching, you also see locking, and the locking dominates. But that need not be the case: if most of a txn’s time is spent evaluating a single request (for example a big scan), then the latching can become painfully visible. When individual scans are taking a long time, it can be a sign of storage problems (for example yesterday it was a pathologically uncompacted Pebble store).

There’s also the common special case that I was referring to above about QueryIntent requests waiting for latches. Writes hold latches while they replicate, but the client is free to issue other requests at the same time (this is “transactional write pipelining”). When the client wants to commit, it sends one of these QueryIntent requests to the key of each pipelined write. These requests wait on latches until the respective replication is done. This doesn’t mean contention; it just means that replication is taking a while (which is expected when the client is close to the leaseholder, the transactions are short, and the replication is across regions).
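The QueryIntent case above can be illustrated with a small simulation. This is a sketch under stated assumptions, not CockroachDB behavior in detail: the function name, the `replication_delay` parameter, and the use of a `threading.Lock` as a stand-in for a latch on one key are all made up for illustration. It shows only that a request arriving behind an in-flight pipelined write waits for roughly the replication time, even though no other transaction is involved.

```python
import threading
import time


def pipelined_write_then_query_intent(replication_delay=0.2):
    latch = threading.Lock()     # stand-in for the latch on one key
    latched = threading.Event()  # set once the write holds the latch
    intents = []                 # toy stand-in for replicated intents

    def write():
        with latch:              # held through evaluation and replication
            latched.set()
            time.sleep(replication_delay)  # simulated slow replication
            intents.append("k")

    writer = threading.Thread(target=write)
    writer.start()
    latched.wait()               # the client moves on without waiting

    start = time.monotonic()
    with latch:                  # the QueryIntent-like check queues here
        found = "k" in intents
    waited = time.monotonic() - start

    writer.join()
    return found, waited


if __name__ == "__main__":
    found, waited = pipelined_write_then_query_intent()
    print(f"intent found: {found}, waited on latch for {waited:.2f}s")
```

The time spent waiting tracks `replication_delay`, not any conflicting transaction, which is why latch waits of this kind point at replication latency rather than contention.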

Jira Issue: DOC-961

taroface commented 3 years ago

@rmloveland This came up during a discussion about detecting transaction contention. Feel free to unassign or reassign!

rmloveland commented 3 years ago

Thanks for filing this @taroface !!!

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB docs!

cockroachdb / docs

improve latch documentation #9684