cockroachdb / cockroach


kv: mvcc garbage may cause a single row to be unavailable for writes #92705

Open jbowens opened 1 year ago

jbowens commented 1 year ago

Describe the problem

@DuskEagle mentioned this to me, and it seemed to me like something could be improved here.

If a single KV is written sufficiently many times or with sufficiently large values, the row eventually becomes unavailable for additional writes. The row's range can exceed the configured maximum size yet be ineligible for splitting, since all of its data consists of versions of a single key. Incoming writes are backpressured in the expectation that the range can be split.
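To illustrate why no split is possible, here is a minimal Go sketch (not CockroachDB's actual code; `mvccKey` and `findSplitKey` are illustrative names): MVCC versions sort under their user key, and a split boundary must fall between two distinct user keys, so a range holding only versions of one key has no legal split point.

```go
package main

import "fmt"

// mvccKey models a versioned key: the user key plus a timestamp.
// (Illustrative only; not CockroachDB's actual types.)
type mvccKey struct {
	userKey string
	ts      int64
}

// findSplitKey looks for a user-key boundary near the midpoint of the
// sorted key slice. If every entry is a version of the same user key,
// there is no legal boundary and the search fails.
func findSplitKey(keys []mvccKey) (string, bool) {
	mid := len(keys) / 2
	// Scan outward from the midpoint for a change in user key.
	for off := 0; off < len(keys); off++ {
		for _, i := range []int{mid + off, mid - off} {
			if i > 0 && i < len(keys) && keys[i].userKey != keys[i-1].userKey {
				return keys[i].userKey, true
			}
		}
	}
	return "", false // all versions of a single key: cannot split
}

func main() {
	// 1000 versions of one key: no valid split key exists.
	var single []mvccKey
	for ts := int64(0); ts < 1000; ts++ {
		single = append(single, mvccKey{"/Table/105/1/0/0", ts})
	}
	_, ok := findSplitKey(single)
	fmt.Println(ok) // false

	// Add one distinct user key and a boundary appears.
	mixed := append(single, mvccKey{"/Table/105/1/1/0", 0})
	k, ok := findSplitKey(mixed)
	fmt.Println(ok, k)
}
```

This is the shape of the failure in the error below: the split-key search over `[/Table/105/1/0/0, /Min)` never finds a user-key boundary.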

Updating a row very many times or with very large values is definitely an anti-pattern, and one that users should be steered away from, but the inability to write to the row without raising the range size or disabling backpressure seems like a very sharp guardrail. CC encountered this issue in CC-8421.

I would expect a gradual degradation of performance from the accumulated MVCC garbage, and potentially other performance problems from the over-sized range, but no hard loss of write availability of the row. Would we be able to avoid backpressuring writes to ranges that cannot be split?
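One hypothetical shape for that change, sketched in Go: only backpressure an oversized range when a valid split key actually exists. The function name, the boolean parameter, and the 2x multiple over the max size are all assumptions for illustration, not the actual implementation.

```go
package main

import "fmt"

// shouldBackpressure is a hypothetical variant of the size check proposed
// above: writes to an oversized range are backpressured only if the range
// has a valid split key, so a range made of versions of one huge key
// degrades gradually instead of rejecting writes outright. The 2x multiple
// over the max range size is an assumed threshold for illustration.
func shouldBackpressure(sizeBytes, maxBytes int64, hasSplitKey bool) bool {
	return sizeBytes > 2*maxBytes && hasSplitKey
}

func main() {
	const maxBytes = 512 << 20 // assumed range max size of 512 MiB

	// 2 GiB of versions of a single key: no split key, so this variant
	// would not hard-reject writes.
	fmt.Println(shouldBackpressure(2<<30, maxBytes, false)) // false

	// The same size spread over many keys: backpressure until it splits.
	fmt.Println(shouldBackpressure(2<<30, maxBytes, true)) // true
}
```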

To Reproduce

create table foo (id integer primary key, v text);
insert into foo (id, v) VALUES(0, array_to_string(ARRAY(select generate_series(1, 1000000)), ' '));
update foo set v=(select v||v from foo where id = 0) where id = 0;
update foo set v=(select v||v from foo where id = 0) where id = 0;
update foo set v=(select v||v from foo where id = 0) where id = 0;
update foo set v=(select v||v from foo where id = 0) where id = 0;

update foo set v=(select v from foo where id = 0) where id = 0;
# ... repeat ...
# ERROR: split failed while applying backpressure to Put [/Table/105/1/0/0,/Min), [txn: fa8036f0] on range r46:/{Table/105-Max} [(n1,s1):1, next=2, gen=2]: could not find valid split key
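For a sense of scale, a back-of-envelope estimate in Go of how fast the repro accumulates MVCC bytes (the 512 MiB range size is an assumption about the cluster's settings; byte counts ignore key and metadata overhead):

```go
package main

import "fmt"

// seriesBytes estimates the size of array_to_string over 1..n joined
// with single spaces: the digits of each number plus a separator.
func seriesBytes(n int) int {
	total := 0
	for i := 1; i <= n; i++ {
		total += len(fmt.Sprint(i)) + 1 // digits plus the joining space
	}
	return total - 1 // no trailing space
}

func main() {
	v := seriesBytes(1000000) // initial value, ~6.9 MB
	for i := 0; i < 4; i++ {
		v *= 2 // each v||v update doubles the live value
	}
	fmt.Printf("live value: ~%d MiB\n", v/(1<<20))

	// Every further update, even a no-op, writes another full MVCC
	// version of the value.
	const rangeMax = 512 << 20 // assumed range max size of 512 MiB
	versions, total := 0, 0
	for total < rangeMax {
		total += v
		versions++
	}
	fmt.Printf("versions until the range exceeds its max size: ~%d\n", versions)
}
```

Under these assumptions the live value is around 100 MiB after the four doublings, so only a handful of further updates push the range past its configured maximum, with no way to split it.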

Expected behavior

Gradual performance degradation without loss of write availability.

Additional data / screenshots

Jira issue: CRDB-21925

nvanbenschoten commented 1 year ago

Thanks for the write-up @jbowens!

I'm going to start by improving this error to link to https://www.cockroachlabs.com/docs/stable/common-errors.html#split-failed-while-applying-backpressure-are-rows-updated-in-a-tight-loop. I've done that in https://github.com/cockroachdb/cockroach/pull/93084.

Beyond that, this backpressure is an intentional guardrail to protect cluster health. We can't allow these ranges to grow indefinitely or we risk instability leaking across the entire cluster. So we need to slow down the rate of new writes somehow. In the past, we have discussed delaying new writes, but concluded that it was a worse UX than just erroring. The problem with delays is that they are difficult to isolate, diagnose, and understand. Instead of an error that immediately describes the problem and the solutions, customers are left asking "why is my cluster running slowly?" and "is CockroachDB broken?". Getting from that point to our FAQ then becomes quite an effort.

tbg commented 1 year ago

This also played a role in a recent support issue^1: the customer was (accidentally, I think) invoking 100k's worth of zone config changes, which caused backpressure on the spanconfig range, which had already split off everything but a single, large row.