Currently, if a single kv in a batch loses last write wins or had an unexpected previous value, the whole batch is retried, where each row (i.e. pk kv + secondary index kvs) is committed in their own txn. To improve LDR ingestion throughput, we need to avoid this debatching behavior either by:
Improve heuristics that decide whether to batch a set of kvs. In other words, given a set of kvs, the heuristics can decide not to batch them if the batch is doomed to fail. The code already attempts to prevent patching if it detects duplicate keys in the batch, and conducts a form of backoff to prevent future batching after a debatching event. We suspect these heuristics can only go so far.
Teach the kvTableWriter (and the stack below), to better handle errors within a batch. Here are some possible ideas that build on top of each other:
a. If a whole row (i.e. the pk kv and its secondary index kvs) failed, then commit the batch. If a row partially succeeded, then abort the whole batch and use the slow single row batch fallback. If the row fully failed, send the row back to the client to retry. We believe this improved policy will gracefully handle duplicate keys that the rangefeed emits randomaly throughout the replicating key space.
b. Don't abort the txn if we observed a partially completed row. Instead, "undo" the successful cputs on the partially successful row, and commit the txn. Retry the unsuccessful rows separately.
Two more ideas discussed that I don't quite fully follow:
Instead of using cputs, we read the values that exist then write out what we want like a normal transaction. The idea here is cputs don’t provide as much value outside of the 1PC and small txns. MB take: while I understand that 2 round trips to kv isn't too bad, given that we're batching keys, I don't quite understand how we would reconcile the events in our rangefeed and the data we read in a performant manner. Like why would this reconciliation process be more performant than cputs?
We model each row as its own save point and abort save points that have failed cputs. MB take: brb gotta go learn about save points.
Other thoughts in the back of my mind: staged mode stinks.
Currently, if a single kv in a batch loses last write wins or had an unexpected previous value, the whole batch is retried, where each row (i.e. pk kv + secondary index kvs) is committed in their own txn. To improve LDR ingestion throughput, we need to avoid this debatching behavior either by:
Two more ideas discussed that I don't quite fully follow:
Other thoughts in the back of my mind: staged mode stinks.
Jira issue: CRDB-44431