Have a FIFO queue of in-flight proposals. They are added sequentially, and have consecutive "lease indices", so no fancy data structures are needed (except maybe some MPSC support for concurrent writes, like in the proposal buffer).
In fact, this possibly can be integrated with the proposals buffer which is a similar queue. The only difference is that we would keep proposals in the queue until they are applied or got behind LAI, not just until they are flushed to raft.
Integrating with the proposal buffer though can be hardened by the fact that refreshProposalsLocked wants to push duplicates to it. So we may still need 2 separate queues.
When LAI state changes, wipe a prefix of this queue up to the new LAI (after reporting all the successfully applied commands). Decide on the unapplied / unknown to be applied commands:
For proposals that can be reproposed with a newer LAI, do it; push back to the queue correspondingly.
For proposals that can't be reproposed (and won't be), report a permanent failure.
For proposals that were jumped over (like when there was a snapshot), report an ambiguous failure and don't repropose.
When a replica is destroyed, or lease moves (figure out exactly what those conditions are), wipe any outstanding proposals in the queue and prevent inserts to it. Report ambiguous errors for them.
Not sure if the proposals map is needed in the first place. The above data structure is searchable by LAI in O(1) since LAI is strictly +1 incremental. The map would only be needed if we additionally need to search by the proposal ID, but I don't think we need to. We will use the command ID only for sanity checks to make sure that search in the queue by LAI finds the correct command.
We do need to track multi-proposal proposals though (those that update LAI and repropose). From the above, we have the invariant that an in-flight proposal is always in the queue and always exactly once, so we can store all this information right in the queue. Or we could factor out some bits that are shared across multiple reproposals.
Do not touch the proposals state (and proposals queue) from refreshProposalsLocked at all. All it needs to do is scan the queue, and insert slow proposals to the proposal buffer. This nicely separates concerns. We may replace the queue scanning with a better heap-based approach, so that we don't have to scan the entire queue all the time (as we do now).
The first queue assigns LAI and ClosedTimestamp sequentially, has no dups, and is thus searchable by LAI. A proposal stays in this queue until the replica's LAI moves past it, and then it is either reproposed or dropped. This queue can be wiped sequentially, so no proposal "leaks".
The second queue is a buffer between this one and raft. It doesn't need to be ordered by LAI for correctness (but needs to in order to make progress), and can have dups. In some sense it's LAI/CT-agnostic. Proposals are added to this queue once, and then periodically re-added if still present in the first queue.
Possibly the second queue/buffer isn't needed, because a raft batch can be constructed from the first data structure directly: scan it and send everything to raft except things that were recently sent.
Get rid of proposals map, in favor of the first queue which is intiutive and sequential.
Forking off https://github.com/cockroachdb/cockroach/issues/115020#issuecomment-1849918456.
I'm leaning towards a clearer design, like this:
Have a FIFO queue of in-flight proposals. They are added sequentially, and have consecutive "lease indices", so no fancy data structures are needed (except maybe some MPSC support for concurrent writes, like in the proposal buffer).
refreshProposalsLocked
wants to push duplicates to it. So we may still need 2 separate queues.When LAI state changes, wipe a prefix of this queue up to the new LAI (after reporting all the successfully applied commands). Decide on the unapplied / unknown to be applied commands:
When a replica is destroyed, or lease moves (figure out exactly what those conditions are), wipe any outstanding proposals in the queue and prevent inserts to it. Report ambiguous errors for them.
Not sure if the
proposals
map is needed in the first place. The above data structure is searchable by LAI in O(1) since LAI is strictly +1 incremental. The map would only be needed if we additionally need to search by the proposal ID, but I don't think we need to. We will use the command ID only for sanity checks to make sure that search in the queue by LAI finds the correct command.We do need to track multi-proposal proposals though (those that update LAI and repropose). From the above, we have the invariant that an in-flight proposal is always in the queue and always exactly once, so we can store all this information right in the queue. Or we could factor out some bits that are shared across multiple reproposals.
Do not touch the proposals state (and proposals queue) from
refreshProposalsLocked
at all. All it needs to do is scan the queue, and insert slow proposals to the proposal buffer. This nicely separates concerns. We may replace the queue scanning with a better heap-based approach, so that we don't have to scan the entire queue all the time (as we do now).Jira issue: CRDB-34385
Epic CRDB-39898