petermattis closed 6 years ago
Are we still planning on getting to this before the 2.0 release?
It's less crucial thanks to the quota pool, which limits the size of the uncommitted tail of the log, but it still has some value: this issue was created after the quota pool landed, and we thought it was worth doing even then.
I don't know if it's going to make the cut for 2.0, though. I think it ranks below fixing PreVote (#18151) as far as raft changes go.
I recall an incident post quota pool where we saw a very large uncommitted tail of the log due to re-proposals. I don't recall the details, but I think @a-robinson looked at this too and he has a fantastic memory for this stuff.
The incident I looked into (https://github.com/cockroachdb/cockroach/issues/15702) was pre-quota pool. A 40MB delete operation got re-proposed 66 times, kicking off an infinite cycle of raft elections. Even the first proposal triggered an election due to the high latency / low bandwidth, but if reproposals hadn't been allowed then things presumably wouldn't have spun so far out of control.
Around the same time, we also saw it during the uncommon combination of dropping a large database and running a restore at the same time while running on terrible disks (https://github.com/cockroachdb/cockroach/issues/15681).
The quota pool also doesn't prevent reproposals, and the Raft log could grow that way too.
I guess I should have checked the code. If that's the case, then consider me still fairly worried about this.
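To put rough numbers on the reproposal concern, here is a minimal sketch. It assumes a simplified model in which a command is charged against the quota once, on first proposal, and every reproposal appends another full copy to the uncommitted tail. The `quotaPool` type, the `uncommittedTail` helper, and the 64MB capacity are hypothetical and are not taken from the CockroachDB or etcd/raft code. The 40MB size and 66 reproposals are borrowed from #15702 (which predates the quota pool) purely for illustration.

```go
package main

import "fmt"

// quotaPool is a hypothetical, simplified stand-in for a proposal quota:
// it charges a command's size once, when the command is first proposed.
type quotaPool struct{ available int64 }

// acquire deducts n bytes from the pool, or reports false if too few remain.
func (q *quotaPool) acquire(n int64) bool {
	if n > q.available {
		return false
	}
	q.available -= n
	return true
}

// uncommittedTail returns the bytes added to the uncommitted log tail when a
// command of the given size is proposed once and then reproposed
// `reproposals` times without being charged against the quota again
// (an assumption of this sketch).
func uncommittedTail(size int64, reproposals int) int64 {
	return size * int64(1+reproposals)
}

func main() {
	const mb = 1 << 20
	q := &quotaPool{available: 64 * mb} // assumed capacity

	ok := q.acquire(40 * mb)           // the first proposal fits in the quota
	tail := uncommittedTail(40*mb, 66) // 1 proposal + 66 reproposals

	fmt.Println(ok, q.available/mb, tail/mb) // prints: true 24 2680
}
```

Under these assumptions the quota only ever sees 40MB, while the log tail grows to roughly 2.6GB, which is why a limit enforced inside raft itself would still be useful.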
@bdarnell
> I think it ranks below fixing PreVote (#18151) as far as raft changes go.
Just to let you know, we on the etcd side are also going to put some effort into fixing pre-vote in Q1 2018. We also want to enable it soon.
/cc @gyuho
The work will all be upstream in etcd/raft. Filing an issue here for tracking purposes.
Forked from a comment on #18199:
@petermattis says:
@bdarnell responds: