cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.13k stars 3.81k forks source link

storage: avoid reading uncommitted tail of Raft log when becoming leader #18601

Closed petermattis closed 6 years ago

petermattis commented 7 years ago

The work will all be upstream in etcd/raft. Filing an issue here for tracking purposes.

Forked from a comment on #18199:

13231 is mainly talking about reducing some inefficiencies in the scan for uncommitted config changes, but the scan would still be there. It's tricky to eliminate the scan completely given the current raft semantics (if we cached some information about the presence of config changes we'd have to update that when the uncommitted tail gets truncated).

Fortunately, with a little refactoring in etcd/raft I think we can avoid the need for a precise count. What we require here is to ensure that we never have more than one config change in flight at a time. If we simply assume pessimistically that the tail of the log has a config change (so that the new leader cannot propose a config change until it has applied all entries up to the point of its election) and we can skip the scan.

@petermattis says:

When would you clear raft.pendingConf? Currently that field gets cleared when the conf change is applied. Keeping track of the the current last index and watching for when that index is committed seems doable, but a bit tricky. Did you have a simpler idea in mind?

@bdarnell responds:

My idea is to track the last index (instead of a bool pendingConf, it would be configChangeBlockedUntilIndex). I don't think it will be tricky because it doesn't need to persist across leadership changes.

nvanbenschoten commented 6 years ago

Are we still planning on getting to this before the 2.0 release?

bdarnell commented 6 years ago

It's less crucial thanks to the quota pool (which limits the size of the uncommitted tail of the log), though it still has some value (we thought this was still worth doing even though this issue was created after the quota pool landed).

I don't know if it's going to make the cut for 2.0, though. I think it ranks below fixing PreVote (#18151) as far as raft changes go.

petermattis commented 6 years ago

I recall an incident post quota pool where we saw a very large uncommitted tail of the log due to re-proposals. I don't recall the details, but I think @a-robinson looked at this too and he has a fantastic memory for this stuff.

a-robinson commented 6 years ago

The incident I looked into (https://github.com/cockroachdb/cockroach/issues/15702) was pre-quota pool. A 40MB delete operation got re-proposed 66 times, kicking off the infinite cycle of raft elections. Even the first proposal triggered an election due to the high latency / low bandwidth, but if reproposals hadn't been allowed then things presumably wouldn't have spun so out of control.

Around the same time, we also saw it during the uncommon combination of a dropping a large database and running a restore at the same time while running on terrible disks (https://github.com/cockroachdb/cockroach/issues/15681).

18199 happened post-quota pool, though. I think understanding @bdarnell's questions in https://github.com/cockroachdb/cockroach/issues/18199#issuecomment-330252778 would be helpful for properly prioritizing this. I'll personally remain worried about it unless we know it actually happened while they were running 1.0.x or we understand how the quota pool didn't prevent it.

tbg commented 6 years ago

The quota pool also doesn't prevent reproposals, and the Raft log could grow that way too.

a-robinson commented 6 years ago

I guess I should have checked the code. If that's the case, then consider me still fairly worried about this.

bdarnell commented 6 years ago

https://github.com/coreos/etcd/pull/9073

xiang90 commented 6 years ago

@bdarnell

I think it ranks below fixing PreVote (#18151) as far as raft changes go.

Just let you know that we at etcd side is also going to put some effort to fix pre-vote in Q1 2018. We also want to enable it soon.

/cc @gyuho