etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Leader election getting triggered on snapshot #9962

Closed: lafolle closed this issue 6 years ago

lafolle commented 6 years ago

Configuration

etcd Version: 3.3.8
Git SHA: 33245c6b5
Go Version: go1.10.3
Go OS/Arch: linux/amd64
--snapshot-count: 100
--heartbeat-timeout: 100ms
--election-timeout: 1000ms
Number of nodes in the store: 1M

What we observed

Leader elections are triggered while the server is taking a snapshot.

What we think is the cause

Abbreviations: *Gn* = goroutine n, *En* = event/point of interest n (marked inline below), EL = event loop.

Here is a rough flow of control, starting from sending a heartbeat.

+ *G1* raftNode.start() ticks at heartbeatTimeout: r.ticker.C -> t.tick()
  + raftNode.tick()
    + raft/node.Tick() =writes to n.tickc channel=
      + wakes up *G2*
+ *G2* raft/node.run() picks from tickc channel r.tickHeartbeat()
  + r.Step(type = pb.MsgBeat)
    + r.stepLeader()
      + r.bcastHeartbeat()
        + r.bcastHeartbeatWithCtx(nil)
          + r.sendHeartbeat()
            + r.send() =appends msg to r.msgs=
+ *G2* raft/node.run() EL, next iteration: creates payload (rd = newReady)
  and sets readyc = n.readyc if advancec is nil.
  + readyc <- rd
    + wakes up *G1*
+ *G1* raftNode.start() EL read from readyc
  + r.applyc <- apply /E1: SEND DATA/
    + wakes up *G4*
  + r.transport.Send =if leader= /E2: HEARTBEAT SENT HERE/
+ *G4* etcdserver.run EL
  + ap := <-s.r.apply() (<-applyc)
    + asynchronously schedules applyAll()
      - execution (logically) shifts to sched *G3*.
  + <-getSync() =ticking every 500ms=
    + s.store.HasTTLKeys()
      + store.worldLock.RLock() /E3/
      + store.worldLock.RUnlock()
+ *G3* FifoScheduler.run EL picks job
  + etcdServer.applyAll()
    + etcdServer.applySnapshot()
      + <-apply.notifyc
    + etcdServer.applyEntries()
    + <-apply.notifyc
    + etcdServer.triggerSnapshot()
      + etcdServer.snapshot() =if applicable=
        + etcdServer.store.Clone()
          + store.worldLock.Lock() /E4: STORE FULL LOCK AT SNAPSHOT/
          + ---some work---
          + store.worldLock.Unlock()

When a snapshot is triggered, G3:E4 can take more than 1s to finish (our store has ~1M nodes). While that write lock is held, G4 possibly blocks at G4:E3 the next time getSync fires (default: every 500ms), so G4 stops draining applyc; that in turn blocks G1's send at G1:E1, which delays the heartbeat send at G1:E2 past the election timeout.

Possible Solution

  1. If our reasoning is correct then, for one, the work done in getSync() could be moved into a goroutine. This did in fact make the cluster stable, though since we are not very familiar with etcd's internals, we have not thought through the possible negative repercussions.
  2. Make Clone take less time: is it possible to make it differential, i.e. not walking the whole tree but only the subset that changed, to figure out what goes in the snapshot?

For the record, we initially faced this problem with v2.2.0, but there the snapshot was taken synchronously in the event loop, so store.Clone directly caused missed heartbeats. We therefore bumped the version to v3.3.8 but still faced the problem, which tripped us up because the snapshot appeared to be taken asynchronously.

Let me know if more data is needed.

Thanks, Karan

xiang90 commented 6 years ago

the non-blocking snapshot only works for the etcd3 backend, not the etcd2 backend. simply changing the version of etcd won't help; you actually need to migrate the data from the etcd2 backend to etcd3.

lafolle commented 6 years ago

Hi @xiang90 , migration is not a problem for us.

But when we loaded etcd v3.3.8 with test data (>1M keys), it started triggering elections specifically at snapshot time, for the reasons I have mentioned above.

Do you think our reasoning is valid?

Thanks, Karan

xiang90 commented 6 years ago

how do you load the test data?

lafolle commented 6 years ago

We load data using go etcd client library.

The problem, we think, is that etcdserver.store.HasTTL() gets stuck waiting for the lock during snapshotting (store.Clone acquires the full write lock and can hold it for a long time), which delays sending heartbeats to followers.
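A Go detail makes this worse: per the sync.RWMutex documentation, once a goroutine is blocked in Lock(), new RLock() calls are held off until that writer gets and releases the lock. So a HasTTL-style reader can stall as soon as Clone merely *requests* the write lock. A minimal demonstration (names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// newReaderBlocked reports whether a fresh RLock() is held up behind a
// goroutine that is merely *waiting* in Lock() (Go blocks new readers
// behind a pending writer to prevent writer starvation).
func newReaderBlocked() bool {
	var mu sync.RWMutex
	mu.RLock() // an existing reader holds the lock

	go func() {
		mu.Lock() // the "Clone": blocks behind the existing reader
		mu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond) // let the writer start waiting

	done := make(chan struct{})
	go func() {
		mu.RLock() // the "HasTTL": a brand-new reader
		mu.RUnlock()
		close(done)
	}()

	var blocked bool
	select {
	case <-done:
		blocked = false // the new reader got in ahead of the pending writer
	case <-time.After(50 * time.Millisecond):
		blocked = true // the new reader is queued behind the pending writer
	}
	mu.RUnlock() // release the original reader so everyone can finish
	return blocked
}

func main() {
	fmt.Println("new reader blocked behind pending writer:", newReaderBlocked())
}
```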

etcdserver/v2store.HasTTL(): https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1051
etcdserver/v2store.Clone(): https://github.com/coreos/etcd/blob/master/etcdserver/api/v2store/store.go?#L750

xiang90 commented 6 years ago

@lafolle You are still using the v2 backend. See what I said above. I am going to close this issue, as we are not going to scale up the v2 backend.

Belyenochi commented 4 years ago

I reproduced the problem in etcd 3.2.3.