etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.24k stars, 9.7k forks

Investigate processes executed by leader #15944

Open serathius opened 1 year ago

serathius commented 1 year ago

What would you like to be added?

Etcd has multiple processes that are executed by the leader; however, neither etcd nor raft guarantees that only one cluster member identifies itself as the leader at any given time.

Raft only guarantees that committed entries will not be lost, by ensuring that they are persisted on a quorum of members and that electing a leader requires a quorum. However, there are periods of time when two members can both consider themselves the leader, even though the old one no longer has quorum and cannot commit any entries. An example is when the leader is disconnected from the other members and they elect a new leader: for a period of time, both the old and the new leader are present.
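To make the scenario concrete, here is a minimal toy model of a three-member cluster. It is only a sketch: the `member` type and the functions `selfReportedLeaders`, `canCommit`, and `partitionAndElect` are hypothetical names made up for illustration and are not part of etcd or the raft library.

```go
package main

import "fmt"

// member is a toy model of a cluster member (hypothetical, not an etcd type).
type member struct {
	id     int
	term   uint64
	leader bool // what the member believes about itself
}

// selfReportedLeaders returns the IDs of all members that believe they lead.
func selfReportedLeaders(ms []*member) []int {
	var ids []int
	for _, m := range ms {
		if m.leader {
			ids = append(ids, m.id)
		}
	}
	return ids
}

// canCommit reports whether a leader that reaches `reachable` members
// (counting itself) still has a quorum in a cluster of clusterSize.
func canCommit(reachable, clusterSize int) bool {
	return reachable >= clusterSize/2+1
}

// partitionAndElect models member 1 (the old leader) being cut off while
// members 2 and 3 elect member 2 in the next term. Member 1 hears nothing,
// so its own state is left untouched.
func partitionAndElect(ms []*member) {
	newTerm := ms[0].term + 1
	ms[1].term, ms[1].leader = newTerm, true
	ms[2].term = newTerm
}

func main() {
	ms := []*member{
		{id: 1, term: 1, leader: true},
		{id: 2, term: 1},
		{id: 3, term: 1},
	}
	partitionAndElect(ms)

	fmt.Println(selfReportedLeaders(ms))          // [1 2]
	fmt.Println(canCommit(1, 3), canCommit(2, 3)) // false true
}
```

In this model, member 1 still reports itself as leader after the partition, but it reaches only itself and so cannot commit. Any leader-only process that checks nothing but the local flag would keep running on it.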

TODO:

Why is this needed?

To surface the incorrect pattern used in etcd and improve the codebase.

mitake commented 1 year ago

The requests below can be issued by a stale leader and are related to this issue:

It’s not related to the stale leader issue, but Alarm might behave similarly. In theory, a stale leader can take a long time between checking the space and issuing an alarm request in Put (or Txn, or LeaseGrant).

I think a simple solution might be to not use Raft messages of type MsgProp for these requests. If an etcd server could issue a MsgApp message directly, the message would carry term information and could be rejected if the cluster has a newer leader.
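The term-based rejection idea can be sketched roughly as follows. This is a simplified model, not the raft library's code: the `message` and `follower` types, the lowercase `msgProp`/`msgApp` constants, and the `step` method are hypothetical stand-ins, and the assumption that a proposal carries no term is part of the model.

```go
package main

import "fmt"

type msgType int

const (
	msgProp msgType = iota // proposal: carries no term in this model
	msgApp                 // append: stamped with the sender's term
)

// message is a toy message (hypothetical, not the raft library's type).
type message struct {
	typ  msgType
	term uint64 // 0 means no term attached
	data string
}

// follower keeps only what this sketch needs: its term and its log.
type follower struct {
	term uint64
	log  []string
}

// step accepts an append only when the sender's term is not older than the
// follower's own term. Anything else is rejected and the log is unchanged.
func (f *follower) step(m message) bool {
	if m.typ != msgApp || m.term < f.term {
		return false
	}
	f.term = m.term
	f.log = append(f.log, m.data)
	return true
}

func main() {
	f := &follower{term: 2} // this follower already voted in term 2

	stale := message{typ: msgApp, term: 1, data: "from old leader"}
	fresh := message{typ: msgApp, term: 2, data: "from new leader"}

	fmt.Println(f.step(stale)) // false
	fmt.Println(f.step(fresh)) // true
	fmt.Println(f.log)         // [from new leader]
}
```

The point of the sketch is that the term travels with the message, so the receiver can detect staleness on its own without asking the sender whether it still believes it is the leader.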

stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

ahrtr commented 10 months ago

Another example: s.raftStatus() might return stale info if the current leader is network-partitioned. https://github.com/etcd-io/etcd/blob/3347568cc0684326a71cb110204a90827bf00399/server/etcdserver/server.go#L1491-L1500
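As a rough illustration of why a locally cached status can mislead, the toy sketch below contrasts a purely local leadership check with one that also requires acknowledgement from a quorum. The `localStatus` type and both functions are hypothetical and do not reflect the actual return value of s.raftStatus() or any etcd API.

```go
package main

import "fmt"

// localStatus is a toy stand-in for a member's cached view of the cluster
// (hypothetical, not the raft library's status type).
type localStatus struct {
	id   uint64
	lead uint64
	term uint64
}

// isLeaderLocally trusts only the cached view, so it can be stale.
func isLeaderLocally(s localStatus) bool {
	return s.lead == s.id
}

// isLeaderConfirmed also requires that a quorum of members report the same
// term. ackTerms holds the term reported by each reachable member,
// including the caller itself.
func isLeaderConfirmed(s localStatus, ackTerms []uint64, clusterSize int) bool {
	if !isLeaderLocally(s) {
		return false
	}
	agree := 0
	for _, t := range ackTerms {
		if t == s.term {
			agree++
		}
	}
	return agree >= clusterSize/2+1
}

func main() {
	// Member 1 was leader in term 1 and is now partitioned: it hears only itself.
	s := localStatus{id: 1, lead: 1, term: 1}

	fmt.Println(isLeaderLocally(s))                     // true
	fmt.Println(isLeaderConfirmed(s, []uint64{1}, 3))   // false
	fmt.Println(isLeaderConfirmed(s, []uint64{1, 1}, 3)) // true
}
```

The local check keeps returning true during the partition, while the confirmed check fails as soon as the member can no longer gather a quorum for its term.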