Dead FEs should not be used for FE consensus/leader election

StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

https://starrocks.io

Apache License 2.0

9.28k stars 1.84k forks source link

Dead FEs should not be used for FE consensus/leader election #32334

Closed mcgray closed 11 months ago

mcgray commented 1 year ago

What we see currently is that in case we are terminating FEs one by one (say to apply updates to node/SR binary) in case we are NOT cleaning up dead nodes via "alter system drop follower" cluster becomes broken with error that leader cannot be established. We suggest to NOT include dead nodes in an election process.

imay commented 1 year ago

You can add frontends by alter system add observer, then the frontends will serve as a observer which doesn't participate in the election.

mcgray commented 1 year ago

This is not what I need actually. I need a nodes that would be able to become a leader.

mcgray commented 1 year ago

My scenario. Having 3 FE nodes inside ASG. Terminating one. It becomes unavailable in 'SHOW FRONTENDS'. New one appears in the list and it is available and active (Follower). Next terminating another "old" node. It is being replaced with one. So I have Leader. Two active followers and two dead followers. That is it. I can no longer access the cluster. In case I am removing dead followers after each terminations - no issues. But I do not see the reason I should do that.

nshangyiming commented 1 year ago

Hi @mcgray, thanks for reporting the issue.

By saying "Terminating one", you mean just terminate one FE process or the vm, right? I understand you want to replace the FE node one by one, but just terminating the FE process is not enough, because SR cannot identify the situation whether you want to replace that node or just that node have some fatal error, like vm crash, and later you want to recover it.

SR use a consensus algorithm to keep consistency between FE nodes and also process leader election. What you're doing is actually changing the membership of the consensus group. So you need to explicitly remove the original member from the group, that's the requirement of the consensus algorithm. For SR, that will be executing alter system drop follower "fe_host:edit_log_port"; before terminating the node.

mcgray commented 1 year ago

Thank you for quick response @nshangyiming . I clearly understand that this is the way your consensus works. What is not clear for me is why you are using nodes that are not healthy/responding in your consensus group. Let me simply make a suggestion - can you remove dead nodes after certain time from this group?

alberttwong commented 1 year ago

I believe that not healthy nodes are part of the Raft Consensus Algorithm design because they can be healthy in the future. https://raft.github.io/