
Zeebe can force remove a set of brokers #16145

Closed deepthidevaki closed 7 months ago

deepthidevaki commented 8 months ago

**Context**
Zeebe is deployed in 2 regions or 2 data centers. The replication factor and partition distribution ensure that the quorum spans both regions. So when one region fails, no partition will have a leader. As a result, Zeebe is unavailable and cannot process any user requests.

**Goal**
When one region fails, a user (admin/operator) can send an API call to Zeebe to remove the brokers in the failed region. This call should remove those brokers from the replication group of all partitions they were part of. When this operation completes, the number of replicas for each partition is reduced to include only the brokers available in the surviving region. The effective replication factor is reduced: if the replicationFactor was 4 before, it is now 2, so the quorum required to elect a leader is also 2.
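
As a quick sanity check on the numbers above, here is a minimal sketch of the standard Raft majority-quorum arithmetic (quorum = floor(replicas / 2) + 1); it is purely illustrative and not tied to any Zeebe API:

```java
public class QuorumMath {
  // Standard Raft majority quorum: floor(n / 2) + 1.
  static int quorum(int replicas) {
    return replicas / 2 + 1;
  }

  public static void main(String[] args) {
    // Before the region failure: 4 replicas, spread 2 + 2 across two regions.
    System.out.println(quorum(4)); // 3 -> unreachable with only 2 surviving replicas
    // After force-removing the 2 brokers of the failed region: 2 replicas remain.
    System.out.println(quorum(2)); // 2 -> the surviving region can elect a leader again
  }
}
```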

### Tasks
- [ ] https://github.com/camunda/zeebe/issues/16148
- [ ] https://github.com/camunda/zeebe/issues/16226
- [ ] https://github.com/camunda/zeebe/issues/16313
- [ ] https://github.com/camunda/zeebe/issues/16314
- [ ] https://github.com/camunda/zeebe/issues/16315
- [ ] https://github.com/camunda/zeebe/issues/16312
- [ ] https://github.com/camunda/zeebe/issues/16316
deepthidevaki commented 7 months ago

I'm thinking about how the endpoints should look. Here is my proposal.

To force remove a set of brokers, the user has to issue a force scale-down with the surviving members. For example, in a cluster with 4 brokers, to remove brokers 1 and 3:

POST actuator/cluster/brokers?force=TRUE  ["0", "2"]

Here the endpoint is the same as for a scale down, but the parameter force=TRUE means it is a force removal, i.e. we do not re-distribute partitions to keep the replicationFactor the same.
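
For illustration only, issuing this proposed call from a client could look roughly like the sketch below; the management host/port (localhost:9600) and content type are assumptions, since the endpoint itself is still a proposal:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ForceScaleDown {
  public static void main(String[] args) throws Exception {
    // Proposed force scale-down: the body lists the *surviving* brokers (0 and 2),
    // i.e. brokers 1 and 3 are force removed. Host and port are assumptions.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9600/actuator/cluster/brokers?force=TRUE"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString("[\"0\", \"2\"]"))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```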

When removing 0 and 2, the coordinator broker 0 is not available. We can handle this in two ways. Internally, we detect that 0 is being removed and use the next lowest available id, which should be 1 (this should be handled in #16315). Or we can allow the user to specify the coordinator id:

POST actuator/cluster/brokers?force=TRUE&coordinator=1 ["1", "3"]
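
A minimal sketch of the first option (internally picking the next lowest id among the surviving brokers as coordinator); the class and method names are hypothetical:

```java
import java.util.Set;

public class CoordinatorSelection {
  // Illustrative only: choose the lowest id among the surviving brokers as the
  // coordinator of the force configuration change (proposal for #16315).
  static int chooseCoordinator(Set<Integer> survivingBrokers) {
    return survivingBrokers.stream()
        .mapToInt(Integer::intValue)
        .min()
        .orElseThrow(() -> new IllegalArgumentException("no surviving brokers"));
  }

  public static void main(String[] args) {
    System.out.println(chooseCoordinator(Set.of(1, 3))); // 1, as in the example above
  }
}
```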

To add the brokers back, we can either have a generic endpoint that accepts a new topology as input:

POST actuator/cluster { <new topology> }

Or we can use the same endpoint as for scaling, but with an additional parameter for the new replicationFactor:

POST actuator/cluster/brokers?replicationFactor=4   ["0", "1", "2", "3"]

The generic endpoint would be useful because it can cover use cases for which a specific endpoint is not available.
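
For completeness, a hedged sketch of the second variant (re-adding brokers 1 and 3 via the scaling endpoint with a replicationFactor parameter); again, the host/port and exact parameter handling are assumptions:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ScaleBackUp {
  public static void main(String[] args) {
    // Proposed call to re-add brokers 1 and 3 and restore replicationFactor=4;
    // it would be sent the same way as the force scale-down sketched earlier.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9600/actuator/cluster/brokers?replicationFactor=4"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString("[\"0\", \"1\", \"2\", \"3\"]"))
        .build();
    System.out.println(request.method() + " " + request.uri());
  }
}
```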

deepthidevaki commented 7 months ago

Usually only broker 0 is allowed to start a configuration change. But during a force configuration change, broker 0 might be removed, which means it cannot be used as the coordinator to process the request. In that case, we want to choose another coordinator. However, we have to prevent possible conflicting changes if we choose different coordinators at the same time. For example, in an 8-broker cluster, let's say:

  1. The user calls force remove (0,2,4,6). Surviving members are 1,3,5,7 => broker 1 will be used as the coordinator.
  2. The user calls force remove (2,3,4,6). Surviving members are 0,1,5,7 => broker 0 will be the coordinator.

If the user issues both requests in parallel, two concurrent, conflicting "ClusterTopology" versions will be gossiped at the same time. This can leave the cluster in an undefined state, and the Raft configuration changes will also conflict. The current expectation is that users ensure this does not happen, but we should also try to prevent it on a best-effort basis. We cannot prevent it 100%, because these are force operations and the assumption is that there is no clear majority to choose a coordinator unambiguously.

However, to detect/prevent conflicting topology changes that arise due to force configuration changes, @npepinpe and I discussed the following idea.

Before starting a forced configuration change, the coordinator must get approval from all the "surviving" members. This approval process could implement something like a 2-phase commit protocol (details can be discussed later). The idea is that in the first operation, brokers 1,3,5,7 agree and enter the configuration change first. If the user triggers the second request in parallel, its coordinator cannot get agreement because 1, 5 and 7 are already in the process of applying those changes, so the second request cannot be executed. The goal is to handle this in the cluster topology management so that we don't have to handle conflict detection or prevention in Raft.
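
A rough, purely illustrative sketch of what such an approval round could look like in the cluster topology management layer; the class, method names and state handling are all assumptions, since the details are intentionally left open above:

```java
import java.util.Optional;
import java.util.Set;

public class ForceChangeApproval {

  // Illustrative member state: a member that has already accepted a pending
  // force change rejects any other, conflicting force change.
  static class Member {
    private Optional<String> pendingChangeId = Optional.empty();

    synchronized boolean prepare(String changeId) {
      if (pendingChangeId.isPresent() && !pendingChangeId.get().equals(changeId)) {
        return false; // already committed to a different force change
      }
      pendingChangeId = Optional.of(changeId);
      return true;
    }
  }

  // Phase 1 of a 2-phase-commit-like round: the coordinator only proceeds
  // with the forced configuration change if *all* surviving members accept it.
  static boolean prepareAll(Set<Member> survivingMembers, String changeId) {
    return survivingMembers.stream().allMatch(member -> member.prepare(changeId));
  }
}
```

In this sketch, the coordinator of the second request would get `false` back because members 1, 5 and 7 already hold a different pending change, which matches the behaviour described above.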

This would not prevent all cases. For example, the user can issue two requests, force remove (0,2,4,6) and force remove (1,3,5,7), in parallel and end up in a split-brain situation. We cannot detect or prevent this. Because "force" operations are inherently unsafe, we expect users to know what they are doing. However, it still makes sense to detect/prevent whichever cases we can.

For the first iteration, we will keep it simple and not implement this conflict detection. We assume that only one force request is issued at a time, but we will add this feature as soon as possible.