I find that you use many `dassert`s to validate the update action, and I have commented on some of them. My suggestion is that updating the replica count is not on a key read-write path, so its failure will not have a real-time impact on the business. You just need to return an error saying the update has not succeeded, rather than bringing down the machine.
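For example, instead of crashing on a bad value, the validation could report the error back to the caller. A minimal sketch; the response type and function name here are hypothetical, not the actual meta server code:

```cpp
#include <string>

// Hypothetical response type standing in for the meta server's RPC response.
struct update_replica_count_response {
    bool ok = false;
    std::string hint_message;
};

// Before: dassert(new_max_replica_count > 0, "invalid max_replica_count");
// would bring the meta server down on bad input.
//
// After: the bad input is reported back to the caller and the server keeps running.
void validate_new_replica_count(int new_max_replica_count,
                                update_replica_count_response &resp)
{
    if (new_max_replica_count <= 0) {
        resp.ok = false;
        resp.hint_message = "max_replica_count must be positive";
        return;
    }
    resp.ok = true;
}
```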
I think it is strange to define two functions, `update_partition_max_replica_count` and `update_next_partition_max_replica_count`.
You could update each partition's max replica count in a loop instead, for example:
```cpp
for (auto i = 0; i < partition_count; i++) {
    update_partition_max_replica_count(i);
}
```
Besides, I notice that #1072 supports updating the app's max_replica_count, and in this PR, if updating a partition's max_replica_count fails, the app's value should be reverted. I suggest restructuring the code like this:
```cpp
void update_app_max_replica_count(int new_replica_count)
{
    // 1. Mark the app as "updating replica count" to avoid repeated requests
    //    (this flag should be stored on ZooKeeper; you can add a new field in envs).
    // 2. Update each partition's replica count.
    // 3. After all partitions have been updated successfully, update the app's
    //    replica count and clear the "updating replica count" flag.
    // The functions server_state::do_app_drop and server_state::process_one_partition
    // may be helpful.
}
```
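In other words, the per-partition updates are driven from the app-level update, the app-level value is only written after every partition succeeds, and a flag in the app envs guards against repeated requests. A minimal sketch of that flow under these assumptions; the helper names and `status` type are made up, and the real code would use dsn::error_code and the meta server's remote-storage utilities to persist changes on ZooKeeper:

```cpp
// Hypothetical status type and stubbed helpers; in the real implementation each
// helper would persist its change on ZooKeeper.
enum class status { ok, failed };

// Persist the "updating replica count" flag in the app envs (stubbed here).
status mark_replica_count_updating(bool updating) { (void)updating; return status::ok; }
// Update one partition's max_replica_count (stubbed here).
status update_partition_max_replica_count(int pidx, int n) { (void)pidx; (void)n; return status::ok; }
// Update the app-level max_replica_count (stubbed here).
status commit_app_max_replica_count(int n) { (void)n; return status::ok; }

status update_app_max_replica_count(int partition_count, int new_replica_count)
{
    // 1. Mark the app as "updating" so repeated requests can be rejected.
    if (mark_replica_count_updating(true) != status::ok)
        return status::failed;

    // 2. Update every partition; stop on the first failure so the app-level value
    //    never gets ahead of the partition-level values.
    for (int i = 0; i < partition_count; ++i) {
        if (update_partition_max_replica_count(i, new_replica_count) != status::ok) {
            mark_replica_count_updating(false); // app-level value stays unchanged, nothing to revert
            return status::failed;
        }
    }

    // 3. Only after all partitions succeed, update the app-level value and clear the flag.
    const status s = commit_app_max_replica_count(new_replica_count);
    mark_replica_count_updating(false);
    return s;
}
```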
https://github.com/apache/incubator-pegasus/issues/865