apache / rocketmq-operator

Apache RocketMQ Operator
https://rocketmq.apache.org/
Apache License 2.0
Broker shows `alive=false` in controller `getSyncStateSet` command #213

Open drivebyer opened 4 months ago

drivebyer commented 4 months ago

BUG REPORT

  1. Please describe the issue you observed: I deployed three controllers, two brokers, and one nameserver using the operator. After all pods were ready, I ran commands on the nameserver and on a controller.

On the nameserver, I ran the following command:

[root@master0 ~]# kubectl -n mcamel-system exec -it name-service-0 -- ./mqadmin clusterList -n 127.0.0.1:9876
#Cluster Name           #Broker Name            #BID  #Addr                  #Version              #InTPS(LOAD)     #OutTPS(LOAD)  #Timer(Progress)        #PCWait(ms)  #Hour         #SPACE    #ACTIVATED
broker                  broker-0                0     192.168.137.126:10911  V5_1_4                 0.00(0,0ms)       0.00(0,0ms)  0-0(0.0w, 0.0, 0.0)               0  474775.65     0.6800          true
broker                  broker-0                2     192.168.84.199:10911   V5_1_4                 0.00(0,0ms)       0.00(0,0ms)  2-0(0.0w, 0.0, 0.0)               0  474775.65     0.6500         false

This output looked as expected.

On the controller, I executed:

[root@master0 ~]# kubectl -n mcamel-system exec -it controller-1 -- ./mqadmin getSyncStateSet -a 127.0.0.1:9878 -c broker -b broker-0

#brokerName broker-0
#MasterBrokerId 1
#MasterAddr 192.168.137.126:10911
#MasterEpoch    1
#SyncStateSetEpoch  1
#SyncStateSetNums   1

InSyncReplica:  ReplicaIdentity{brokerName='broker-0', brokerId=1, brokerAddress='192.168.137.126:10911', alive=true}

NotInSyncReplica:   ReplicaIdentity{brokerName='broker-0', brokerId=2, brokerAddress='192.168.84.199:10911', alive=false}

The controller reports the broker at 192.168.84.199:10911 (brokerId=2) as not alive (`alive=false`), even though the nameserver lists it in the cluster.
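To check for this condition repeatedly, the `getSyncStateSet` output can be scanned for replicas reported with `alive=false`. The following is a minimal sketch, not part of RocketMQ or the operator: the helper name `dead_replicas` and the regular expression are my own, and it assumes the `ReplicaIdentity{...}` format is exactly as printed above (other RocketMQ versions may print it differently).

```python
import re
from typing import List, Tuple

# Assumed format: one ReplicaIdentity{...} entry as printed by
# `mqadmin getSyncStateSet` in the output above (RocketMQ 5.1.4).
REPLICA_RE = re.compile(
    r"ReplicaIdentity\{brokerName='(?P<name>[^']*)', "
    r"brokerId=(?P<id>\d+), "
    r"brokerAddress='(?P<addr>[^']*)', "
    r"alive=(?P<alive>true|false)\}"
)


def dead_replicas(output: str) -> List[Tuple[int, str]]:
    """Return (brokerId, brokerAddress) for every replica with alive=false."""
    return [
        (int(m.group("id")), m.group("addr"))
        for m in REPLICA_RE.finditer(output)
        if m.group("alive") == "false"
    ]


# Sample text copied from the getSyncStateSet output in this report.
SAMPLE = """\
InSyncReplica:  ReplicaIdentity{brokerName='broker-0', brokerId=1, brokerAddress='192.168.137.126:10911', alive=true}

NotInSyncReplica:   ReplicaIdentity{brokerName='broker-0', brokerId=2, brokerAddress='192.168.84.199:10911', alive=false}
"""

if __name__ == "__main__":
    print(dead_replicas(SAMPLE))  # prints [(2, '192.168.84.199:10911')]
```

In practice the text would come from the `kubectl ... exec ... ./mqadmin getSyncStateSet` command shown above, for example by piping its output into the script.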

I also found the following error repeating in the log of the master broker at 192.168.137.126:10911:

2024-02-29 15:50:26 ERROR AutoSwitchHAService_Executor_1 - Error happen when change SyncStateSet, broker:broker-0, masterAddress:192.168.137.126:10911, masterEpoch:1, oldSyncStateSet:[1], newSyncStateSet:[1, 2], syncStateSetEpoch:1
org.apache.rocketmq.client.exception.MQBrokerException: CODE: 2006  DESC: Rejecting alter syncStateSet request because the replicas {2} don't alive
For more information, please visit the url, https://rocketmq.apache.org/docs/bestPractice/06FAQ
    at org.apache.rocketmq.broker.out.BrokerOuterAPI.alterSyncStateSet(BrokerOuterAPI.java:1215)
    at org.apache.rocketmq.broker.controller.ReplicasManager.doReportSyncStateSetChanged(ReplicasManager.java:761)
    at org.apache.rocketmq.store.ha.autoswitch.AutoSwitchHAService.lambda$null$0(AutoSwitchHAService.java:263)
    at java.util.ArrayList.forEach(ArrayList.java:1257)
    at org.apache.rocketmq.store.ha.autoswitch.AutoSwitchHAService.lambda$notifySyncStateSetChanged$1(AutoSwitchHAService.java:263)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2024-02-29 15:50:30 INFO ReplicasManager_ScheduledService_1 - Update controller leader address to controller-1.controller-svc-headless:9878
2024-02-29 15:50:31 ERROR AutoSwitchHAService_Executor_1 - Error happen when change SyncStateSet, broker:broker-0, masterAddress:192.168.137.126:10911, masterEpoch:1, oldSyncStateSet:[1], newSyncStateSet:[1, 2], syncStateSetEpoch:1
org.apache.rocketmq.client.exception.MQBrokerException: CODE: 2006  DESC: Rejecting alter syncStateSet request because the replicas {2} don't alive
For more information, please visit the url, https://rocketmq.apache.org/docs/bestPractice/06FAQ
    at org.apache.rocketmq.broker.out.BrokerOuterAPI.alterSyncStateSet(BrokerOuterAPI.java:1215)
    at org.apache.rocketmq.broker.controller.ReplicasManager.doReportSyncStateSetChanged(ReplicasManager.java:761)
    at org.apache.rocketmq.store.ha.autoswitch.AutoSwitchHAService.lambda$null$0(AutoSwitchHAService.java:263)
    at java.util.ArrayList.forEach(ArrayList.java:1257)
    at org.apache.rocketmq.store.ha.autoswitch.AutoSwitchHAService.lambda$notifySyncStateSetChanged$1(AutoSwitchHAService.java:263)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
  2. Please tell us about your environment: RocketMQ 5.1.4

  3. Other information (e.g. detailed explanation, logs, related issues, suggestions how to fix, etc): When I deploy a single-replica controller, this issue does not occur.

drivebyer commented 4 months ago

@caigy PTAL

drivebyer commented 4 months ago

See the discussion in the main repository: https://github.com/apache/rocketmq/discussions/7877