apache / rocketmq

Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
https://rocketmq.apache.org/
Apache License 2.0
21.22k stars 11.68k forks source link

Broker can not service after the broker and namesrv are disconnected #2297

Closed tank1314 closed 4 years ago

tank1314 commented 4 years ago

The issue tracker is ONLY used for bug report(feature request need to follow RIP process). Keep in mind, please check whether there is an existing same report before your raise a new one.

Alternately (especially if your communication is not a bug report), you can send mail to our mailing lists. We welcome any friendly suggestions, bug fixes, collaboration and other improvements.

Please ensure that your bug report is clear and that it is complete. Otherwise, we may be unable to understand it or to reproduce it, either of which would prevent us from fixing the bug. We strongly recommend the report(bug report or feature request) could include some hints as the following:

BUG REPORT

  1. Please describe the issue you observed: 集群中的其中一个broker(brokerA)与其中的一台namesrv(共两台)节点仅有一次断开连接,导致与brokerA互为主从的从节点一直无法连接brokerA ;并且客户端程序也无法正常发送消息至brokerA ; org.apache.rocketmq.remoting.exception.RemotingTimeoutException: wait response on the channel <rocketmq-namesrv-prod01.cloud.bz/xx.xx.xx.xx:9876> timeout, 6000(ms)
    • What did you do (The steps to reproduce)? 当发现这个问题后,及时拒绝往brokerA写入消息数据(./mqadmin updateBrokerConfig -b xx.xx.xx.xx:10911 -n xx.x.xx.xx:9876 -k brokerPermission -v 4 )
    • What did you expect to see? 我期望当brokerA与namesrv节点状态复活后,不影响从节点同步主节点的数据,不影响客户端继续往该broker发送消息
    • What did you see instead? 实际上:当brokerA与namesrv连接恢复(ms级别)后,brokerA的从节点一直无法和主节点建立有效连接,提示(ERROR BrokerControllerScheduledThread1 - SyncDelayOffset Exception),客户端也无法继续正常发送消息
  2. Please tell us about your environment: 1、broker集群版本(4.0.0-incubating) 2、jdk:1.8.0_131-b11 3、rocketMQ客户端版本:4.4.0
  3. Other information (e.g. detailed explanation, logs, related issues, suggestions how to fix, etc): brokerA主节点的日志信息:
    2020-09-10 16:42:03 WARN BrokerControllerScheduledThread1 - registerBroker Exception, rocketmq-namesrv-prod01.cloud.bz:9876 org.apache.rocketmq.remoting.exception.RemotingTimeoutException: wait response on the channel <ocketmq-namesrv-prod01.cloud.bz/xx.x.xx.xx:9876> timeout, 6000(ms) at org.apache.rocketmq.remoting.netty.NettyRemotingAbstract.invokeSyncImpl(NettyRemotingAbstract.java:292) ~[rocketmq-remoting-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:338) ~[rocketmq-remoting-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.broker.out.BrokerOuterAPI.registerBroker(BrokerOuterAPI.java:167) ~[rocketmq-broker-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.broker.out.BrokerOuterAPI.registerBrokerAll(BrokerOuterAPI.java:116) ~[rocketmq-broker-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.broker.BrokerController.registerBrokerAll(BrokerController.java:674) [rocketmq-broker-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.broker.BrokerController$9.run(BrokerController.java:643) [rocketmq-broker-4.0.0-incubating.jar:4.0.0-incubating] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_131] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_131] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_131] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131] 2020-09-10 16:42:05 INFO BrokerControllerScheduledThread1 - register broker to name server rocketmq-namesrv-prod02.cloud.bz:9876 OK brokerA 从节点的日志信息 2020-09-10 16:42:05 INFO ClientManageThread_131 - subscription changed, group: MQ_nike_receive_consumer_group OLD: SubscriptionData [classFilterMode=false, topic=MQ_LSS_PUSH_NIKE_PRO, subString=, tagsSet=[], codeSet=[], subVersion=1599115705119] NEW: SubscriptionData [classFilterMode=false, topic=MQ_LSS_PUSH_NIKE_PRO, subString=, tagsSet=[], codeSet=[], subVersion=1599115711804] 2020-09-10 16:42:08 ERROR BrokerControllerScheduledThread1 - SyncConsumerOffset Exception, 1xx.xx.xx.xx:20911 org.apache.rocketmq.remoting.exception.RemotingTimeoutException: wait response on the channel timeout, 3000(ms) at org.apache.rocketmq.remoting.netty.NettyRemotingAbstract.invokeSyncImpl(NettyRemotingAbstract.java:292) ~[rocketmq-remoting-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:338) ~[rocketmq-remoting-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.broker.out.BrokerOuterAPI.getAllConsumerOffset(BrokerOuterAPI.java:254) ~[rocketmq-broker-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.broker.slave.SlaveSynchronize.syncConsumerOffset(SlaveSynchronize.java:84) [rocketmq-broker-4.0.0-incubating.jar:4.0.0-incubating] at org.apache.rocketmq.broker.slave.SlaveSynchronize.syncAll(SlaveSynchronize.java:50) [rocketmq-broker-4.0.0-incubating.jar:4.0.0-incubating]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131] 客户端日志信息: org.apache.rocketmq.remoting.exception.RemotingTooMuchRequestException: sendDefaultImpl call timeout at org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl.sendDefaultImpl(DefaultMQProducerImpl.java:634) ~[rocketmq-client-4.4.1.8.jar!/:4.4.1.8] at org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl.send(DefaultMQProducerImpl.java:1279) ~[rocketmq-client-4.4.1.8.jar!/:4.4.1.8] at org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl.send(DefaultMQProducerImpl.java:1225) ~[rocketmq-client-4.4.1.8.jar!/:4.4.1.8] at org.apache.rocketmq.client.producer.DefaultMQProducer.send(DefaultMQProducer.java:283) ~[rocketmq-client-4.4.1.8.jar!/:4.4.1.8] at com.baozun.scm.baseservice.message.rocketmq.service.server.RocketMQProducerServer.sendDataMsgConcurrently(RocketMQProducerServer.java:167) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.ecs.oms4.ofa.manager.receive.ReceiveBaseManager.sendConsumptionFeedback(ReceiveBaseManager.java:264) [classes!/:1.0.0-SNAPSHOT] at com.baozun.ecs.oms4.ofa.manager.receive.refund.impl.ReceiveRefundPaymentManagerImpl.receiveRefundPayment(ReceiveRefundPaymentManagerImpl.java:59) [classes!/:1.0.0-SNAPSHOT] at com.baozun.ecs.oms4.ofa.manager.receive.refund.impl.ReceiveRefundPaymentManagerImpl$$FastClassBySpringCGLIB$$f205e702.invoke() [classes!/:1.0.0-SNAPSHOT] at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) [spring-core-5.1.7.RELEASE.jar!/:5.1.7.RELEASE] at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:684) [spring-aop-5.1.7.RELEASE.jar!/:5.1.7.RELEASE] at com.baozun.ecs.oms4.ofa.manager.receive.refund.impl.ReceiveRefundPaymentManagerImpl$$EnhancerBySpringCGLIB$$33377ac5.receiveRefundPayment() [classes!/:1.0.0-SNAPSHOT] at sun.reflect.GeneratedMethodAccessor579.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_231] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_231] at com.baozun.scm.baseservice.message.rocketmq.service.MsgTranscationManagerImpl.businessProcessNoTransacation(MsgTranscationManagerImpl.java:187) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.scm.baseservice.message.rocketmq.service.MsgTranscationManagerImpl$$FastClassBySpringCGLIB$$6e604685.invoke() [common-message-component-rocketmq-1.7.0.4.jar!/:?] at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) [spring-core-5.1.7.RELEASE.jar!/:5.1.7.RELEASE] at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:684) [spring-aop-5.1.7.RELEASE.jar!/:5.1.7.RELEASE] at com.baozun.scm.baseservice.message.rocketmq.service.MsgTranscationManagerImpl$$EnhancerBySpringCGLIB$$f0d0e32e.businessProcessNoTransacation() [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.scm.baseservice.message.rocketmq.service.handle.MessageHandler.excuteBusinessMsg(MessageHandler.java:303) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.scm.baseservice.message.rocketmq.service.handle.MessageHandler.excuteHandle(MessageHandler.java:268) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.scm.baseservice.message.rocketmq.service.handle.MessageHandler.handle(MessageHandler.java:197) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.scm.baseservice.message.rocketmq.service.init.RocketMQConcurrentlyConsumerInit$1.consumeMessage$original$i6gTCfL5(RocketMQConcurrentlyConsumerInit.java:270) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.scm.baseservice.message.rocketmq.service.init.RocketMQConcurrentlyConsumerInit$1.consumeMessage$original$i6gTCfL5$accessor$50njQtqm(RocketMQConcurrentlyConsumerInit.java) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at com.baozun.scm.baseservice.message.rocketmq.service.init.RocketMQConcurrentlyConsumerInit$1$auxiliary$N718zRwF.call(Unknown Source) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:93) [skywalking-agent.jar:6.4.0] at com.baozun.scm.baseservice.message.rocketmq.service.init.RocketMQConcurrentlyConsumerInit$1.consumeMessage(RocketMQConcurrentlyConsumerInit.java) [common-message-component-rocketmq-1.7.0.4.jar!/:?] at org.apache.rocketmq.client.impl.consumer.ConsumeMessageConcurrentlyService$ConsumeRequest.run(ConsumeMessageConcurrentlyService.java:417) [rocketmq-client-4.4.1.8.jar!/:4.4.1.8] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_231] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_231] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_231] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_231] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]

FEATURE REQUEST

  1. Please describe the feature you are requesting.

  2. Provide any additional detail on your proposed use case for this feature. 期望:当broker与namesrv连接恢复后,不影响客户端正常使用,不影响从节点同步主节点数据。

  3. Indicate the importance of this issue to you (blocker, must-have, should-have, nice-to-have). Are you currently using any workarounds to address this issue? 不知是服务端版本有问题还是什么原因。

  4. If there are some sub-tasks using -[] for each subtask and create a corresponding issue to map to the sub task:

Jaskey commented 4 years ago

先说结论:如果仅仅是broker与nameserver的连接断了,实际上不影响客户端的使用。

  1. 客户端会通过nameserver 的感知能力发现broker下线,很快就不会再向这个broker发送消息了(客户端摘除该broker地址)。
  2. 在摘除之前的确可能还会继续发送这个broker,但是按照题设只是broker网络和nameserver断了,实际上不影响。

从你的日志上看,实际上不仅仅是断连的问题,看起来是网络或者是服务有问题:

org.apache.rocketmq.remoting.exception.RemotingTimeoutException: wait response on the channel <rocketmq-namesrv-prod01.cloud.bz/xx.xx.xx.xx:9876> timeout, 6000(ms)

所以本身需求来说是符合需求的,只是现在不是断连接的问题,更像是一个节点故障的问题。

tank1314 commented 4 years ago

@Jaskey 1、 出现问题的时候,该broker节点网络出口流量出现了网络陡增的情况(内网集群平时网络流量200M左右,出问题时,流量陡增至600M左右),同时该时刻点,该节点的CPU也增加到25%左右; 2、该异常总共就出现过1次,再后续的探测轮询时间点,都恢复了正常。观察该时间点的GC日志,没检测出什么异常来。

Jaskey commented 4 years ago

@tank1314 按照你的信息判断,实际上问题和nameserver断链没有关系,而是你内部机器的问题导致影响了你的写入性能,在这种情况下,有问题是正常的。如果你有多组broker,可以通过熔断来实现暂时屏蔽有问题的broker。但是如果是整体所有集群的broker都有类似问题,这是无法避免的。建议检查下你们部署上是否有考虑各种容灾的情况。

RongtongJin commented 4 years ago

@tank1314 I will close the Issue first, but please feel free to reopen it if you have any other issues.