apache / rocketmq-client-cpp

Apache RocketMQ cpp client
https://rocketmq.apache.org/
Apache License 2.0

consumer stuck in rebalance and doesn't consume messages #293

Closed zhangyulu closed 2 years ago

zhangyulu commented 4 years ago
  1. Please describe the issue you observed:

  2. Please tell us about your environment:

    • What is your OS? CentOS Linux 7.6.1810

    • What is your client version? commit hash: b29bf5b69eab187a813a4d7f59534a6375e32d25, committed on 2020-03-23

    • What is your RocketMQ version? V4_3_0

  3. Other information (e.g. detailed explanation, logs, related issues, suggestions on how to fix, etc):

    rocketmq log:

    2020-Mar-28 08:42:42.124297:unlock success of mq:MessageQueue [topic=browser, brokerName=broker-1, queueId=1][unlock:335]
    2020-Mar-28 08:42:42.124426:PullRequest for [MessageQueue [topic=browser, brokerName=broker-1, queueId=2]] has been expired 9111 ms,m_lastPullTimestamp = 1585356033013 ms[isPullRequestExpired:211]
    2020-Mar-28 08:42:42.124486:Drop mq:MessageQueue [topic=browser, brokerName=broker-1, queueId=2] according Pull timeout.[updateRequestTableInRebalance:501]
    2020-Mar-28 08:42:42.124554:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=browser, brokerName=broker-1, queueId=2], its offset is:268305037[updateConsumeOffsetToBroker:310]

The log keeps looping through the entries above. This only happens in a cross-datacenter consumption scenario (ping latency around 40 ms); the network may fluctuate but has never fully disconnected. After each start, the consumer runs normally for a while (hours or days), then at some point suddenly stops consuming new messages, so the queues assigned to that consumer pile up and are never reassigned to other healthy consumers. The timeout settings (try lock timeout & connect timeout) are left at their defaults. We deploy 4-6 consumers; every one of them has hit this situation multiple times, and only restarting the consumer recovers it.

zhangyulu commented 4 years ago

292645_rocketmq-cpp.log This is the actual log file. Consumption stopped around 17:37, and from 17:41 the log above is printed in a loop.

yiitz commented 4 years ago

I hit this problem in production as well, after upgrading from 1.2.3 to 2.2.0.

ShannonDing commented 4 years ago

Is this orderly or normal message consumption? @yiitz Could you share the logs from around the time consumption stopped, so we can check whether a pull request was lost during rebalance?

ShannonDing commented 4 years ago

292645_rocketmq-cpp.log This is the actual log file. Consumption stopped around 17:37, and from 17:41 the log above is printed in a loop.

This attachment is too large to download. If convenient, please extract just the log around the time of the problem.

yiitz commented 4 years ago

Is this orderly or normal message consumption? @yiitz Could you share the logs from around the time consumption stopped, so we can check whether a pull request was lost during rebalance?

It's normal message consumption. The app runs inside Docker, so the logs were lost on restart; it hasn't recurred in the past few days.

zhangyulu commented 4 years ago

292645_rocketmq-cpp.log This is the actual log file. Consumption stopped around 17:37, and from 17:41 the log above is printed in a loop.

This attachment is too large to download. If convenient, please extract just the log around the time of the problem.

292645_rocketmq-cpp.log

Also, I am using orderly messages.

ShannonDing commented 4 years ago

This may be a bug in orderly message consumption and needs further investigation. From your log:

  1. At 17:37:50 a network jitter occurs; the message pull fails and is retried 1 s later:

    [2020-Mar-29 17:37:50.339075](error):pullMessageAsync failed of addr:10.10.10.11:10911, mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10][pullMessageAsync:490]
    [2020-Mar-29 17:37:50.339148](info):Pull MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10] occur exception, restart 1s later.[pullMessageAsync:905]
    [2020-Mar-29 17:37:50.339171](info):Produce Pull request [MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10]] Later and Sleep [1000]ms.[producePullMsgTaskLater:585]

  2. The rebalance also hits a network failure; the consumer group membership cannot be fetched, so the queues currently owned by this consumer are cleaned up:

    [2020-Mar-29 17:39:30.484327](info):getConsumerIdList from broker:10.10.10.11:10911[findConsumerIds:1103]
    [2020-Mar-29 17:39:30.517014](error):could not find clientId from Broker[allocate:67]
    [2020-Mar-29 17:39:30.517047](warning):allocated queue is empty for topic:browser[updateRequestTableInRebalance:479]
    [2020-Mar-29 17:39:30.517070](info):Drop mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=8],because not responsive[updateRequestTableInRebalance:492]
    [2020-Mar-29 17:39:30.517093](info):oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=8], its offset is:473946291[updateConsumeOffsetToBroker:310]
    [2020-Mar-29 17:39:30.549895](info):unlock success of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=8][unlock:335]
    [2020-Mar-29 17:39:30.549936](info):Drop mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=9],because not responsive[updateRequestTableInRebalance:492]
    [2020-Mar-29 17:39:30.549967](info):oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=9], its offset is:465195147[updateConsumeOffsetToBroker:310]
    [2020-Mar-29 17:39:30.582857](info):unlock success of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=9][unlock:335]
    [2020-Mar-29 17:39:30.582972](info):Drop mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10],because not responsive[updateRequestTableInRebalance:492]
    [2020-Mar-29 17:39:30.582999](info):oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10], its offset is:469681156[updateConsumeOffsetToBroker:310]
    [2020-Mar-29 17:39:30.615831](info):unlock success of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10][unlock:335]

  3. Afterwards the client keeps trying to resume consumption, but the pull requests are persistently misjudged as timed out:

    [2020-Mar-29 17:39:49.729030](info):CONSUME_FROM_LAST_OFFSET, lastOffset of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10] is:469683314[computePullFromWhere:567]
    [2020-Mar-29 17:39:49.729068](info):Add mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10], request initial offset:469683314[updateRequestTableInRebalance:527]
    [2020-Mar-29 17:39:49.896604](info):Start to pull MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10], offset:469683314, GroupName browser_2_sub_6[updateRequestTableInRebalance:538]
    MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10]
    [2020-Mar-29 17:39:49.930088](info):LockBatchResponseBody MQ:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10][Decode:89]
    [2020-Mar-29 17:39:49.930453](info):lock success of mq:MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10][lock:425]
    [2020-Mar-29 17:39:50.031055](info):Pull request for queue[MessageQueue [topic=browser, brokerName=broker-CuhDxsHLYZA-1, queueId=10]] has been set as dropped. Will NOT pull this queue any more[onSuccess:57]
ShannonDing commented 4 years ago

@zhangyulu @yiitz Could you please review PR #293? If the issue still occurs, please feel free to reopen this issue.

zhangyulu commented 3 years ago

This has reproduced in production several more times. The version in use is based on commit b29bf5b69eab187a813a4d7f59534a6375e32d25 with the changes from PR #300 merged in manually. Log: 389_rocketmq-cpp.log. At 03:59:03 a "wait response timeout of heartbeat" appears. For the next several hours the client keeps printing "oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=topic_xxx, brokerName=broker-Cluster-kseILdRbPSF-0, queueId=0], its offset is:5255956". The offset never changes, and the consumer can no longer pull any messages; only restarting the consumer process recovers it.

ifplusor commented 3 years ago

version: commit b29bf5b69eab187a813a4d7f59534a6375e32d25 + patch #300
consume model: push-orderly

ifplusor commented 3 years ago

@zhangyulu The consumption management on the master branch is a bit messy. Could you share the source tree you are currently building from? Alternatively, you could try switching to the re_dev branch.

zhangyulu commented 3 years ago

@zhangyulu The consumption management on the master branch is a bit messy. Could you share the source tree you are currently building from? Alternatively, you could try switching to the re_dev branch.

@ifplusor rocketmq-client-cpp-debug.tar.gz

zhangyulu commented 3 years ago

@ShannonDing Could you help take a look at this issue?

ifplusor commented 3 years ago

@zhangyulu I looked into this earlier, but could not work out the cause from the existing logs. I suggest trying a build from the re_dev branch to see whether it helps.

zhangyulu commented 3 years ago

How stable is the re_dev branch? Also, I tried it yesterday; some APIs are incompatible (e.g. setSessionCredentials), so code changes are required.

ifplusor commented 3 years ago

How stable is the re_dev branch? Also, I tried it yesterday; some APIs are incompatible (e.g. setSessionCredentials), so code changes are required.

Much more stable than master. The re_dev API was redesigned, hence the incompatibility. A backward-compatibility wrapper layer may be added later; the community is evaluating it.

zhangyulu commented 3 years ago

I just tried consuming a topic with the OrderlyConsumer example built from the re_dev branch; it never pulls any messages, whereas the PushConsumer example consumes normally. 8348_rocketmq-cpp.log

ifplusor commented 3 years ago

@zhangyulu The RebalanceImpl::lock() method had a bug; I have fixed it.

zhangyulu commented 3 years ago

After merging the change it still cannot consume. tmp.log

ifplusor commented 3 years ago

@zhouxinyu Run the command below to dump the thread stacks; PID is your process id.

gdb -ex "thread apply all bt" -batch -p $PID

zhangyulu commented 3 years ago

stack.log

ifplusor commented 3 years ago

@zhangyulu I ran some tests but could not reproduce your problem. In your logs, rebalance, mq locking, and offset reporting all look normal. Please first check your parameter settings and the backlog of the consumer group, and add some logging inside DefaultMQPushConsumerImpl::executePullRequestImmediately() to verify that message pulls actually happen.

YellowOldOdd commented 3 years ago

Hi, I ran into exactly the same problem on version 2.2.0: when the number of consumers changes, some queues of the PushConsumer get stuck during rebalance. @ifplusor @ShannonDing

YellowOldOdd commented 3 years ago

Here is the log from when one queue got stuck:

2021-Jun-30 11:18:00.194443:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971080[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:05.192152:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971082[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:10.194912:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971084[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:15.192283:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971087[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:20.194508:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971087[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:25.192197:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971088[updateConsumeOffsetToBroker:310]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
2021-Jun-30 11:18:30.194282:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971090[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:30.891084:Drop mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3],because not responsive[updateRequestTableInRebalance:493]
2021-Jun-30 11:18:30.891111:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:30.932469:CONSUME_FROM_LAST_OFFSET, lastOffset of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3] is:7971091[computePullFromWhere:568]
2021-Jun-30 11:18:30.932487:Add mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], request initial offset:7971091[updateRequestTableInRebalance:528]
2021-Jun-30 11:18:30.932550:Start to pull MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], offset:7971091, GroupName note-change-media-v2-consumer[updateRequestTableInRebalance:539]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
2021-Jun-30 11:18:30.939869:Drop mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3],because not responsive[updateRequestTableInRebalance:493]
2021-Jun-30 11:18:30.939894:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:31.384323:CONSUME_FROM_LAST_OFFSET, lastOffset of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3] is:7971091[computePullFromWhere:568]
2021-Jun-30 11:18:31.384341:Add mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], request initial offset:7971091[updateRequestTableInRebalance:528]
2021-Jun-30 11:18:31.384408:Start to pull MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], offset:7971091, GroupName note-change-media-v2-consumer[updateRequestTableInRebalance:539]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
2021-Jun-30 11:18:31.401570:Drop mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3],because not responsive[updateRequestTableInRebalance:493]
2021-Jun-30 11:18:31.401596:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:31.422630:CONSUME_FROM_LAST_OFFSET, lastOffset of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3] is:7971091[computePullFromWhere:568]
2021-Jun-30 11:18:31.422653:Add mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], request initial offset:7971091[updateRequestTableInRebalance:528]
2021-Jun-30 11:18:31.422711:Start to pull MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], offset:7971091, GroupName note-change-media-v2-consumer[updateRequestTableInRebalance:539]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
2021-Jun-30 11:18:31.561148:Pull request for queue[MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]] has been set as dropped. Will NOT pull this queue any more[onSuccess:58]
2021-Jun-30 11:18:31.561140:Pull request for queue[MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]] has been set as dropped. Will NOT pull this queue any more[onSuccess:58]
2021-Jun-30 11:18:31.561183:Pull request for queue[MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]] has been set as dropped. Will NOT pull this queue any more[onSuccess:58]
2021-Jun-30 11:18:31.561214:Pull request for queue[MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]] has been set as dropped. Will NOT pull this queue any more[onSuccess:58]
2021-Jun-30 11:18:35.191912:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]
2021-Jun-30 11:18:40.194706:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:45.192151:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:50.194279:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]
2021-Jun-30 11:18:55.192147:oneway updateConsumeOffsetToBroker of mq:MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3], its offset is:7971091[updateConsumeOffsetToBroker:310]

YellowOldOdd commented 3 years ago

It reproduces fairly reliably. Is there a recommended workaround?

YellowOldOdd commented 3 years ago

And for the next several hours it kept reporting the timeout error

2021-Jun-30 14:31:20.195386:PullRequest for [MessageQueue [topic=note-change, brokerName=cluster02-broker10, queueId=3]] has been expired 9987 ms,m_lastPullTimestamp = 1625034550208 ms[isPullRequestExpired:211]

and then gradually recovered.

YellowOldOdd commented 3 years ago

And then, as described at the top of this issue: "every consumer has hit this situation multiple times, and only restarting the consumer recovers it."

gaoyf commented 2 years ago

Same here, with identical logs. @YellowOldOdd @ifplusor @ShannonDing

ifplusor commented 2 years ago

Hi, I ran into exactly the same problem on version 2.2.0: when the number of consumers changes, some queues of the PushConsumer get stuck during rebalance.

@ifplusor @ShannonDing

Which consume model are you using?

ifplusor commented 2 years ago

Same here, with identical logs. @YellowOldOdd @ifplusor @ShannonDing

Which consume model are you using?

gaoyf commented 2 years ago

Same here, with identical logs. @YellowOldOdd @ifplusor @ShannonDing

Which consume model are you using?

Plain PushConsumer. I have 4 brokers with 8 queues each and started 30 consumers; the problem reproduces every time, exactly the same as @YellowOldOdd's. I see master has already fixed it in commit 5629226997fea8bc6ac91e9481d2140d983106a7, which should be this very problem. Could you release a 2.2.1? This issue is really severe.

ifplusor commented 2 years ago

@ShannonDing

YellowOldOdd commented 2 years ago

Hi @gaoyf, after applying the change from commit 5629226997fea8bc6ac91e9481d2140d983106a7, did it resolve the problem?

Jerlensu commented 2 years ago

Has this problem been resolved?

ShannonDing commented 2 years ago

see #300

tqqq commented 2 years ago

We ran into this problem too, and finally resolved it by rebuilding from the master branch (we also built the re-dev branch, but it still had the problem because it lacks the ISSUE293 commit).

beibei198918 commented 1 year ago

Commit 5629226997fea8bc6ac91e9481d2140d983106a7 fixes the queue-blocking problem, but it causes consumers to consume messages repeatedly (duplicate consumption).

StruggleLinzi commented 1 year ago

Same here, with identical logs. @YellowOldOdd @ifplusor @ShannonDing

Which consume model are you using?

Plain PushConsumer. I have 4 brokers with 8 queues each and started 30 consumers; the problem reproduces every time, exactly the same as @YellowOldOdd's. I see master has already fixed it in commit 5629226997fea8bc6ac91e9481d2140d983106a7, which should be this very problem. Could you release a 2.2.1? This issue is really severe.

@ShannonDing Could you release a 2.2.1?