apache / rocketmq

Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
https://rocketmq.apache.org/
Apache License 2.0
21.19k stars 11.67k forks source link

[Bug] 这个是rocketmq 4.7.1的重大设计缺陷吗? broker的master挂掉,client就不可用了 #6977

Closed hust419 closed 1 year ago

hust419 commented 1 year ago

Before Creating the Bug Report

Runtime platform environment

Linux CentOS7

RocketMQ version

4.7.1

JDK Version

No response

Describe the Bug

代码片段如下:这段代码来自https://github.com/apache/rocketmq-flink/的早期版本 this.executor.execute( () -> { RetryUtil.call( () -> { while (runningChecker.isRunning()) { try { long offset = getMessageQueueOffset(mq); PullResult pullResult = consumer.pullBlockIfNotFound( mq, tag, offset, pullBatchSize);

可以看到这个地方调用了getMessageQueueOffset,最终调用了这个方法

MQClientInstance.java

public String findBrokerAddressInPublish(final String brokerName) {
    HashMap<Long/* brokerId */, String/* address */> map = this.brokerAddrTable.get(brokerName);
    if (map != null && !map.isEmpty()) {
        return map.get(MixAll.MASTER_ID);
    }

    return null;
}

这个方法里面有个重大缺陷,他只会去master节点获取信息,如果master节点宕机不存在了,他就会返回null,整个程序也就不可用了

Steps to Reproduce

部署结构是1主1从,下线broker-master节点,观察消费情况发现无法正常消费

What Did You Expect to See?

下线broker-master节点后,应该从broker-slave节点继续消费,拉取信息,而不是挂掉,不影响消费

What Did You See Instead?

下线broker节点之后,程序发生异常,无法拉取到最新offset,因此无法继续消费

Additional Context

No response

lizhimins commented 1 year ago

从描述看像是客户端逻辑的问题,没有选择备节点。不过基于低版本的 flink connector 的线程模型和实现存在诸多问题,我已经基于 FLIP-27/191 重构了 connector 的实现,这几天应该会提交到社区,欢迎 review 和指点。

From the description, it seems like a client-side logic issue where no backup broker was selected. However, there are many issues with the thread model and implementation of the low version Flink connector. Therefore, I have refactored the connector's implementation based on FLIP-27/191 and will be submitting it to the community in the next few days. I welcome reviews and suggestions.