emqx / emqx

The most scalable open-source MQTT broker for IoT, IIoT, and connected vehicles
https://www.emqx.com/

Unable to subscribe and publish messages #11571

Open whaili opened 9 months ago

whaili commented 9 months ago

What happened?

We have an EMQX cluster running version 5.0.20, deployed in two data centers, each with 3 core nodes + 8 replicant nodes. The cluster had been running stably for more than 100 days. Then, suddenly, one afternoon, subscribers stopped receiving any messages. The problem has persisted until now. We have preserved the incident scene for troubleshooting purposes.

What did you expect to happen?

Work normally.

How can we reproduce it (as minimally and precisely as possible)?

Unknown

Anything else we need to know?

EMQX version

```console
$ ./bin/emqx_ctl broker
sysdescr : EMQX
version  : 5.0.20
datetime : 2023-09-06T15:38:15.255400654+08:00
uptime   : 117 days, 22 hours, 6 minutes, 21 seconds
```

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

$ uname -a
Linux bjzyx-mqtt-repl-1dee78a8b-1.XXX.virtual 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
```

Log files

2023-09-06T13:43:44.055428+08:00 [debug] msg: mqtt_packet_received, mfa: emqx_connection:handle_msg/2, line: 569, peername: 10.48.186.206:11722, clientid: MzEyNDU0OTM3NjU5ODg1NTU5ODc5Nzk0NTQyNDY2MTcwODI, packet: PUBLISH(Q1, R0, D0, Topic=upsmgmt-agent/10.75.80.181, PacketId=18197, Payload=["echo","{\"task_id\":\"b0e0ff59-ebc9-4115-b19c-e2e495b1ce34\",\"version\":\"v0.1.0-35\",\"ip\":\"10.191.56.11\",\"idc\":\"\",\"ts\":\"2023-09-06T13:43:44.000142405+08:00\",\"sub_map\":null,\"push_map\":null,\"callback\":\"\",\"p_callback\":\"10.75.80.181:6886\",\"cache_id\":\"640f220f-9s5d-4ab2-bs01-w64h03580f0f\"}"]), tag: MQTT

2023-09-06T13:43:44.059121+08:00 [debug] msg: publish_to, mfa: emqx_trace:publish/1, line: 74, peername: 10.48.186.206:11722, clientid: MzEyNDU0OTM3NjU5ODg1NTU5ODc5Nzk0NTQyNDY2MTcwODI, topic: upsmgmt-agent/10.75.80.181, payload: ["echo","{\"task_id\":\"b0e0ff59-ebc9-4115-b19c-e2e495b1ce34\",\"version\":\"v0.1.0-35\",\"ip\":\"10.191.56.11\",\"idc\":\"\",\"ts\":\"2023-09-06T13:43:44.000142405+08:00\",\"sub_map\":null,\"push_map\":null,\"callback\":\"\",\"p_callback\":\"10.75.80.181:6886\",\"cache_id\":\"640f220f-9s5d-4ab2-bs01-w64h03580f0f\"}"], tag: PUBLISH

2023-09-06T13:43:44.059662+08:00 [debug] msg: mqtt_packet_sent, mfa: emqx_connection:serialize_and_inc_stats_fun/1, line: 872, peername: 10.48.186.206:11722, clientid: MzEyNDU0OTM3NjU5ODg1NTU5ODc5Nzk0NTQyNDY2MTcwODI, packet: PUBACK(Q0, R0, D0, PacketId=18197, Reason Code=16), tag: MQTT

![img_1693986636573](https://github.com/emqx/emqx/assets/11846803/0a4a4582-9770-40a0-9f31-ddb7222bab1d)
ieQu1 commented 9 months ago

Hello,

It looks like you are experiencing a netsplit. Can you run the `./bin/emqx_ctl cluster status` command on all nodes (core and replicant) and share the result?
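As a quick complementary check from inside any node's remote console, the Erlang node list shows which peers that node currently sees as connected at the distribution level; during a netsplit, some peers will simply be missing from this list. This is plain OTP, not an EMQX-specific API:

```erlang
%% In `emqx remote_console`: list the peers this node is currently
%% connected to over Erlang distribution. Compare across nodes/DCs.
nodes().
```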

whaili commented 9 months ago

more info

The issue started around 2023-08-30T17:5*. All nodes have log entries like the following:

2023-08-30T17:56:02.026600+08:00 [error] Node 'emqx@10.63.118.104' not responding, Removing (timedout) connection

E.g. log on 10.62.248.87:

2023-08-30T17:55:54.514727+08:00 [error] msg: failed_to_kick_session_on_remote_node, mfa: emqx_cm:kick_session/3, line: 547, peername: 10.61.244.134:22286, clientid: edge_prober_820761900, action: discard, error: error, node: 'emqx@10.63.118.104', reason: {badrpc,timeout}

2023-08-30T17:55:58.586858+08:00 [error] ** Node 'emqx@10.63.118.104' not responding **, ** Removing (timedout) connection **

2023-08-30T17:56:00.972330+08:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3, line: 416, message: <<"connection congested: #{buffer => 4096,clientid => <<\"MzEyMzc1MjAxNTgzNTY1NTM1NTMwMzM4MTYxMDM1MTgyMDI\">>,conn_state => connected,connected_at => 1693389360950,high_msgq_watermark => 8192,high_watermark => 1048576,memory => 147848,message_queue_len => 50,peername => <<\"10.48.186.202:56581\">>,pid => <<\"<0.26917.1365>\">>,proto_name => <<\"MQTT\">>,proto_ver => 4,recbuf => 369280,recv_cnt => 20,recv"...>>, name: <<"conn_congestion/MzEyMzc1MjAxNTgzNTY1NTM1NTMwMzM4MTYxMDM1MTgyMDI/xxxx">>

Log on 10.63.118.104:

2023-08-30T17:52:32.380265+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:52:32.380265+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381039+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381023+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:54:32.382148+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:54:32.382124+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T18:00:37.041810+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T18:00:37.770279+08:00 [warning] Mnesia overload: {dump_log,time_threshold}

Then we restarted node 10.63.118.104.

current status

$ /opt/soft/emqx/bin/emqx_ctl cluster status
Cluster status: #{running_nodes =>
                      ['emqx@10.34.6.15','emqx@10.34.6.16','emqx@10.34.6.17',
                       'emqx@10.34.6.18','emqx@10.34.6.20','emqx@10.34.6.21',
                       'emqx@10.34.6.24','emqx@10.34.6.25','emqx@10.34.6.26',
                       'emqx@10.34.6.27','emqx@10.34.6.7',
                       'emqx@10.62.248.109','emqx@10.62.248.110',
                       'emqx@10.62.248.113','emqx@10.62.248.118',
                       'emqx@10.62.248.119','emqx@10.62.248.121',
                       'emqx@10.62.248.127','emqx@10.62.248.87',
                       'emqx@10.63.118.104','emqx@10.63.118.112',
                       'emqx@10.63.118.93'],
                  stopped_nodes => []}

Most of the messages have been dropped with Reason Code 16 (0x10, "No matching subscribers"), i.e. the broker found no route for the published topics.

Incoming and outgoing message rates were supposed to be roughly equal, but they are not (see the attached dashboard screenshot, reliao_img_1694005348365).

ieQu1 commented 9 months ago

I see, thanks for the answer. It looks like the backplane network might have become overloaded. This can happen for a variety of reasons, depending on the pattern of traffic from the clients. Quick question: you mentioned that you have (3 core + 6 replicants) in two datacenters, i.e. 18 nodes in total. However, in the output of the cluster status command I counted 22 nodes. Do you know where these extra nodes come from?

Can you also collect a system load report from a few nodes (1 core and 1 replicant in each DC) using the following instructions:

  1. Log in to the Erlang shell: `emqx remote_console`
  2. Paste the following exact commands, including the full stop at the end of each (see the block after this list): `application:ensure_all_started(system_monitor). rr(system_monitor). timer:sleep(5100). rp(system_monitor:get_proc_top()).`
  3. Copy the output of the last command
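The same step-2 commands, one per line, ready to paste into the remote console:

```erlang
%% Run inside `emqx remote_console`; each command ends with a full stop.
application:ensure_all_started(system_monitor).  % start the system_monitor application
rr(system_monitor).                              % load its record definitions into the shell
timer:sleep(5100).                               % give the monitor ~5 s to collect a sample
rp(system_monitor:get_proc_top()).               % print the per-process load report
```
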
whaili commented 9 months ago
  1. Sorry for the node count; it is (3 core + 8 replicants) * 2.
  2. System load report
ieQu1 commented 9 months ago

Hello,

Sorry for the late answer. These log entries:

2023-08-30T17:52:32.380265+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:52:32.380265+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381039+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381023+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:54:32.382148+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:54:32.382124+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T18:00:37.041810+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T18:00:37.770279+08:00 [warning] Mnesia overload: {dump_log,time_threshold}

are likely unrelated to the problem: I don't see signs of Mnesia being overloaded to the point where it de-syncs. But the log message `[error] ** Node 'emqx@10.63.118.104' not responding **, ** Removing (timedout) connection **` indicates that the network connection between the two nodes is slow or possibly down. This message is printed by the Erlang runtime when the remote peer doesn't respond to heartbeat messages in time.
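For context, this heartbeat behaviour is governed by the Erlang distribution tick time (`net_ticktime`, 60 seconds by default): a peer that stays silent for roughly that long is considered down and its connection is removed. A minimal sketch for inspecting and tuning it from the EMQX remote console, using standard OTP calls (the proper place to configure it permanently depends on your EMQX version, so treat this as illustrative only):

```erlang
%% Current distribution tick time, in seconds (OTP default: 60).
net_kernel:get_net_ticktime().

%% Raising it at runtime makes nodes more tolerant of slow backplane links,
%% at the cost of slower failure detection. Runtime-only, per node.
net_kernel:set_net_ticktime(120).
```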

As for why the incoming and outgoing rates differ, there are two possibilities:

  1. The routing table hasn't had a chance to replicate between the nodes due to a network problem or overload.
  2. Session takeover fails and the clients cannot properly reconnect and resubscribe to the topics. The log message `failed_to_kick_session_on_remote_node` points at this possibility as well.

Hypothesis 1 can be checked by running the following commands in the EMQX shell on one of the nodes (or, for good measure, on two different nodes in the two different DCs):

Nodes = [node() | nodes()].
erpc:call(Nodes, mria_rlog, status, []).
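As a complementary, hedged check of the routing table itself, one could compare the number of route entries each node sees; a node whose count is far off from its peers likely has a stale table. This sketch assumes `emqx_router:topics/0` is exported (which it is in recent 5.x releases):

```erlang
%% Run in `emqx remote_console`: count routing-table entries per node.
%% A node with far fewer entries than its peers likely has a stale table.
Nodes = [node() | nodes()].
[{N, length(erpc:call(N, emqx_router, topics, []))} || N <- Nodes].
```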

The second hypothesis can be substantiated by checking whether any of the nodes contain error messages with the following taglines: `more_than_one_channel_found`, `session_stepdown_request_timeout`, `session_stepdown_request_exception`.

whaili commented 9 months ago
  1. The command `erpc:call(Nodes, mria_rlog, status, [])` reports an exception:

     Erlang/OTP 24 [erts-12.3.2.2] [emqx] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

     Restricted Eshell V12.3.2.2 (abort with ^G)
     v5.0.20(emqx@10.63.118.104)1> Nodes = [node() | nodes()].
     ['emqx@10.63.118.104','emqx@10.34.6.21','emqx@10.34.6.16',
      'emqx@10.62.248.118','emqx@10.34.6.18','emqx@10.62.248.110',
      'emqx@10.34.6.20','emqx@10.34.6.25','emqx@10.62.248.121',
      'emqx@10.34.6.26','emqx@10.34.6.15','emqx@10.62.248.127',
      'emqx@10.62.248.87','emqx@10.63.118.93',
      'emqx@10.62.248.113','emqx@10.34.6.24','emqx@10.63.118.112',
      'emqx@10.34.6.27','emqx@10.62.248.119','emqx@10.62.248.109',
      'emqx@10.34.6.17','emqx@10.34.6.7']
     v5.0.20(emqx@10.63.118.104)2> erpc:call(Nodes, mria_rlog, status, []).
     ** exception error: {erpc,badarg}
          in function  erpc:call/5 (erpc.erl, line 148)
     v5.0.20(emqx@10.63.118.104)3>


  2. Checked all nodes; there were no error messages like `more_than_one_channel_found | session_stepdown_request_timeout | session_stepdown_request_exception`.
ieQu1 commented 9 months ago

Thanks for checking.

RE: The command `erpc:call(Nodes, mria_rlog, status, [])` reports an exception:

Sorry, my bad. The command should be `erpc:multicall(Nodes, mria_rlog, status, []).`
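Putting the corrected sequence together (same commands as above, just listed in order for the remote console):

```erlang
%% In `emqx remote_console`: query mria_rlog status on every node in the cluster.
Nodes = [node() | nodes()].
erpc:multicall(Nodes, mria_rlog, status, []).
```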

whaili commented 9 months ago

Logs from two nodes (in the two different DCs):

  1. DC1 emqx@10.34.6.15.txt
  2. DC2 emqx@10.63.118.104.txt
ieQu1 commented 9 months ago

Thanks. It appears that the nodes are operating normally at this point, but the past netsplit might have affected the state of the routing table. Usually I'd expect this to recover automatically.

A few more questions:

  1. Do the clients subscribe to topics with or without wildcards?
  2. Do all error messages mention the exact same node?

whaili commented 9 months ago

> Usually I'd expect this to recover automatically.

Since the problem occurred, the cluster has been preserved as-is until now. Although the nodes appear to be normal, the long-lived connections that existed before the incident still cannot send or receive messages.

> Do the clients subscribe to topics with or without wildcards?

Without wildcards.

> Do all error messages mention the exact same node?

Yes, only node 'emqx@10.63.118.104'.

Looking forward to your reply. Is there any further information we should collect to troubleshoot the problem? We will only retain the cluster for another 2 days.

ieQu1 commented 9 months ago

Hello,

Thanks for all the information. You can recycle the cluster now. I think I have enough information to try and reproduce the routing table de-sync.

nguyenvanquan7826 commented 1 month ago

Hi @ieQu1, I'm using EMQX 5.3.1 with 3 nodes in a cluster, and recently I also hit a problem with the message: `failed_to_kick_session_on_remote_node`

2024-05-23T16:24:38.287648+07:00 [error] msg: failed_to_kick_session_on_remote_node, mfa: emqx_cm:kick_session/3(503), peername: 17.23.148.238:60165, clientid: App_ios_164832_o95y3l5zhm, action: discard, error: error, node: 'mqtt2@123.301.148.15', reason: {badrpc,timeout}

Can you tell me what happened in my cluster and how to fix this error? Thanks.

ieQu1 commented 1 month ago

Hello @nguyenvanquan7826 ,

Your issue doesn't look related to the original one; please open a new issue. Also, one log message is not enough to start an investigation: it only tells us that a remote procedure call timed out, which could happen for any number of reasons, including a network disturbance. We'll need full logs from all 3 nodes.

nguyenvanquan7826 commented 1 month ago

Hi @ieQu1, I just opened an issue at https://github.com/emqx/emqx/issues/13124. Please help me. Thanks!