whaili opened this issue 1 year ago
Hello,
It looks like you are experiencing a netsplit.
Can you run the ./bin/emqx_ctl cluster status
command on all nodes (core and replicants) and show the result?
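(A minimal side sketch, not part of the original exchange, using only standard OTP calls: from any node's remote console you can ask every node which peers it currently sees; lists that disagree across nodes point to a netsplit.)
%% Each node reports the peers it is currently connected to. A node whose
%% list is missing peers that other nodes can see is cut off from them.
Nodes = [node() | nodes()].
[{N, erpc:call(N, erlang, nodes, [])} || N <- Nodes].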
The issue started at around 2023-08-30T17:5*; all nodes have log entries like the one below:
2023-08-30T17:56:02.026600+08:00 [error] Node 'emqx@10.63.118.104' not responding , Removing (timedout) connection
e.g. logs on 10.62.248.87:
2023-08-30T17:55:54.514727+08:00 [error] msg: failed_to_kick_session_on_remote_node, mfa: emqx_cm:kick_session/3, line: 547, peername: 10.61.244.134:22286, clientid: edge_prober_820761900, action: discard, error: error, node: 'emqx@10.63.118.104', reason: {badrpc,timeout}
2023-08-30T17:55:58.586858+08:00 [error] ** Node 'emqx@10.63.118.104' not responding **, ** Removing (timedout) connection **
2023-08-30T17:56:00.972330+08:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3, line: 416, message: <<"connection congested: #{buffer => 4096,clientid => <<\"MzEyMzc1MjAxNTgzNTY1NTM1NTMwMzM4MTYxMDM1MTgyMDI\">>,conn_state => connected,connected_at => 1693389360950,high_msgq_watermark => 8192,high_watermark => 1048576,memory => 147848,message_queue_len => 50,peername => <<\"10.48.186.202:56581\">>,pid => <<\"<0.26917.1365>\">>,proto_name => <<\"MQTT\">>,proto_ver => 4,recbuf => 369280,recv_cnt => 20,recv"...>>, name: <<"conn_congestion/MzEyMzc1MjAxNTgzNTY1NTM1NTMwMzM4MTYxMDM1MTgyMDI/xxxx">>
Logs on 10.63.118.104:
2023-08-30T17:52:32.380265+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:52:32.380265+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381039+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381023+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:54:32.382148+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:54:32.382124+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T18:00:37.041810+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T18:00:37.770279+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
Then we restarted node 10.63.118.104:
/opt/soft/emqx/bin/emqx_ctl cluster status
Cluster status: #{running_nodes =>
                      ['emqx@10.34.6.15','emqx@10.34.6.16','emqx@10.34.6.17',
                       'emqx@10.34.6.18','emqx@10.34.6.20','emqx@10.34.6.21',
                       'emqx@10.34.6.24','emqx@10.34.6.25','emqx@10.34.6.26',
                       'emqx@10.34.6.27','emqx@10.34.6.7',
                       'emqx@10.62.248.109','emqx@10.62.248.110',
                       'emqx@10.62.248.113','emqx@10.62.248.118',
                       'emqx@10.62.248.119','emqx@10.62.248.121',
                       'emqx@10.62.248.127','emqx@10.62.248.87',
                       'emqx@10.63.118.104','emqx@10.63.118.112',
                       'emqx@10.63.118.93'],
                  stopped_nodes => []}
Incoming and outgoing message rates were supposed to be roughly equal.
I see, thanks for the answer. It looks like the backplane network might have become overloaded. This could happen for a variety of reasons, depending on the pattern of traffic from the clients. Quick question: you mentioned that you have (3 core + 6 replicants) in two datacenters, i.e. 18 nodes in total. However, in the output of the cluster status command I counted 22 nodes. Do you know where these extra nodes come from?
Can you also collect a system load report from a few nodes (1 core and 1 replicant in each DC) using the following instructions:
emqx remote_console
application:ensure_all_started(system_monitor).
rr(system_monitor).
timer:sleep(5100).
rp(system_monitor:get_proc_top()).
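(The same commands with brief annotations; the annotations and the assumption of a roughly 5-second sampling interval are not part of the original exchange.)
application:ensure_all_started(system_monitor). %% start the system_monitor application and its dependencies
rr(system_monitor).                             %% load its record definitions so the shell prints results readably
timer:sleep(5100).                              %% wait slightly longer than one sampling interval (assumed ~5 s)
rp(system_monitor:get_proc_top()).              %% pretty-print the collected per-process "top" snapshot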
core-dc1 core-dc1.txt
repl-dc1 repl-dc1.txt
core-dc2 core-dc2.txt
repl-dc2 repl-dc2.txt
Hello,
Sorry for the late answer. These log entries:
2023-08-30T17:52:32.380265+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:52:32.380265+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381039+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:53:32.381023+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T17:54:32.382148+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T17:54:32.382124+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
2023-08-30T18:00:37.041810+08:00 [warning] Mnesia('emqx@10.63.118.104'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
2023-08-30T18:00:37.770279+08:00 [warning] Mnesia overload: {dump_log,time_threshold}
are likely unrelated to the problem. I don't see signs of Mnesia being overloaded to the point where it de-syncs.
But this log message:
[error] ** Node 'emqx@10.63.118.104' not responding **, ** Removing (timedout) connection **
indicates that the network connection between the two nodes is slow or possibly down. This message is printed by the Erlang runtime when the remote peer doesn't respond to the heartbeat messages in time.
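(Background, not part of the original exchange; the settings below are standard Erlang/OTP kernel parameters rather than anything EMQX-specific.) The heartbeat timeout is governed by the kernel application's net_ticktime parameter, 60 seconds by default, and it can be inspected from the remote console:
%% Effective tick time in seconds; a peer that stays silent for roughly this
%% long (the check allows about 25% tolerance) is declared down and its
%% connection is removed, producing the message above.
net_kernel:get_net_ticktime().
%% The explicitly configured value for the kernel application, if any.
application:get_env(kernel, net_ticktime).
Raising net_ticktime makes the cluster more tolerant of a congested backplane, at the cost of slower detection of genuinely dead nodes.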
To explain why the incoming and outgoing rates are different, there could be two possibilities. The
failed_to_kick_session_on_remote_node
error points at this possibility as well. Hypothesis 1 can be checked by running the following commands in the EMQX shell on one of the nodes (or maybe on two different nodes in the two different DCs, for good measure):
Nodes = [node() | nodes()].
erpc:call(Nodes, mria_rlog, status, []).
The second hypothesis can be substantiated by checking whether any of the nodes contain error messages with the following taglines: more_than_one_channel_found, session_stepdown_request_timeout, or session_stepdown_request_exception.
Erlang/OTP 24 [erts-12.3.2.2] [emqx] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]
Restricted Eshell V12.3.2.2 (abort with ^G)
v5.0.20(emqx@10.63.118.104)1> Nodes = [node() | nodes()].
['emqx@10.63.118.104','emqx@10.34.6.21','emqx@10.34.6.16',
 'emqx@10.62.248.118','emqx@10.34.6.18','emqx@10.62.248.110',
 'emqx@10.34.6.20','emqx@10.34.6.25','emqx@10.62.248.121',
 'emqx@10.34.6.26','emqx@10.34.6.15','emqx@10.62.248.127',
 'emqx@10.62.248.87','emqx@10.63.118.93',
 'emqx@10.62.248.113','emqx@10.34.6.24','emqx@10.63.118.112',
 'emqx@10.34.6.27','emqx@10.62.248.119','emqx@10.62.248.109',
 'emqx@10.34.6.17','emqx@10.34.6.7']
v5.0.20(emqx@10.63.118.104)2> erpc:call(Nodes, mria_rlog, status, []).
** exception error: {erpc,badarg}
     in function erpc:call/5 (erpc.erl, line 148)
v5.0.20(emqx@10.63.118.104)3>
2. Checked all nodes; none contain error messages like more_than_one_channel_found, session_stepdown_request_timeout, or session_stepdown_request_exception.
Thanks for checking.
RE: the command erpc:call(Nodes, mria_rlog, status, []) reporting an exception:
Sorry, my bad; the command should be erpc:multicall(Nodes, mria_rlog, status, []).
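(For context, a minimal sketch of the difference, not part of the original exchange; the node name is taken from this thread and everything else is standard OTP erpc.) erpc:call/4 targets a single node, which is why passing the Nodes list raised {erpc,badarg}, whereas erpc:multicall/4 takes a list of nodes and returns one result tuple per node in the same order:
%% Single-node call: the first argument must be one node name, not a list.
erpc:call('emqx@10.63.118.104', mria_rlog, status, []).
%% Multi-node call: runs on every node in the list; each element of the
%% result is {ok, Value} on success and can be paired back with its node.
Nodes = [node() | nodes()].
Results = erpc:multicall(Nodes, mria_rlog, status, []).
lists:zip(Nodes, Results).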
Logs from two nodes (in the two different DCs):
Thanks. It appears that the nodes are operating normally at this point, but the netsplit in the past might have affected the state of the routing table. Usually I'd expect this to recover automatically.
A few more questions:
[error] ** Node 'emqx@10.63.118.104' not responding **, ** Removing (timedout) connection **
Do all error messages mention the exact same node ('emqx@10.63.118.104'), or are there other nodes as well?

【Usually I'd expect this to recover automatically.】 Since the problem occurred, the system has been preserved until now. Although the nodes appear to be normal, the long-lived connections that existed before the issue began have been unable to send and receive messages.
【Do the clients subscribe to topics with or without wildcards?】 Without wildcards.
【Do all error messages mention the exact same node?】 Yes, only node 'emqx@10.63.118.104'.
Looking forward to your reply. Is there any further information that needs to be collected to troubleshoot the problem? We will only be able to retain the cluster for another 2 days.
Hello,
Thanks for all the information. You can recycle the cluster now. I think I have enough information to try and reproduce the routing table de-sync.
Hi @ieQu1, I'm using EMQX 5.3.1 with 3 nodes in a cluster, and recently I also had a problem with the message failed_to_kick_session_on_remote_node:
2024-05-23T16:24:38.287648+07:00 [error] msg: failed_to_kick_session_on_remote_node, mfa: emqx_cm:kick_session/3(503), peername: 17.23.148.238:60165, clientid: App_ios_164832_o95y3l5zhm , action: discard, error: error , node: 'mqtt2@123.301.148.15', reason: {badrpc,timeout}
Can you tell me what happened in my cluster and how to fix this error? Thanks.
Hello @nguyenvanquan7826 ,
Your issue doesn't look related to the original issue; please open a new one. Also, one log message is not enough to start an investigation. It simply tells us that a remote procedure call timed out, which could happen for any number of reasons, including a network disturbance. We'll need full logs from all 3 nodes.
Hi @ieQu1, I just opened an issue at https://github.com/emqx/emqx/issues/13124. Please help me. Thanks!
What happened?
We have an EMQX cluster, version 5.0.20, deployed in two data centers, each with 3 core nodes + 8 replicant nodes. The cluster had been running stably for more than 100 days. Then suddenly, one afternoon, no subscription messages could be received anymore. The problem has persisted until now; we have preserved the incident scene for troubleshooting.
What did you expect to happen?
Work normally.
How can we reproduce it (as minimally and precisely as possible)?
Unknown
Anything else we need to know?
EMQX version
OS version
Log files