alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0

Communication error between Nacos client and a node in the cluster #12337

Closed: CCTV-six closed this issue 4 weeks ago

CCTV-six commented 1 month ago

Describe the bug: com.alibaba.nacos.api.exception.NacosException: Server is DOWN now, please try again later!

Expected behavior: The Nacos client should automatically identify and bypass unavailable server nodes, and retry against other healthy nodes.

How to Reproduce: Steps to reproduce the behavior:

  1. Build a Nacos cluster containing multiple nodes.
  2. Use network tools (for example, iptables to drop traffic and tcpdump to observe it) to simulate a network partition that cuts the connections between certain client nodes and some Nacos server nodes.
  3. Start one or more Nacos clients, configure them to connect to the Nacos cluster, and attempt configuration read or update operations.
  4. Observe and record the behavior of clients during network partitioning, especially how they handle connection failures and retry logic.
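
Steps 3 and 4 can be approximated without the Java client. Below is a minimal sketch, assuming the Nacos v1 Open API endpoint `GET /nacos/v1/cs/configs` and the default port 8848. The node addresses, `dataId`, and `group` are placeholders, and `read_config_with_failover` is a hypothetical helper, not part of any Nacos client. It tries each node in turn and records the ones that fail, which is the failover described under expected behavior.

```python
import urllib.error
import urllib.parse
import urllib.request


def http_get(url, timeout=3.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def read_config_with_failover(nodes, data_id, group, fetch=http_get):
    """Try each node in order and return (node, body, failures).

    `nodes` is a list of "host:port" strings (placeholders, not real servers).
    `failures` lists (node, error message) for every node that was skipped.
    """
    query = urllib.parse.urlencode({"dataId": data_id, "group": group})
    failures = []
    for node in nodes:
        url = f"http://{node}/nacos/v1/cs/configs?{query}"
        try:
            return node, fetch(url), failures
        except (urllib.error.URLError, OSError) as exc:
            # Connection refused, timeout, or HTTP error status:
            # record the failure and move on to the next node.
            failures.append((node, str(exc)))
    raise RuntimeError(f"all nodes failed: {failures}")
```

Calling `read_config_with_failover(["10.0.0.1:8848", "10.0.0.2:8848"], "example.properties", "DEFAULT_GROUP")` while one node is partitioned should return the config from the reachable node, with the unreachable one listed in `failures`. If every node fails, the helper raises `RuntimeError`.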

KomachiSion commented 1 month ago

Server is DOWN means your Nacos cluster has one of the following two problems:

  1. The Raft cluster failed to elect a leader, or the state machine on a Nacos node has an error. Check alipay-jraft.log to find the problem and fix it. This problem brings the whole cluster down, so a client retry will hit the same error.
  2. The Distro protocol can't get a snapshot from other nodes. This happens when a node has a network problem. In this case the client retry will connect to other nodes to make the request (unless all of your Nacos nodes have this problem).

From your description, we don't know which version you are using, and there is no key log information. We can only suggest that you:

  1. Check alipay-jraft.log to find out whether the cluster is failing to elect a leader.
  2. Check protocol-distro.log to see whether every node in the cluster is failing to get a snapshot from the other nodes (this would only be a network problem).
  3. Upgrade to the newest version and retry.
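
The first two checks amount to searching two log files for recent problems. Below is a minimal sketch, assuming the logs sit in a single directory (typically `logs` under the Nacos home) and that problem lines carry the usual `ERROR` or `WARN` level tokens. The directory and the matching rule are assumptions, and `scan_log` and `scan_logs` are hypothetical helpers.

```python
from pathlib import Path

# File names taken from the suggestions above.
LOG_FILES = ["alipay-jraft.log", "protocol-distro.log"]
# Assumed level tokens; adjust to match the actual log format.
LEVELS = ("ERROR", "WARN")


def scan_log(path, levels=LEVELS, limit=20):
    """Return the last `limit` lines of `path` that mention one of `levels`."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if any(level in line for level in levels):
                hits.append(line.rstrip("\n"))
    return hits[-limit:]


def scan_logs(log_dir, names=LOG_FILES):
    """Map each log file name to its matching lines; missing files map to None."""
    results = {}
    for name in names:
        path = Path(log_dir) / name
        results[name] = scan_log(path) if path.is_file() else None
    return results
```

Running `scan_logs` against the log directory of each node gives a quick view of whether the Raft log, the Distro log, or both are reporting problems, and whether that holds for one node or for all of them.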

KomachiSion commented 4 weeks ago

No response from the author for a long time, and this seems to be an environment problem.