After using Nacos to rebuild monthly nodes on the Google Cloud platform, there are occasional instances of continuous service unavailability.

zhangchangb commented 2 weeks ago

We deploy on Google Cloud Platform, where nodes are automatically rebuilt every month. Occasionally, after a rebuild, services become unavailable on a large scale for a long time, and the Nacos clients report the error "no instance in service". Has anyone run into a similar problem?

Basic information:

1. Deployed on Google Cloud Platform; nodes are automatically rebuilt every month.
2. Deployed with Helm; scaling is handled by nacos-peer-finder-plugin.
3. The cluster has three pods.
4. Nacos clients access the cluster via the service name.
5. Local data is not persisted.
6. Data is stored in MySQL.
7. Nacos server version is 2.0.2.

KomachiSion commented 2 weeks ago

The problem is not described clearly: is it the Nacos nodes that are rebuilt automatically, or the application nodes?

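Based on the description so far, though, something is probably wrong with the platform you deployed. For example, the JVM parameters may be set above the pod's memory limit, so the process gets OOM-killed by the system.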
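To make the memory-limit point concrete, here is a minimal Python sketch that checks whether a set of JVM options plausibly fits inside a pod's memory limit. The helper names (`parse_size`, `jvm_fits_in_pod`) and the 25% allowance for non-heap overhead are assumptions for illustration, not values taken from Nacos or Kubernetes. The parser also treats every suffix as a binary multiple, which is a simplification.

```python
import re

# Binary multipliers; an assumption that simplifies both JVM ("2g") and
# Kubernetes ("2Gi") notations to the same scale.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text: str) -> int:
    """Parse sizes such as '2g', '512m', '2Gi' or '1024' into bytes."""
    match = re.fullmatch(r"(\d+)([kmgt]?)(?:i|b|ib)?", text.strip().lower())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def jvm_fits_in_pod(jvm_opts: str, pod_limit: str, overhead_ratio: float = 0.25) -> bool:
    """Return True if heap + metaspace + direct memory, plus a rough
    allowance for threads and native memory, stays within the pod limit.

    overhead_ratio is a hypothetical safety margin, not a measured value.
    """
    heap = metaspace = direct = 0
    for opt in jvm_opts.split():
        if opt.startswith("-Xmx"):
            heap = parse_size(opt[len("-Xmx"):])
        elif opt.startswith("-XX:MaxMetaspaceSize="):
            metaspace = parse_size(opt.split("=", 1)[1])
        elif opt.startswith("-XX:MaxDirectMemorySize="):
            direct = parse_size(opt.split("=", 1)[1])
    needed = (heap + metaspace + direct) * (1 + overhead_ratio)
    return needed <= parse_size(pod_limit)
```

For example, `jvm_fits_in_pod("-Xms2g -Xmx2g -Xmn1g", "2Gi")` returns `False`, because a 2 GiB heap leaves no room for anything else in a 2 GiB pod.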

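Also, automatic rebuilds are usually triggered by a failed liveness check. A liveness failure can mean the application died internally (for example, the system killed the process), or that something like a Full GC occurred. I suggest investigating this yourself.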
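One way to start that investigation is to probe each Nacos pod's health endpoints directly. The following Python sketch assumes the console health paths `/nacos/v1/console/health/liveness` and `/nacos/v1/console/health/readiness`; please verify them against your Nacos version and context path. The function names and the injectable `fetch` parameter are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed health endpoints; confirm them for your Nacos version.
HEALTH_PATHS = {
    "liveness": "/nacos/v1/console/health/liveness",
    "readiness": "/nacos/v1/console/health/readiness",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def probe_node(base_url, fetch=http_status):
    """Map each health check name to True when the endpoint answers 200."""
    base = base_url.rstrip("/")
    return {name: fetch(base + path) == 200 for name, path in HEALTH_PATHS.items()}
```

Calling `probe_node("http://nacos-0.nacos-headless:8848")` (a hypothetical pod address) for each of the three pods shows whether a pod is alive but not ready, which would point at the server failing to finish startup rather than being killed.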

zhangchangb commented 2 weeks ago

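The automatic rebuild is the cloud vendor's policy: every month it rebuilds the nodes of all applications, including Nacos, and then brings the services back up, but there is no guarantee whether the applications or the Nacos middleware get rebuilt first. Not every rebuild leaves the Nacos cluster unavailable, however. When the problem does occur, no transactions can be processed for the rest of that day, and everything returns to normal after a manual restart. The other middleware, such as MySQL and Redis, is provided by the cloud vendor and has no problems.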

KomachiSion commented 1 week ago

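Then you still need to look at what differs between the cloud product's rebuild and your own manual restart; something in the environment must have changed. If only the server nodes are restarted, then as long as the network and the environment are normal, the data will eventually become consistent. Nacos also mainly uses an AP protocol, so it prioritizes availability of access. If the cluster stays unavailable on a large scale for a long time, the cause is most likely a problem in the environment or the configuration.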
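To check whether the server nodes actually converge after a rebuild, one option is to ask each node for its view of a service and compare the results. The Python sketch below assumes the v1 Open API endpoint `/nacos/v1/ns/instance/list`, whose JSON response carries a `hosts` array with `ip`, `port`, and `healthy` fields; the function names and node addresses are hypothetical.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request


def fetch_view(base_url: str, service_name: str, timeout: float = 3.0) -> dict:
    """Fetch one server node's instance list for a service (assumed v1 Open API)."""
    query = urllib.parse.urlencode({"serviceName": service_name})
    url = f"{base_url.rstrip('/')}/nacos/v1/ns/instance/list?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def healthy_instances(response: dict) -> set[tuple[str, int]]:
    """Extract (ip, port) pairs of healthy instances from one response."""
    return {(h["ip"], h["port"]) for h in response.get("hosts", []) if h.get("healthy")}


def find_divergent_nodes(views: dict[str, dict]) -> dict[str, set[tuple[str, int]]]:
    """For each node, list the instances it lacks relative to the union of all nodes.

    Nodes whose view already equals the union are omitted, so an empty
    result means every node reports the same healthy instances.
    """
    per_node = {node: healthy_instances(resp) for node, resp in views.items()}
    union = set().union(*per_node.values())
    return {node: union - seen for node, seen in per_node.items() if union - seen}
```

If one pod keeps returning an empty `hosts` list while the others do not, clients routed to that pod through the service name would see "no instance in service" even though the instances are registered elsewhere in the cluster.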

alibaba / nacos

After using Nacos to rebuild monthly nodes on the Google Cloud platform, there are occasional instances of continuous service unavailability. #12240