alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0

Nacos 2.1.0 cluster: after Dubbo-registered instances exceed 41K, individual nodes frequently show inconsistent instance counts; suspected off-heap (direct) memory leak #12792

Open · goto456 opened 4 weeks ago

goto456 commented 4 weeks ago

Environment

- Nacos version: 2.1.0, deployed as an 11-node cluster; each node has 16 cores / 32 GB
- Registered instances: 41K (Dubbo-registered instances)
- Memory settings: -Xms12g -Xmx12g -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=320m -XX:MaxDirectMemorySize=12g

Problem description

Individual nodes in the cluster end up with inconsistent instance counts (20K~30K fewer than the other nodes). Restarting the affected node restores consistency, but after a few days of running the mismatch shows up again. As shown below:

[screenshot: per-node instance counts, with the affected node 20K~30K lower than its peers]
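For comparison, the per-node instance counts can be pulled from the naming metrics endpoint and diffed. A minimal sketch only, assuming the v1 operator metrics endpoint (/nacos/v1/ns/operator/metrics) is reachable without an access token; the node addresses below are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: dump the metrics JSON (including instanceCount) reported by each
// cluster node, so a node that has drifted stands out. Addresses are placeholders.
public class InstanceCountCheck {

    private static final String[] NODES = {
            "10.0.0.1:8848", "10.0.0.2:8848", "10.0.0.3:8848" // ... all 11 nodes
    };

    public static void main(String[] args) throws Exception {
        for (String node : NODES) {
            URL url = new URL("http://" + node + "/nacos/v1/ns/operator/metrics");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(3000);
            conn.setReadTimeout(3000);
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            System.out.println(node + " -> " + body);
        }
    }
}
```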

The protocol-distro.log on the affected node shows the following exception:

2024-10-17 05:06:55,492 WARN [DISTRO] Sync data change failed.
io.grpc.StatusRuntimeException: UNKNOWN
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:497)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:521)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:641)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:529)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:703)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:692)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 12868124951, max: 12884901888)
        at io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:725)
        at io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:680)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:758)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:734)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:245)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.allocate(PoolArena.java:227)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.reallocate(PoolArena.java:397)
        at io.grpc.netty.shaded.io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:119)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:310)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:281)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1118)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1111)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1102)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:96)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:281)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:483)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 common frames omitted

According to the host memory monitoring, available memory keeps decreasing:

[screenshot: host memory monitoring showing available memory steadily decreasing]
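To narrow down where the off-heap memory goes, both the JVM-level direct buffer pool and Netty's own counter can be logged from inside the process. A minimal sketch only: PlatformDependent is an internal Netty class (here the copy shaded into grpc-netty-shaded), and the OutOfDirectMemoryError above is accounted against Netty's own counter rather than the JMX "direct" pool.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Internal Netty class as shaded into grpc-netty-shaded, which the Nacos server bundles.
import io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent;

public class DirectMemoryProbe {

    public static void main(String[] args) {
        // JVM-level view: NIO direct buffers allocated via ByteBuffer.allocateDirect.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("JMX direct pool: used=%d bytes, capacity=%d bytes%n",
                        pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
        // Netty-level view: memory reserved without a Cleaner, which is what the
        // OutOfDirectMemoryError in the log above is counting against.
        System.out.printf("Netty direct memory: used=%d bytes, max=%d bytes%n",
                PlatformDependent.usedDirectMemory(), PlatformDependent.maxDirectMemory());
    }
}
```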

What we have tried:

  1. Following the configuration recommended in the Nacos FAQ (https://nacos.io/blog/faq/nacos-user-question-history12682/), we set the JVM heap (-Xmx) to 22G and MaxDirectMemorySize to 7G. The problem still occurs, and the same exception still shows up in the logs.

  2. We found similar issues (#10352, #10854) and tuned how often the off-heap caches are trimmed with -Dio.grpc.netty.shaded.io.netty.allocation.cacheTrimIntervalMillis=60000 (from the default 2 minutes down to 1 minute). The problem still occurs afterwards, and the same exception still shows up in the logs (see the allocator-metrics sketch after this list).
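As a follow-up to item 2, the pooled allocator's metric API shows how much direct memory the arenas and thread-local caches are actually holding, which helps tell whether the cache-trim tuning has any effect. Again only a sketch using internal shaded-Netty classes; gRPC may configure its own allocator instance, so treat the numbers as a rough indicator:

```java
import io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocator;
import io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocatorMetric;

public class PooledAllocatorProbe {

    public static void main(String[] args) {
        // Metrics of the default pooled allocator in the shaded Netty.
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        System.out.printf("direct arenas=%d, thread-local caches=%d%n",
                metric.numDirectArenas(), metric.numThreadLocalCaches());
        System.out.printf("used direct=%d bytes, used heap=%d bytes%n",
                metric.usedDirectMemory(), metric.usedHeapMemory());
    }
}
```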

Other exception logs

The nacos.log on the affected node contains a large number of entries like the following:

WARN Exception in onHeadersRead()
java.util.concurrent.RejectedExecutionException: Task io.grpc.internal.SerializingExecutor@4a6ee6d0 rejected from java.util.concurrent.ThreadPoolExecutor@1ebb611[Running, pool size = 256, active threads = 0, queued tasks = 16369, completed tasks = 13013149647]
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at io.grpc.internal.SerializingExecutor.schedule(SerializingExecutor.java:93)
        at io.grpc.internal.SerializingExecutor.execute(SerializingExecutor.java:86)
        at io.grpc.internal.ServerImpl$ServerTransportListenerImpl.streamCreatedInternal(ServerImpl.java:588)
        at io.grpc.internal.ServerImpl$ServerTransportListenerImpl.streamCreated(ServerImpl.java:476)
        at io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler.onHeadersRead(NettyServerHandler.java:450)
        at io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler.access$900(NettyServerHandler.java:101)
        at io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler$FrameListener.onHeadersRead(NettyServerHandler.java:807)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onHeadersRead(DefaultHttp2ConnectionDecoder.java:373)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onHeadersRead(Http2InboundFrameLogger.java:65)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader$1.processFragment(DefaultHttp2FrameReader.java:457)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readHeadersFrame(DefaultHttp2FrameReader.java:464)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:254)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:160)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:174)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:483)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)
KomachiSion commented 3 weeks ago

You could upgrade to a newer version and retry. Newer versions optimized the data push path: when a connection is detected to have push problems, the server no longer keeps pushing to it; it first closes the connection to release resources, and resumes pushing after the client reconnects.

kabiyangyang commented 3 weeks ago

Could you explain the concrete root cause in more detail? I'm running into exactly the same problem. @KomachiSion

woshioosm commented 3 weeks ago

@KomachiSion Regarding "You could upgrade to a newer version and retry. Newer versions optimized the data push path: when a connection is detected to have push problems, the server no longer keeps pushing to it; it first closes the connection to release resources, and resumes pushing after the client reconnects." We're hitting the same thing. Is there a specific issue or related commit for the fix that you could share?

KomachiSion commented 3 weeks ago

The details are fairly involved. In short, gRPC uses off-heap (direct) memory through both a global cache and per-thread caches. If a client stops acknowledging packets (TCP ACKs), the NIO send window stays at 0, so the data gRPC sends keeps sitting in direct memory waiting for the TCP connection to become usable again. That direct memory is only released once TCP recovers and the request actually goes out, or once the TCP connection is closed.

When a connection goes bad, for example the client is half-dead or there is a network fault, earlier requests have been sent but their TCP ACKs never come back, so all subsequent requests pile up in the TCP and direct-memory queues. The thread cache associated with that connection fills up, and allocation then starts drawing from the global cache.

The resulting symptom is that direct memory keeps growing until it is all used up and cannot be released, and new requests then fail with a direct-memory OOM.
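To illustrate the mechanism (this is not Nacos source code): with gRPC's streaming API, anything handed to onNext() while the transport is not writable gets buffered, and on the Netty transport that buffer lives in direct memory. The usual guard is ServerCallStreamObserver.isReady() together with setOnReadyHandler(); a minimal sketch:

```java
import java.util.Queue;

import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

// Illustration only: drain queued push messages to a client, but only while the
// transport reports it can accept more data. Writing regardless of isReady() is
// what lets unacked data pile up off-heap when the client's TCP window is 0.
public final class ReadyAwarePusher<T> {

    private final ServerCallStreamObserver<T> call;
    private final Queue<T> pending;

    public ReadyAwarePusher(StreamObserver<T> responseObserver, Queue<T> pending) {
        this.call = (ServerCallStreamObserver<T>) responseObserver;
        this.pending = pending;
        // gRPC invokes this callback whenever the transport becomes writable again.
        this.call.setOnReadyHandler(this::drain);
    }

    public void drain() {
        T next;
        while (call.isReady() && (next = pending.poll()) != null) {
            call.onNext(next); // transport has capacity, so nothing queues up off-heap
        }
        // If isReady() turns false we simply stop; the onReady handler resumes
        // draining once the client's TCP window opens again.
    }
}
```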

In 2.3, or in one of the last few 2.2 releases, Nacos made a small optimization in GrpcClient: if a connection stays stuck with a zero TCP window, it proactively closes the connection and resumes the data push after reconnecting.

I don't remember which PR it was; I only know the latest version should have addressed this. You can give it a try.
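A rough sketch of the behavior described above (not the actual Nacos patch; the class name and threshold are made up for illustration): if a push stream stays un-writable past some threshold, fail the call so the connection is torn down and its buffered direct memory can be reclaimed, then let the client reconnect and resume.

```java
import java.util.concurrent.TimeUnit;

import io.grpc.Status;
import io.grpc.stub.ServerCallStreamObserver;

// Illustration only: close a connection that has been un-writable for too long
// instead of letting more push messages accumulate in direct memory.
public final class BusyConnectionGuard<T> {

    private static final long MAX_NOT_READY_MILLIS = TimeUnit.SECONDS.toMillis(30); // arbitrary

    private final ServerCallStreamObserver<T> call;
    private volatile long notReadySince = -1;

    public BusyConnectionGuard(ServerCallStreamObserver<T> call) {
        this.call = call;
    }

    /** Invoked by the push scheduler before each send attempt. */
    public boolean tryPush(T message) {
        if (call.isReady()) {
            notReadySince = -1;
            call.onNext(message);
            return true;
        }
        long now = System.currentTimeMillis();
        if (notReadySince < 0) {
            notReadySince = now;
        } else if (now - notReadySince > MAX_NOT_READY_MILLIS) {
            // The connection looks stuck (zero TCP window / half-dead client):
            // close it so the buffered direct memory can be released.
            call.onError(Status.UNAVAILABLE
                    .withDescription("push stream busy for too long, closing")
                    .asRuntimeException());
        }
        return false;
    }
}
```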

KomachiSion commented 3 weeks ago

The problem above is most likely because Dubbo subscribes to a very large number of providers, and some consumer clients run into issues such as GC pauses, so they cannot process the data Nacos pushes and never return the TCP ACKs. That triggers the situation described above, which in turn affects other connections and the data synchronization.

You can try analyzing naming-push.log to see whether a few consumer IPs are constantly in a push-failed (push-timeout) state. Restarting just those consumers also releases that part of the memory.
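For that log analysis, a crude per-IP tally can surface the worst offenders. The sketch below does not assume any particular naming-push.log layout beyond failure lines containing the word "fail" and the consumer's IPv4 address, so the filter will likely need adjusting to the real log format:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough heuristic: count per-IP occurrences on lines that look like push failures.
// The default path and the "fail" keyword are assumptions; adapt them to the real log.
public class PushFailureTally {

    private static final Pattern IPV4 = Pattern.compile("\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");

    public static void main(String[] args) throws IOException {
        String logPath = args.length > 0 ? args[0] : "naming-push.log";
        Map<String, Integer> failuresByIp = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(logPath))) {
            if (!line.toLowerCase().contains("fail")) {
                continue; // keep only lines that look like failed / timed-out pushes
            }
            Matcher m = IPV4.matcher(line);
            while (m.find()) {
                failuresByIp.merge(m.group(), 1, Integer::sum);
            }
        }
        failuresByIp.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(20)
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```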

luky116 commented 2 weeks ago

> GrpcClient

We're running into a similar problem too, on Nacos 2.1.2. Could you tell me which class the GrpcClient optimization you mentioned lives in? I'd like to look at the change to understand the root cause. Thanks!

KomachiSion commented 2 weeks ago

See GrpcConnection and ConnectionBusyException.