alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0

io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006633239, max: 1020067840) #10352

Closed little-cui closed 1 year ago

little-cui commented 1 year ago

Describe the bug

A 3-node Nacos cluster (4 CPU / 8 GB per machine) serving roughly 500 microservice processes built on the spring-cloud-alibaba framework. Some instances repeatedly go offline and come back online. Please help confirm.

Client-side log:

Caused by: com.alibaba.nacos.api.exception.NacosException: java.util.concurrent.TimeoutException: Waited 3000 milliseconds (plus 1 milliseconds, 66215 nanoseconds delay) for com.alibaba.nacos.shaded.io.grpc.stub.ClientCalls$GrpcFuture@4145bad8[status=PENDING, info=[GrpcFuture{clientCall={delegate={delegate=ClientCallImpl{method=MethodDescriptor{fullMethodName=Request/request, type=UNARY, idempotent=false, safe=false, sampledToLocalTracing=true, requestMarshaller=com.alibaba.nacos.shaded.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@4b706717, responseMarshaller=com.alibaba.nacos.shaded.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@1b34115f, schemaDescriptor=com.alibaba.nacos.api.grpc.auto.RequestGrpc$RequestMethodDescriptorSupplier@7a712b18}}}}}]]

The Nacos server log naming-server.log repeatedly prints:

2023-04-20 06:36:37,295 ERROR [TASK-FAILED] io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006633239, max: 1020067840)

io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006633239, max: 1020067840)
        at io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:725)
        at io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:680)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:758)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:734)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:245)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.allocate(PoolArena.java:227)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.allocate(PoolArena.java:147)
        at io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:342)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:123)
        at io.grpc.netty.shaded.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
        at io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:227)
        at io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
        at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
        at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:53)
        at io.grpc.internal.ServerCallImpl.sendMessageInternal(ServerCallImpl.java:167)
        at io.grpc.internal.ServerCallImpl.sendMessage(ServerCallImpl.java:149)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:351)
        at com.alibaba.nacos.core.remote.grpc.GrpcConnection.sendRequestNoAck(GrpcConnection.java:62)
        at com.alibaba.nacos.core.remote.grpc.GrpcConnection.sendRequestInner(GrpcConnection.java:94)
        at com.alibaba.nacos.core.remote.grpc.GrpcConnection.asyncRequest(GrpcConnection.java:117)
        at com.alibaba.nacos.core.remote.RpcPushService.pushWithCallback(RpcPushService.java:55)
        at com.alibaba.nacos.naming.push.v2.executor.PushExecutorRpcImpl.doPushWithCallback(PushExecutorRpcImpl.java:50)
        at com.alibaba.nacos.naming.push.v2.executor.PushExecutorDelegate.doPushWithCallback(PushExecutorDelegate.java:52)
        at com.alibaba.nacos.naming.push.v2.task.PushExecuteTask.run(PushExecuteTask.java:68)
        at com.alibaba.nacos.common.task.engine.TaskExecuteWorker$InnerWorker.run(TaskExecuteWorker.java:116)
2023-04-20 06:36:37,506 INFO Client connection 1681972576310_198.19.131.24_3061 disconnect, remove instances and subscribers

2023-04-20 06:36:37,610 INFO Client change for service Service{namespace='public', group='testGroup', name='nacos-provider', ephemeral=true, revision=29330}, 1681972596741_198.19.131.45_3006

2023-04-20 06:36:37,651 INFO Client connection 1681972597645_198.19.131.15_1078 connect

The Nacos server log nacos.log repeatedly prints:

2023-04-20 07:11:31,652 WARN Stream Error

io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:167)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:356)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1000)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:956)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:512)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:518)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.closeStream(Http2ConnectionHandler.java:613)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:408)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:516)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:260)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:160)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:174)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:483)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)


KomachiSion commented 1 year ago

The direct memory is exhausted (used: 1006633239, max: 1020067840). You can try tuning the trim interval for the off-heap memory:

-Dio.grpc.netty.shaded.io.netty.allocation.cacheTrimIntervalMillis=

The default should be 2 minutes.
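As an aside, the same setting can also be applied programmatically, as long as it happens before any Netty classes load; this is equivalent to passing the -D flag on the command line. The property name below is copied verbatim from the suggestion above, so verify it against the shaded Netty version you are actually running. A minimal sketch:

```java
// Minimal sketch: set the trim-interval property suggested above before
// Netty initializes. Equivalent to the -D command-line flag. The property
// name is taken verbatim from the comment above; double-check it against
// your shaded Netty version.
public class TrimIntervalConfig {
    static final String KEY =
            "io.grpc.netty.shaded.io.netty.allocation.cacheTrimIntervalMillis";

    public static void main(String[] args) {
        System.setProperty(KEY, "30000"); // 30 s instead of the ~2 min default
        System.out.println(System.getProperty(KEY)); // prints 30000
    }
}
```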

little-cui commented 1 year ago

The direct memory is exhausted (used: 1006633239, max: 1020067840). You can try tuning the trim interval for the off-heap memory:

-Dio.grpc.netty.shaded.io.netty.allocation.cacheTrimIntervalMillis=

The default should be 2 minutes.

  1. Is there any design material on this "instance change push" mechanism that I could study? I don't understand how gRPC and Netty are used here, or why Netty is not reclaiming the direct memory.
little-cui commented 1 year ago

Reading the code carefully, I found that in the exceptional case where closeBiStream() throws, the Netty connection is never closed, which may cause a direct-memory leak. See com.alibaba.nacos.core.remote.grpc.GrpcConnection#close

2023-04-24 14:51:59,163 WARN [1682347817385_198.19.131.24_2976] connection  close exception  : {}

java.lang.IllegalStateException: call already closed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:508)
        at io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:209)
        at io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:202)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onCompleted(ServerCalls.java:371)
        at com.alibaba.nacos.core.remote.grpc.GrpcConnection.closeBiStream(GrpcConnection.java:143)
        at com.alibaba.nacos.core.remote.grpc.GrpcConnection.close(GrpcConnection.java:131)
        at com.alibaba.nacos.core.remote.ConnectionManager.unregister(ConnectionManager.java:230)
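The suspected failure mode can be modeled with a small illustrative sketch (hypothetical Java, not actual Nacos code): if closeBiStream() throws, a plain sequential close skips releasing the underlying channel, while a try/finally variant always releases it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the suspected leak (names are illustrative, not Nacos code).
class ConnectionSketch {
    final AtomicBoolean streamClosed = new AtomicBoolean(false);
    final AtomicBoolean channelClosed = new AtomicBoolean(false);

    // Simulates gRPC's "call already closed" IllegalStateException when the
    // bi-directional stream is completed twice.
    void closeBiStream() {
        if (!streamClosed.compareAndSet(false, true)) {
            throw new IllegalStateException("call already closed");
        }
    }

    // Buggy shape: an exception from closeBiStream() skips the channel close,
    // leaving direct buffers referenced by the still-live channel.
    void closeBuggy() {
        closeBiStream();
        channelClosed.set(true);
    }

    // Defensive shape: the channel is released even if the stream close throws.
    void closeSafe() {
        try {
            closeBiStream();
        } catch (IllegalStateException ignored) {
            // stream already closed elsewhere; nothing more to do for it
        } finally {
            channelClosed.set(true);
        }
    }
}
```

In the defensive shape, "call already closed" becomes a warning that can be logged and ignored, rather than a path that leaves the channel and its direct buffers alive.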
KomachiSion commented 1 year ago

Reading the code carefully, I found that in the exceptional case where closeBiStream() throws, the Netty connection is never closed, which may cause a direct-memory leak. See com.alibaba.nacos.core.remote.grpc.GrpcConnection#close

2023-04-24 14:51:59,163 WARN [1682347817385_198.19.131.24_2976] connection  close exception  : {}

java.lang.IllegalStateException: call already closed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:508)
        at io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:209)
        at io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:202)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onCompleted(ServerCalls.java:371)
        at com.alibaba.nacos.core.remote.grpc.GrpcConnection.closeBiStream(GrpcConnection.java:143)
        at com.alibaba.nacos.core.remote.grpc.GrpcConnection.close(GrpcConnection.java:131)
        at com.alibaba.nacos.core.remote.ConnectionManager.unregister(ConnectionManager.java:230)

It shouldn't. This error means the connection was closed a second time, i.e. it had already been closed before.

If this really caused a memory leak, it would need to be reported to gRPC for a fix, and then we would upgrade the version.

At least in our environment we have not yet seen a direct-memory overflow caused by this. The key is still to control the off-heap memory trim frequency, the push TPS, and so on.

KomachiSion commented 1 year ago

No further response from the author; I think this is an environment problem.

LongCodeShort commented 1 year ago

Could you tell me how to fix this? I am hitting the same problem!