aws / aws-sdk-java-v2

The official AWS SDK for Java - Version 2
Apache License 2.0

Netty MemoryRegionCache$Entry Objects Increasing Direct Memory Consumption #1301

Closed trevorhodges closed 5 years ago

trevorhodges commented 5 years ago

Expected Behavior

MemoryRegionCache$Entry objects should be garbage collected along with the child nioBuffer DirectByteBuffer objects, which should deallocate the corresponding direct memory.

Current Behavior

My application uses the AWS SDK to integrate with SQS and S3. Running in Kubernetes, it suffers from a memory leak whose symptom is the container being killed with OOMKilled. I'm running the JVM with -Xms and -Xmx set to 768M and giving the container a 1GB limit. Since the container is killed with OOMKilled rather than the application receiving an OutOfMemoryError from the JVM, I believe the growth in memory usage is off heap.
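One way to confirm that the growth is in direct buffers rather than on the heap is to poll the JDK's BufferPool MXBeans; a minimal sketch (the class name and polling interval are illustrative):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectMemoryProbe {
    public static void main(String[] args) throws InterruptedException {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        while (true) {
            for (BufferPoolMXBean pool : pools) {
                // "direct" covers DirectByteBuffer allocations; "mapped" covers memory-mapped files
                System.out.printf("%s: count=%d, used=%d bytes, capacity=%d bytes%n",
                        pool.getName(), pool.getCount(),
                        pool.getMemoryUsed(), pool.getTotalCapacity());
            }
            Thread.sleep(60_000);
        }
    }
}
```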

Steps to Reproduce (for bugs)

The application is large and I don't have a pared-down example of this occurring yet, but I could potentially attempt to create one if needed.

Context

I've been investigating with both VisualVM and YourKit and have found that about every 5 minutes a new MemoryRegionCache$Entry object is created that contains a member nioBuffer of type DirectByteBuffer allocating 16MB off heap. I notice that these MemoryRegionCache$Entry objects are not being collected by the GC; if they were, I believe the DirectByteBuffer and the corresponding direct memory would be cleaned up / deallocated (based on this SO answer). So why are they not being GC'd? I don't believe anything in my application requires this data to exist after 5 minutes, and I also don't believe my application has any way to keep these objects alive, since they are deeply internal to the SDK and Netty rather than part of my interaction with the SQS / S3 clients.

Here's a screenshot showing the objects that exist after about 30 minutes of running my application:

[screenshot]

And the corresponding DirectByteBuffer objects:

[screenshot]

Here's the path from the GC root for one of the MemoryRegionCache$Entry objects:

[screenshot]

I'm diving deeper into the AWS implementation of the Netty client to try to better understand the issue, but for now I have limited knowledge. Any ideas based on the information I've provided, or is there anything missing that would help troubleshoot?

Your Environment

SDK Version: aws-sdk-java 2.5.34
JVM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11, mixed mode)
Java: version 1.8.0_171, vendor Oracle Corporation
OS: OSX (local) / Alpine Linux (Kubernetes hosted)

zoewangg commented 5 years ago

Are you creating a new client per request or using a single client throughout your application?

Could you check if there are error logs, something like ERROR io.netty.util.ResourceLeakDetector - LEAK:, in your application?

What is your request load? Is it high TPS?
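For reference, a minimal sketch of the single-client pattern the first question refers to, using SqsAsyncClient (the class name, region, and queue details are illustrative):

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sqs.SqsAsyncClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class SqsSender {
    // Build the client once and reuse it; creating a client per request can
    // retain event-loop threads and pooled buffers unless each client is closed.
    private static final SqsAsyncClient SQS = SqsAsyncClient.builder()
            .region(Region.US_EAST_1)
            .build();

    public static void send(String queueUrl, String body) {
        SQS.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody(body)
                .build());
    }
}
```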

trevorhodges commented 5 years ago

I'm using a single client throughout my application.

I haven't seen any instances of ResourceLeakDetector ERRORs. My understanding is that these are reported when the corresponding ByteBuf has been garbage collected but its off-heap memory allocation still exists? It doesn't seem like this is the case for me. I should also note that I'm running with -Dio.netty.leakDetection.level=ADVANCED.
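A minimal sketch of the programmatic equivalent of that flag, using Netty's ResourceLeakDetector API (the class name is illustrative):

```java
import io.netty.util.ResourceLeakDetector;

public class LeakDetectionConfig {
    // Programmatic equivalent of -Dio.netty.leakDetection.level=ADVANCED;
    // must run before any Netty ByteBuf is allocated to take effect.
    public static void enableAdvancedLeakDetection() {
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.ADVANCED);
    }
}
```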

The request load is quite small: 2 messages every 5 minutes.

A bit more investigation shows that the PoolThreadCache$MemoryRegionCache$Entry objects are all owned by only 4 threads (one thread owns 8 of the buffers), and these threads all appear to still be alive and running. Perhaps the threads should have been stopped for some reason but haven't been?

Example thread stack trace:

[screenshot]

zoewangg commented 5 years ago

Thank you for the info. I will run some tests. It'd be super helpful if you could provide a repro case.

trevorhodges commented 5 years ago

After continuing my investigation I discovered some more information.

My initial comment says that the direct memory should be released when garbage collection occurs. I now don't believe this is true, since this memory is part of a pool? So yes, it should be "released", but released back into the Netty-managed pool rather than being deallocated altogether?

Also, after digging deeper into the DirectByteBuffer objects, I've realized that most of them reference the same memory addresses. There are only 8 distinct 16MB memory chunks allocated to these direct buffers. Is this in line with the number of allocations that would be made for a pool over time?

Given the above, despite my application having a memory leak, I'm now much less confident that the cause is continuous allocation / lack of deallocation of direct memory by the SDK's use of Netty. It seems very possible that this is normal management of direct memory buffers within the pool.
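One way to sanity-check this, assuming the SDK's Netty client is backed by Netty's default pooled allocator (which may not hold for every configuration), is to dump the allocator metrics; a minimal sketch:

```java
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public class NettyPoolProbe {
    public static void main(String[] args) {
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        // chunkSize() defaults to 16MB (8KiB pages, maxOrder 11), matching the
        // 16MB buffers seen in the heap dump; arena and cache counts show how
        // many such chunks the pool is expected to hold on to.
        System.out.println("chunk size: " + metric.chunkSize());
        System.out.println("direct arenas: " + metric.numDirectArenas());
        System.out.println("used direct memory: " + metric.usedDirectMemory());
        System.out.println("thread-local caches: " + metric.numThreadLocalCaches());
    }
}
```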

zoewangg commented 5 years ago

I was not able to reproduce the issue with an SQS client sending 10 requests per second; I didn't see an increasing number of MemoryRegionCache$Entry instances in my tests.

The growing number of MemoryRegionCache$Entry instances is definitely odd. After doing some research, I think it might be related to the Netty Recycler. Can you try disabling recycling with -Dio.netty.recycler.maxCapacity=0 to see if it helps?
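A minimal sketch of setting that property programmatically, assuming it runs before any Netty Recycler class is initialized (otherwise pass it as a JVM flag; the class name is illustrative):

```java
public class RecyclerConfig {
    public static void main(String[] args) {
        // Equivalent to passing -Dio.netty.recycler.maxCapacity=0 on the command line;
        // must be set before Netty's Recycler is first loaded to take effect.
        System.setProperty("io.netty.recycler.maxCapacity", "0");
        // ... build SDK clients and start the application after this point
    }
}
```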

I will continue to run more tests.

trevorhodges commented 5 years ago

I've been trying to reproduce the issue but don't seem to be able to. I'm not entirely sure why, but for now it's safest to close this issue rather than spend more time on it. If I see the issue again I'll re-open.

zoewangg commented 5 years ago

Hi, we have identified an issue within the SDK that could cause excessive buffer usage and eventually OOM when using the S3 async client to download a large object to a file. The fix #1335 is available in 2.7.4. Could you try the latest version? Feel free to re-open if you continue to see the issue.
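For context, a minimal sketch of the kind of call the fix applies to, downloading a large object to a file with the S3 async client (bucket, key, and path are placeholders):

```java
import java.nio.file.Paths;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class S3Download {
    public static void main(String[] args) {
        S3AsyncClient s3 = S3AsyncClient.create();
        // Streams the object body to a file rather than buffering it in memory.
        s3.getObject(GetObjectRequest.builder()
                        .bucket("my-bucket")
                        .key("large-object.bin")
                        .build(),
                AsyncResponseTransformer.toFile(Paths.get("/tmp/large-object.bin")))
          .join();
        s3.close();
    }
}
```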

atilpaiMomentFeed commented 4 years ago

> Are you creating a new client per request or using a single client throughout your application?
>
> Could you check if there are error logs, something like ERROR io.netty.util.ResourceLeakDetector - LEAK:, in your application?
>
> What is your request load? Is it high TPS?

I am seeing this within my application. We are running a KinesisAsyncClient consumer with Netty, consuming batches of 100 with a 30-second await time. Is this something you have seen before and have a fix for?

My research so far has led me to this Java option to limit the direct byte buffer cache size: -Djdk.nio.maxCachedBufferSize=262144. Is this something that will work for me? And what about the recycler Java option mentioned here? Is that relevant to my case?

Error stacktrace:

2020-05-27 21:05:46.251 [aws-java-sdk-NettyEventLoop-0-3] ERROR io.netty.util.ResourceLeakDetector(320) - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
    io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:96)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
    io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
    io.netty.handler.ssl.SslHandler.allocate(SslHandler.java:2151)
    io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1333)
    io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1233)
    io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1280)
    io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:498)
    io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:437)
    io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
    io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    java.lang.Thread.run(Thread.java:748)

We are using the akka-stream-alpakka-kinesis connector.

JDK 8 JVM options: -XX:+UseConcMarkSweepGC -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MinRAMPercentage=25.0 -XX:MaxRAMPercentage=75.0
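For reference, a minimal sketch of where explicit HTTP-client tuning can be applied to a KinesisAsyncClient at the SDK level, separate from the JVM flags above; the concurrency and timeout values are illustrative assumptions, not recommendations:

```java
import java.time.Duration;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;

public class KinesisClientFactory {
    public static KinesisAsyncClient create() {
        // Explicitly sized Netty HTTP client; values below are placeholders
        // to show where the knobs live, not tuned recommendations.
        return KinesisAsyncClient.builder()
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        .maxConcurrency(50)
                        .connectionTimeout(Duration.ofSeconds(10))
                        .readTimeout(Duration.ofSeconds(30)))
                .build();
    }
}
```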

atilpaiMomentFeed commented 4 years ago

@zoewangg, could you help us understand: do you think this has something to do with our service implementation, or with the Netty client itself? Do the configs need to be tuned better?

zoewangg commented 4 years ago

Hi @atilpaiMomentFeed, could you create a new issue? It's easier to track that way.

atilpaiMomentFeed commented 4 years ago

I have created one: https://github.com/aws/aws-sdk-java-v2/issues/1867

zoewangg commented 4 years ago

@atilpaiMomentFeed Oops, sorry, didn't notice there's one created already. Will look into it.

atilpaiMomentFeed commented 4 years ago

Thanks @zoewangg