Azure / azure-cosmosdb-java

Java Async SDK for SQL API of Azure Cosmos DB
MIT License

Netty Resource Leak observed with sorted queries leading to OutOfDirectMemoryError #73

Closed chetanmeh closed 5 years ago

chetanmeh commented 5 years ago

We are observing increased memory usage due to Netty resource leaks, which eventually causes the process to fail with OutOfDirectMemoryError. In some setups we see Netty leak warnings like:

[2018-11-15T17:59:45.025Z] [ERROR] LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.

At times the container crashes with the following exception:

[2018-11-16T11:08:34.576Z] [WARN] An exception was thrown by io.reactivex.netty.pipeline.ssl.SslCompletionHandler$1.operationComplete()
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 7147094023, max: 7158628352)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
    at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)

Per the Netty leak detection documentation, this warning is emitted when Netty's resource leak detection logic (which works by sampling about 1% of allocations) detects a leak.

Key Observations

With some trial and error, the following observations can be made about the leak:

  1. It is only seen with sorted queries
  2. It happens more prominently with datasets spanning multiple partitions; this is probably a function of the number of simultaneous parallel connections made for a query
  3. The issue is not seen with unsorted queries, probably because such queries do not connect to the partitions in parallel (Azure/azure-cosmosdb-java#54)

Because the issue requires large, multi-partition datasets, it is a bit hard to reproduce in a simple test case; a sketch of the kind of sorted cross-partition query involved is shown below.
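
A minimal sketch of the kind of sorted, cross-partition query that triggers the behaviour (not the NettyLeakTest code from PR #74; it assumes the 2.x async API, and the endpoint, key, and collection link are placeholders):

import com.microsoft.azure.cosmosdb.ConnectionPolicy;
import com.microsoft.azure.cosmosdb.ConsistencyLevel;
import com.microsoft.azure.cosmosdb.Document;
import com.microsoft.azure.cosmosdb.FeedOptions;
import com.microsoft.azure.cosmosdb.FeedResponse;
import com.microsoft.azure.cosmosdb.rx.AsyncDocumentClient;
import rx.Observable;

public class SortedQuerySketch {
    public static void main(String[] args) {
        AsyncDocumentClient client = new AsyncDocumentClient.Builder()
                .withServiceEndpoint("https://<account>.documents.azure.com:443/") // placeholder
                .withMasterKeyOrResourceToken("<master-key>")                      // placeholder
                .withConnectionPolicy(ConnectionPolicy.GetDefault())
                .withConsistencyLevel(ConsistencyLevel.Session)
                .build();

        FeedOptions options = new FeedOptions();
        options.setEnableCrossPartitionQuery(true); // query spans all partitions
        options.setMaxItemCount(100);               // page size per round trip

        // The ORDER BY makes the SDK query every partition in parallel and merge the
        // per-partition streams, which is the code path where the leak shows up.
        Observable<FeedResponse<Document>> pages = client.queryDocuments(
                "dbs/<db>/colls/<coll>",            // placeholder collection link
                "SELECT TOP 200 * FROM c ORDER BY c._ts DESC",
                options);

        int fetched = pages
                .flatMap(page -> Observable.from(page.getResults()))
                .count()
                .toBlocking()
                .single();
        System.out.println("Fetched " + fetched + " documents");

        client.close();
    }
}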

Test Setup

This issue also has a PR, #74, which adds a test case, NettyLeakTest, that can either use an existing dataset or create a synthetic one and then run many queries in both sorted and unsorted mode. It uses Netty's ResourceLeakDetector to track the number of reported leaks via a custom RecordingLeakDetector; in the ideal case the reported leak count should be zero. The test also sets -Dio.netty.leakDetection.level=PARANOID.
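
The detector itself lives in the PR; a minimal sketch of how leak reports can be counted through Netty 4.1's ResourceLeakDetectorFactory hook (the body below is illustrative, not the actual RecordingLeakDetector from PR #74) could look like this:

import java.util.concurrent.atomic.AtomicInteger;
import io.netty.util.ResourceLeakDetector;
import io.netty.util.ResourceLeakDetectorFactory;

// Counts leak reports instead of only logging them, so a test can assert on the count.
public class RecordingLeakDetector<T> extends ResourceLeakDetector<T> {

    public static final AtomicInteger LEAK_COUNT = new AtomicInteger();

    public RecordingLeakDetector(Class<?> resourceType, int samplingInterval) {
        super(resourceType, samplingInterval);
    }

    @Override
    protected void reportTracedLeak(String resourceType, String records) {
        LEAK_COUNT.incrementAndGet();
        super.reportTracedLeak(resourceType, records);
    }

    @Override
    protected void reportUntracedLeak(String resourceType) {
        LEAK_COUNT.incrementAndGet();
        super.reportUntracedLeak(resourceType);
    }

    // Install before the first ByteBuf allocation, e.g. from a TestNG @BeforeSuite method.
    public static void install() {
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
        ResourceLeakDetectorFactory.setResourceLeakDetectorFactory(new ResourceLeakDetectorFactory() {
            @Override
            public <R> ResourceLeakDetector<R> newResourceLeakDetector(
                    Class<R> resource, int samplingInterval, long maxActive) {
                return new RecordingLeakDetector<>(resource, samplingInterval);
            }
        });
    }
}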

It also tracks Netty's DIRECT_MEMORY_COUNTER, which records the amount of direct memory Netty has allocated. In case of a leak this counter shows an upward trend; in the ideal case it should eventually return to a base value. The test reports it in the following format (a sketch of how the counter can be read follows the column description below):

==> Query Count - Direct Memory Counter [Query Result Count]
==>   0 - 134217735 [200]
==>   5 - 134217735 [200]
==>  10 - 134217735 [200]
==>  15 - 134217735 [200]

Where

  1. The first column is the query count
  2. The second column is the amount of direct memory used (in bytes)
  3. The third column is the query result count
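
A sketch of how the second column can be read, assuming a Netty 4.1 release that exposes PlatformDependent.usedDirectMemory() (the test may instead read the internal DIRECT_MEMORY_COUNTER field reflectively; the helper below is only illustrative):

import io.netty.util.internal.PlatformDependent;

// Logs the direct memory Netty currently has reserved; an ever-growing value across
// otherwise identical queries points at buffers that are never released.
public final class DirectMemoryProbe {

    public static void log(int queryCount, int resultCount) {
        long used = PlatformDependent.usedDirectMemory(); // -1 if Netty is not tracking it
        System.out.printf("==> %3d - %d [%d]%n", queryCount, used, resultCount);
    }

    private DirectMemoryProbe() {
    }
}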

The leak reports are more pronounced when heap memory is constrained, e.g. with -Xmx200m.

Netty's leak detection logic also reports stack traces of recent accesses to the leaked resource, i.e. the ByteBuf:

2018-11-25 11:27:57,578       [rxnetty-nio-eventloop-3-5] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
#1:
    io.netty.handler.codec.http.DefaultHttpContent.release(DefaultHttpContent.java:94)
    io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.java:88)
    io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:90)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
    io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
    io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1389)
    io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1159)
    io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1203)
    io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
    io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
    io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    io.reactivex.netty.metrics.BytesInspector.channelRead(BytesInspector.java:59)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    io.reactivex.netty.pipeline.InternalReadTimeoutHandler.channelRead(InternalReadTimeoutHandler.java:108)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:146)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
    io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    java.lang.Thread.run(Thread.java:748)

Test runs

Run against existing dataset (sorted)

When running against our existing dataset (currently 5 partitions) with -Xmx200m -ea -Dcosmosdb.useExistingDB=true -Dcosmosdb.dbName=<db name>, the following output is seen:

==> Query Count - Direct Memory Counter [Query Result Count]
==>   0 - 33554439 [200]
==>   5 - 33554439 [200]
==>  10 - 33554439 [200]
2018-11-25 11:12:31,924       [rxnetty-nio-eventloop-3-3] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
==>  15 - 33554439 [200]
==>  20 - 33554439 [200]
...
==>  45 - 33554439 [200]
==>  50 - 33554439 [200]
==>  55 - 33554439 [200]
==>  60 - 67108871 [200]
...
==> 105 - 67108871 [200]
==> 110 - 67108871 [200]
==> 115 - 83886087 [200]
==> 120 - 83886087 [200]
==> 125 - 100663303 [200]
...
==> 165 - 100663303 [200]
==> 170 - 100663303 [200]
==> 175 - 117440519 [200]
==> 180 - 134217735 [200]
==> 185 - 134217735 [200]
==> 190 - 134217735 [200]
==> 195 - 134217735 [200]

java.lang.AssertionError: [reported leak count] 
Expecting:
 <1>
to be equal to:
 <0>
but was not.

    at com.microsoft.azure.cosmosdb.rx.NettyLeakTest.queryAndLog(NettyLeakTest.java:80)
    at com.microsoft.azure.cosmosdb.rx.NettyLeakTest.queryDocumentsSortedManyTimes(NettyLeakTest.java:60)

Key points

  1. Direct memory usage keeps increasing and can eventually end in OutOfDirectMemoryError
  2. The backing dataset here has 5 partitions

Run against test dataset (sorted)

When running against a dataset created by the test itself, which seeds 50000 records and then queries against them:


2018-11-25 11:18:53,095       [RxComputationScheduler-1] INFO  com.microsoft.azure.cosmosdb.rx.internal.RxDocumentClientImpl - Getting database account endpoint from https://xxxd.documents.azure.com:443/
2018-11-25 11:18:56,830       [TestNG-method=beforeClass-1] INFO  com.microsoft.azure.cosmosdb.rx.NettyLeakTest - Seeding database with 50000 records
2018-11-25 11:19:00,668       [TestNG-method=beforeClass-1] INFO  com.microsoft.azure.cosmosdb.rx.NettyLeakTest - Inserted batch of 1000. 2% done
2018-11-25 11:19:03,394       [TestNG-method=beforeClass-1] INFO  com.microsoft.azure.cosmosdb.rx.NettyLeakTest - Inserted batch of 1000. 4% done
2018-11-25 11:21:12,166       [TestNG-method=beforeClass-1] INFO  com.microsoft.azure.cosmosdb.rx.NettyLeakTest - Inserted batch of 1000. 98% done
2018-11-25 11:21:14,900       [TestNG-method=beforeClass-1] INFO  com.microsoft.azure.cosmosdb.rx.NettyLeakTest - Inserted batch of 1000. 100% done
2018-11-25 11:21:14,900       [TestNG-method=beforeClass-1] INFO  com.microsoft.azure.cosmosdb.rx.NettyLeakTest - Created db with 50000 records
==> Query Count - Direct Memory Counter [Query Result Count]
==>   0 - 33554439 [200]
==>   5 - 33554439 [200]
...
==>  60 - 33554439 [200]
==>  65 - 33554439 [200]
==>  70 - 33554439 [200]
2018-11-25 11:21:50,529       [rxnetty-nio-eventloop-3-3] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
==>  75 - 33554439 [200]
...
==> 185 - 33554439 [200]
==> 190 - 33554439 [200]
==> 195 - 33554439 [200]

java.lang.AssertionError: [reported leak count] 
Expecting:
 <63>
to be equal to:
 <0>
but was not.

    at com.microsoft.azure.cosmosdb.rx.NettyLeakTest.queryAndLog(NettyLeakTest.java:80)
    at com.microsoft.azure.cosmosdb.rx.NettyLeakTest.queryDocumentsSortedManyTimes(NettyLeakTest.java:60)

Key points

  1. Direct memory usage does not increase in this run
  2. The backing dataset here has 2 partitions
  3. Leaks are still observed

Run against existing dataset (unsorted)

When running the same queries unsorted against the existing dataset, the following output is seen:

2018-11-25 11:30:00,098       [TestNG-method=beforeClass-1] INFO  com.microsoft.azure.cosmosdb.rx.internal.RxDocumentClientImpl - Initializing DocumentClient with serviceEndpoint [https://xxx.documents.azure.com:443/], ConnectionPolicy [ConnectionPolicy{requestTimeoutInMillis=60000, mediaRequestTimeoutInMillis=300000, connectionMode=Gateway, mediaReadMode=Buffered, maxPoolSize=1000, idleConnectionTimeoutInMillis=60000, userAgentSuffix='', retryOptions=RetryOptions{maxRetryAttemptsOnThrottledRequests=9, maxRetryWaitTimeInSeconds=30}, enableEndpointDiscovery=true, preferredLocations=null, usingMultipleWriteLocations=false, inetSocketProxyAddress=null}], ConsistencyLevel [Session]
2018-11-25 11:30:00,553       [RxComputationScheduler-1] INFO  com.microsoft.azure.cosmosdb.rx.internal.RxDocumentClientImpl - Getting database account endpoint from https://xxx.documents.azure.com:443/
==> Query Count - Direct Memory Counter [Query Result Count]
==>   0 - 33554439 [200]
==>   5 - 33554439 [200]
==>  10 - 33554439 [200]
==>  15 - 33554439 [200]
==>  20 - 33554439 [200]
...
==> 165 - 33554439 [200]
==> 170 - 33554439 [200]
==> 175 - 33554439 [200]
==> 180 - 33554439 [200]
==> 185 - 33554439 [200]
==> 190 - 33554439 [200]
==> 195 - 33554439 [200]
2018-11-25 11:33:00,832       [TestNG-method=afterClass-1] INFO  com.microsoft.azure.cosmosdb.rx.internal.GlobalEndpointManager - GlobalEndpointManager closed.

===============================================
Default Suite
Total tests run: 200, Failures: 0, Skips: 0
===============================================

Key points

  1. No memory increase seen
  2. No resource leak reported

Environment

chetanmeh commented 5 years ago

Tried with Netty 4.1.31.Final and the issue still persists, so it looks like an issue with the way ordered queries are executed.

moderakh commented 5 years ago

thanks @chetanmeh for reporting this.

@mbhaskar could you please take a look?

mbhaskar commented 5 years ago

@chetanmeh Thanks for the PR. We could reproduce the issue with the provided test case. We will get back to you after further analysis.

moderakh commented 5 years ago

@chetanmeh, you did a complete diagnosis, thank you. We are looking at this.

moderakh commented 5 years ago

@chetanmeh could you please try this and let us know if it resolves the leak issue: https://github.com/Azure/azure-cosmosdb-java/blob/leak-tmp-fix/HOWTO.md

moderakh commented 5 years ago

@chetanmeh once you confirm the fix addressed the issue we can do the release to maven. Could you confirm please?

chetanmeh commented 5 years ago

@moderakh The change does seem to fix the issue for us, and no leak is observed. For my understanding, can you confirm the fix approach?

From the change it appears we now rely on a content subscription timeout to discard the response content if it is not subscribed to within 1 ms of receiving the response? If so, could this cause problems in slow setups (VM environments) where genuine cases may see responses being discarded?

Some observations on HTTP calls made

Sorted

For fetching the top 200 records, a total of 9 HTTP calls were made (thus fetching 900 records) against our setup with 5 partitions, and probably only the responses from 6-7 of them were consumed. The unconsumed responses from the remaining calls were adding up to the leak. With this fix, the ByteBuf for those responses is now released.
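
In Netty terms, each unconsumed page response holds a reference-counted ByteBuf that nobody releases. A generic sketch of the reference-counting contract (plain Netty, not SDK code):

import io.netty.buffer.ByteBuf;
import io.netty.util.ReferenceCountUtil;

// Whichever side decides not to read a payload must still release it; otherwise the
// backing direct-memory chunk stays allocated until the leak detector happens to notice.
public final class DiscardContract {

    public static void consumeOrDiscard(ByteBuf content, boolean stillNeeded) {
        if (stillNeeded) {
            try {
                byte[] bytes = new byte[content.readableBytes()];
                content.readBytes(bytes);        // actually consume the payload
            } finally {
                content.release();               // then release it
            }
        } else {
            // The fix effectively takes this branch for the extra per-partition pages
            // that the sorted merge never subscribes to.
            ReferenceCountUtil.release(content);
        }
    }

    private DiscardContract() {
    }
}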

Unsorted

For the unsorted case only 4 calls were made, and all responses were consumed.

moderakh commented 5 years ago

@simplynaveen20 will respond.

simplynaveen20 commented 5 years ago

@chetanmeh This content subscription timeout is the RxNetty HttpClientResponse default. We tested different scenarios and it is working fine; please let us know if you find any issue or odd behavior.

chetanmeh commented 5 years ago

@simplynaveen20 Yup, it is part of the RxNetty defaults. My worry was more about the timeout being as low as 1 ms, which may cause issues on boxes with high CPU contention. For example, in one usage I see it being set to 5 seconds:

https://github.com/Netflix/rx-aws-java-sdk/blob/0cc14ddc3d9bdb299f5b5769d53cc83507a22d79/rx-aws-java-sdk-core/src/main/java/com/amazonaws/http/AmazonRxNettyHttpClient.java#L409-L413

I was trying to see whether the unconsumed responses could be addressed (irrespective of the timeout) by implementing onUnsubscribe in the AsyncOnSubscribe created in Paginator, where I believe we have access to the last invoked request for a partition, but I got lost in the layers! A rough illustration of the idea is sketched below.
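
Expressed with plain RxJava 1.x operators rather than the SDK's internal Paginator (so purely an illustration of the idea), the release of prefetched page content could be tied to the subscription lifecycle, e.g. via Observable.using:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.ReferenceCountUtil;
import rx.Observable;

// Illustration only: Observable.using ties the lifetime of a reference-counted resource
// to the subscription, so when a TOP-N merge unsubscribes early the prefetched page
// content is still released instead of leaking.
public final class ReleaseOnUnsubscribe {

    public static Observable<ByteBuf> pageContent() {
        return Observable.using(
                () -> Unpooled.buffer(),                          // acquire the (placeholder) page buffer
                (ByteBuf buf) -> Observable.just(buf),            // hand it downstream
                (ByteBuf buf) -> ReferenceCountUtil.release(buf)  // released on unsubscribe or termination
        );
    }

    private ReleaseOnUnsubscribe() {
    }
}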

Anyway, for now we do not see any issue in our testing. We will keep a watch and let you know if we observe anything related.

chetanmeh commented 5 years ago

We have not observed this issue since updating to the latest Cosmos DB SDK, 2.3.0. Resolving it as closed.