Closed: chetanmeh closed this issue 5 years ago
Tried with Netty 4.1.31.Final and the issue still persists, so it looks like some issue with the way ordered queries are executed.
Thanks @chetanmeh for reporting this.
@mbhaskar could you please take a look?
@chetanmeh Thanks for the PR. We could reproduce the above issue with the test case provided. We will get back to you after further analysis.
@chetanmeh, you did a complete diagnosis, thank you. We are looking at this.
@chetanmeh could you please try this and let us know if it resolves the leak issue: https://github.com/Azure/azure-cosmosdb-java/blob/leak-tmp-fix/HOWTO.md
@chetanmeh Once you confirm the fix addressed the issue, we can do the release to Maven. Could you confirm please?
@moderakh The change does seem to fix the issue for us and no leak is observed. For my understanding can you confirm the fix approach?
From the change it appears we are now relying on a content subscription timeout to discard the response content if it is not subscribed to within 1 ms of receiving the response? If yes, could this cause issues in slow setups (VM environments) where genuine cases may see responses being discarded?
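For illustration, here is a minimal stdlib-only sketch of the pattern being discussed, not the SDK's or RxNetty's actual code: if nothing claims the response content within the timeout, release it so its buffer is not leaked. The `Releasable` interface, the `guard` helper, and the scheduler wiring are all hypothetical stand-ins.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ContentTimeoutSketch {
    /** Stand-in for a response body holding a Netty ByteBuf. */
    interface Releasable { void release(); }

    static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    /** Returns a "claim" the real subscriber must run before the timeout fires. */
    static Runnable guard(Releasable content, long timeout, TimeUnit unit) {
        AtomicBoolean claimed = new AtomicBoolean(false);
        // If nobody claims the content in time, release it to avoid a leak.
        SCHEDULER.schedule(() -> {
            if (claimed.compareAndSet(false, true)) {
                content.release();
            }
        }, timeout, unit);
        return () -> claimed.set(true); // subscriber calls this to keep the content
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean released = new AtomicBoolean(false);
        guard(() -> released.set(true), 50, TimeUnit.MILLISECONDS);
        Thread.sleep(200); // nobody subscribes, so the guard releases the content
        System.out.println("released=" + released.get());
        SCHEDULER.shutdown();
    }
}
```

The concern raised above maps directly onto the timeout argument: with a 1 ms window, a subscriber delayed by CPU contention could lose the race against the release task.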
Some observations on the HTTP calls made:
- For fetching the top 200 records, a total of 9 HTTP calls were made (thus fetching 900 records) for our setup having 5 partitions, and probably only the responses from 6-7 of them were consumed. So the unconsumed responses from the other 2 calls were adding up to a leak. With this fix the `ByteBuf` for those responses would now be released.
- For the unsorted case only 4 calls were made and all responses were consumed.
@simplynaveen20 will respond.
@chetanmeh This content subscription timeout is the default value of RxNetty's HttpClientResponse. We tested different scenarios and it is working fine; please let us know if you find any issue or odd behavior.
@simplynaveen20 Yup, it's part of the RxNetty defaults. My worry was more about the timeout being as low as 1 ms, which may cause issues on boxes with high CPU contention. For example, in one usage I see it being set to 5 seconds.
I was trying to see if the unconsumed response could possibly be addressed (irrespective of the timeout) by implementing `onUnsubscribe` in the `AsyncOnSubscribe` created in `Paginator`, where I believe we have access to the last invoked request for a partition. But I got lost in the layers!
Anyway, for now in our testing we do not see any issues. We will keep a watch and let you know if we observe anything related.
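To make the "release on unsubscribe" idea above concrete, here is a hypothetical stdlib-only sketch using `java.util.concurrent.Flow` in place of RxJava (this is not the actual Paginator code): when the downstream cancels, every page it never consumed gets released. `Page` stands in for a response page backed by a Netty `ByteBuf`.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Flow;

public class ReleaseOnCancelSketch {
    /** Stand-in for a query result page holding a refcounted buffer. */
    static class Page {
        boolean released;
        void release() { released = true; } // stands in for ByteBuf.release()
    }

    static class PageSubscription implements Flow.Subscription {
        final Queue<Page> unconsumed;
        PageSubscription(Queue<Page> unconsumed) { this.unconsumed = unconsumed; }

        @Override public void request(long n) { /* hand out pages; omitted */ }

        @Override public void cancel() {
            // The idea discussed above: on unsubscribe, release every page
            // the downstream never consumed instead of relying on a timeout.
            Page p;
            while ((p = unconsumed.poll()) != null) {
                p.release();
            }
        }
    }

    public static void main(String[] args) {
        Queue<Page> pending = new ArrayDeque<>();
        Page a = new Page();
        Page b = new Page();
        pending.add(a);
        pending.add(b);
        new PageSubscription(pending).cancel(); // subscriber goes away early
        System.out.println("allReleased=" + (a.released && b.released));
    }
}
```

Unlike the timeout approach, this releases deterministically at the moment the consumer walks away, so it is insensitive to how slow the box is.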
Have not observed this issue since updating to the latest Cosmos DB SDK 2.3.0. Resolving it as closed.
We are observing increased use of memory due to Netty resource leaks, which is leading the process to fail with `OutOfDirectMemoryError`. In some of the setups we are observing Netty leak warnings; at times the container crashes with that exception.
Per the Netty leak detection documentation, this warning is emitted when its resource leak detection logic (which works by sampling 1% of allocations) detects a leak.
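For reference, the detection level mentioned later in this thread can also be set programmatically, as long as it happens before any Netty class loads. A minimal stdlib-only sketch (equivalent to passing `-Dio.netty.leakDetection.level=PARANOID` on the command line):

```java
public class LeakLevelConfig {
    public static void main(String[] args) {
        // PARANOID tracks every allocation rather than the default ~1% sample;
        // useful in tests, but too slow for production use.
        // Must run before Netty's ResourceLeakDetector is initialized.
        System.setProperty("io.netty.leakDetection.level", "PARANOID");
        System.out.println(System.getProperty("io.netty.leakDetection.level"));
    }
}
```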
Key Observations
With some trial and error the following observations can be made w.r.t. the leak. Due to the requirement of large partitions it is a bit hard to reproduce in a simple test case.
Test Setup
This issue also has a PR #74 which adds a test case `NettyLeakTest`. It can either make use of an existing data set or create a synthetic one, and then performs multiple queries in both sorted and unsorted mode. It makes use of Netty's ResourceLeakDetector to track the number of reported leaks via a custom `RecordingLeakDetector`. In the ideal case the reported leak count should be zero. The test also sets `-Dio.netty.leakDetection.level=PARANOID`.
It also tracks Netty's DIRECT_MEMORY_COUNTER, which records the direct memory allocated by Netty. In case of a leak this counter reports an upward trend; in the ideal case it should eventually return to a base value.
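The trend check described here can be sketched in a few lines of stdlib-only Java; the samples below are synthetic stand-ins for the real counter values read from Netty:

```java
import java.util.List;

public class DirectMemoryTrendSketch {
    /**
     * Flags a leak if the sampled counter only ever goes up. A healthy
     * process should return to a base value at some point; a strictly
     * increasing series suggests buffers are never released.
     */
    static boolean looksLikeLeak(List<Long> samples) {
        for (int i = 1; i < samples.size(); i++) {
            if (samples.get(i) <= samples.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Long> leaking = List.of(10L, 20L, 35L, 60L, 90L);
        List<Long> healthy = List.of(10L, 20L, 12L, 25L, 11L);
        System.out.println("leaking=" + looksLikeLeak(leaking)
                + " healthy=" + looksLikeLeak(healthy));
    }
}
```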
The leak reports are more pronounced when memory is constrained (`-Xmx200m`). The Netty leak detection logic also reports stack traces of recent accesses to the leaked resource, i.e. the `ByteBuf`.
Test runs
Run against existing dataset (sorted)
While running against our dataset (having 5 partitions) with `-Xmx200m -ea -Dcosmosdb.useExistingDB=true -Dcosmosdb.dbName=<db name>`, the following output can be seen.
Key points
Run against test dataset (sorted)
While running against a dataset created by the test itself, where it seeds 50000 records and then performs queries against it.
Key points
Run against existing dataset (unsorted)
While running the
Key points
Environment