joni-jones opened this issue 1 year ago
I believe akka 2.6.20 is the first release under the non-open source BSL license, not Apache v2. Therefore changes to update OpenWhisk to akka 2.6.20 cannot be accepted by the Apache OpenWhisk project.
@dgrove-oss 2.6.20 is still Apache licensed; 2.7.x and later are BSL. They actually released another patch, 2.6.21, a couple of months ago to fix a TLS bug.
Apache Pekko has started doing official releases over the last month. Once we get on to 2.6.20 we can start discussing migrating the project to Pekko. So far the core modules, http, and kafka have been released. They’re about to do management and then the rest of the connectors. I think there should be releases for everything by September at the pace they’re going.
For the topic of this memory leak, more information is needed. Is the memory leak only with 2.6.20? Can you reproduce off master? Are you using the new scheduler which uses the v2 FPCInvoker or the original invokers?
Cool, @bdoyle0182 thanks for clarifying. I had found an old post that said 2.6.19 was the last Apache version and 2.6.20 and beyond were going to be BSL.
A strategy of getting to the most recent Apache licensed version from Lightbend and then switching to Pekko sounds right to me.
@bdoyle0182 we are migrating our project from Akka 2.5.26; on that version there is no memory leak. As our project has some slight modifications to OpenWhisk, I'm not able to use the OpenWhisk master branch to run the same load and collect heap dumps. We use the original invokers.
Apache Pekko, a fork of Akka 2.6, has been released. v1.0.1 is out and is very similar to Akka 2.6.21.
https://pekko.apache.org/docs/pekko/current/project/migration-guides.html
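For anyone looking at the migration, the dependency swap is mostly mechanical. A rough sketch of what it might look like in an sbt build (module names and versions here are illustrative and should be checked against the guide above):

```scala
// Illustrative only: swap Akka artifacts for their Pekko equivalents under the new group id.
libraryDependencies ++= Seq(
  "org.apache.pekko" %% "pekko-actor"  % "1.0.1",
  "org.apache.pekko" %% "pekko-stream" % "1.0.1",
  "org.apache.pekko" %% "pekko-http"   % "1.0.0"
)
// Source changes are largely a package rename: akka.* becomes org.apache.pekko.*
```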
@joni-jones Is there any chance you could provide a self-contained reproducer?
If you want to raise a Pekko issue about this, someone may be able to help.
Since the strings are all IP addresses and this sits below the stream materializer, it could be incoming connections that are hanging or not being cleaned up (speaking without knowing anything about OpenWhisk). Hard to say without knowing more about the setup.
@jrudolph I'm looking at these graphs, and the strings with IPs account for 0% compared to the RedBlackTree allocation. But I'm still checking whether they could be an issue. I see that these RedBlackTree entries have flow-*-0-ignoreSink as a value.
What you are probably looking at is the child actors of the materializer actor where one actor is spawned for every stream you run. So, it might be a bit hard to see what the actual issue is because the memory might be spread over all these actors. One way to go about it would be to see the output of a class histogram just over the elements referenced by that children tree and see what kind of data is in there.
Thanks @jrudolph. Yes, I tried to go down through these trees, and the leaves point to child actors and ignore-sink.
I don't know if it's related, but some time ago, when OpenWhisk was upgraded from Akka 2.5.x to 2.6.12 and the actor materializer was removed, a materializer.shutdown() call was dropped as well: https://github.com/apache/openwhisk/pull/5065/files#diff-e0bd51cbcd58c3894e1ffa4894de22ddfd47ae87352912de0e30cd60db315758L131-R130. I don't know all the internals of Materializer, but if that method was used to destroy all related actors, is it possible that after its removal from connection.shutdown some actors might hang around?
The version we are upgrading from still uses Akka 2.5.x, and we don't have memory issues there.
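For illustration, a rough sketch of the pre-2.6 pattern I'm referring to (not the actual OpenWhisk code; the class name is made up), where a client owns a dedicated materializer and tears it down explicitly:

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Materializer}

// Akka 2.5-era pattern: the client owns a dedicated materializer and shuts it down itself.
class LegacyClient(system: ActorSystem) {
  private implicit val materializer: Materializer = ActorMaterializer()(system)

  // ... streams for this client are run against `materializer` ...

  def shutdown(): Unit =
    materializer.shutdown() // stops every actor spawned for streams run on this materializer
}
```

After the 2.6 upgrade the shared system materializer is typically used instead and is never shut down per client, so any per-stream cleanup has to happen explicitly.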
It seems the issue is in https://github.com/apache/openwhisk/blob/master/common/scala/src/main/scala/org/apache/openwhisk/http/PoolingRestClient.scala#L76: without the materializer.shutdown() that was removed by the Akka upgrade to 2.6.12, it leaks memory. Also, OverflowStrategy.dropNew was deprecated in 2.6.11, and under the hood the queue providing the same behavior changed from SourceQueueWithComplete to BoundedSourceQueueStage, which, without proper cleanup of materialized resources, doesn't appear to free up the memory.
In our implementation, we use a wrapper on top of PoolingRestClient for HTTP communication between invokers and action pods instead of the OpenWhisk ApacheBlockingContainerClient.
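To illustrate the API difference (a minimal sketch, not the OpenWhisk code; the setup is made up):

```scala
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}

object QueueApiDemo extends App {
  implicit val system: ActorSystem = ActorSystem("queue-demo")

  // Pre-2.6.11 style: materializes a SourceQueueWithComplete, offer() returns a Future.
  val oldStyleQueue =
    Source.queue[String](64, OverflowStrategy.dropNew).to(Sink.ignore).run()

  // 2.6.11+ replacement: materializes a BoundedSourceQueue, offer() is synchronous.
  // The stream behind it only stops once complete() or fail() is called, so a client
  // that never does that on shutdown keeps the materialized stream (and its actor) alive.
  val newStyleQueue =
    Source.queue[String](64).to(Sink.ignore).run()

  newStyleQueue.complete()
  oldStyleQueue.complete()
  system.terminate()
}
```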
I did a couple of different implementations, including:

- OverflowStrategy.dropHead to continue using SourceQueueWithComplete instead of the new BoundedSourceQueueStage, with extra logic on shutdown: no memory leaks were observed.
- OverflowStrategy.dropNew with no changes for shutdown: seems to be leaking memory.
- BoundedSourceQueueStage but with proper cleanup on shutdown by using KillSwitch and queue.complete: seems to be working fine as well, with no memory issues.

@joni-jones Thanks for sharing the update.
It looks like I was able to fix the memory leak, and it has been stable in our production so far.
I will be working on the PR shortly; I believe the leak happens due to improper resource cleanup in PoolingRestClient.
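Roughly, the idea is to make sure the materialized queue and the stream behind it are torn down on shutdown. A minimal sketch (not the actual PoolingRestClient code; names and sizes are illustrative):

```scala
import akka.actor.ActorSystem
import akka.stream.KillSwitches
import akka.stream.scaladsl.{Keep, Sink, Source}

class QueueBackedClient(implicit system: ActorSystem) {
  // Materialize both the bounded queue and a kill switch for the stream behind it.
  private val (queue, killSwitch) =
    Source
      .queue[String](128)
      .viaMat(KillSwitches.single[String])(Keep.both)
      .toMat(Sink.ignore)(Keep.left)
      .run()

  def offer(msg: String) = queue.offer(msg)

  def shutdown(): Unit = {
    queue.complete()      // stop accepting new elements so the stream can finish
    killSwitch.shutdown() // make sure the materialized stream actually terminates
  }
}
```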
Summary
I'm working on upgrading OpenWhisk to Akka 2.6.20 and Scala 2.13, and I've run into an issue with OpenWhisk invokers consuming all available G1 old-generation heap after running for a couple of days with active traffic.
Doing heap profiling I got the following suggestions from Heap Hero:
Further analysis with Eclipse Memory Analyzer shows the following:
Environment details:
Any suggestions on where I should look to find the root cause of this memory leak?