
Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0

Memory leak in `akka.actor.LocalActorRef` #5431

Open joni-jones opened 1 year ago

joni-jones commented 1 year ago

Summary

I'm working on upgrading OpenWhisk to Akka 2.6.20 and Scala 2.13, and I ran into an issue where OpenWhisk invokers consume all of the available G1 old-generation heap after running for a couple of days under active traffic.

While profiling the heap, I got the following report from Heap Hero:

One instance of akka.actor.LocalActorRef loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x3c05c5018
occupies 20,136,784 (18.14%) bytes.
The memory is accumulated in one instance of scala.collection.immutable.RedBlackTree$Tree,
loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x3c05c5018, which occupies 20,132,728 (18.14%) bytes.

Further analysis with Eclipse Memory Analyzer shows the following:

(Two Eclipse Memory Analyzer screenshots, 2023-08-03.)


Environment details:

Any suggestions on where I should look to find the root cause of this memory leak?

dgrove-oss commented 1 year ago

I believe akka 2.6.20 is the first release under the non-open source BSL license, not Apache v2. Therefore changes to update OpenWhisk to akka 2.6.20 cannot be accepted by the Apache OpenWhisk project.

bdoyle0182 commented 1 year ago

@dgrove-oss 2.6.20 is still Apache-licensed; 2.7.x and later are BSL. They actually released another patch, 2.6.21, a couple of months ago to fix a TLS bug.

Apache Pekko has started doing official releases over the last month. Once we get on to 2.6.20 we can start discussing migrating the project to Pekko. So far the core modules, http, and kafka have been released. They’re about to do management and then the rest of the connectors. I think there should be releases for everything by September at the pace they’re going.

On the topic of this memory leak, more information is needed. Does the leak occur only with 2.6.20? Can you reproduce it off master? Are you using the new scheduler, which uses the v2 FPCInvoker, or the original invokers?

dgrove-oss commented 1 year ago

Cool, @bdoyle0182 thanks for clarifying. I had found an old post that said 2.6.19 was the last Apache version and 2.6.20 and beyond were going to be BSL.

A strategy of getting to the most recent Apache licensed version from Lightbend and then switching to Pekko sounds right to me.

joni-jones commented 1 year ago

@bdoyle0182 we are migrating our project from Akka 2.5.26; on that version there is no memory leak. Since our project has some slight modifications to OpenWhisk, I'm not able to run the same load against the OpenWhisk master branch and collect heap dumps. We use the original invokers.

pjfanning commented 1 year ago

Apache Pekko, a fork of Akka 2.6, has been released. v1.0.1 is out and is very similar to Akka 2.6.21.

https://pekko.apache.org/docs/pekko/current/project/migration-guides.html

He-Pin commented 1 year ago

@joni-jones Is there any chance you could provide a self-contained reproducer?

pjfanning commented 1 year ago

If you want to raise a Pekko issue about this, someone may be able to help.

https://github.com/apache/incubator-pekko

jrudolph commented 1 year ago

Since the strings are all IP addresses and the data sits below the stream materializer, this could be incoming connections that are hanging or not being cleaned up (speaking without any knowledge of OpenWhisk). Hard to say without knowing more about the setup.

joni-jones commented 1 year ago

@jrudolph I'm looking at these graphs, and the strings with IPs account for roughly 0% compared to the RedBlackTree allocation. But I'm still checking whether they could be the issue.

I see that these RedBlackTree entries have flow-*-0-ignoreSink as values.

jrudolph commented 1 year ago

What you are probably looking at are the child actors of the materializer actor, where one actor is spawned for every stream you run. So it might be a bit hard to see what the actual issue is, because the memory might be spread over all these actors. One way to go about it would be to look at a class histogram restricted to the elements referenced by that children tree and see what kind of data is in there.
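
To illustrate the point (a minimal, hypothetical demo assuming Akka 2.6, not OpenWhisk code): every run() materializes a new stream with its own child actor under the system materializer, and streams that never terminate keep those actors alive.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

object MaterializerChildrenDemo extends App {
  implicit val system: ActorSystem = ActorSystem("demo")

  // Each iteration materializes a stream that never completes (Source.maybe
  // never emits), so its child actor under the materializer is never stopped
  // and the children tree keeps growing, e.g. with entries like
  // flow-<n>-0-ignoreSink for the Sink.ignore stage.
  (1 to 10000).foreach { _ =>
    Source.maybe[Int].to(Sink.ignore).run()
  }
}
```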

joni-jones commented 1 year ago

Thanks @jrudolph. Yes, I tried going down through these trees, and the leaves point to child actors and the ignore-sink.

(Screenshot of the heap analysis, 2023-08-14.)

I don't know if it's related, but some time ago, when OpenWhisk was upgraded from Akka 2.5.x to 2.6.12 and the dedicated actor materializer was removed, a materializer.shutdown() call was dropped: https://github.com/apache/openwhisk/pull/5065/files#diff-e0bd51cbcd58c3894e1ffa4894de22ddfd47ae87352912de0e30cd60db315758L131-R130. I don't know all the internals of Materializer, but if that method was used to destroy all related actors, is it possible that after its removal from connection.shutdown some actors might hang around?
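
For context, a rough sketch (my own illustration, assuming the pre-2.6.12 style with a dedicated ActorMaterializer, not OpenWhisk's actual client code) of what the removed call used to do:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object MaterializerShutdownDemo extends App {
  implicit val system: ActorSystem = ActorSystem("demo")
  // Dedicated materializer, as used before the Akka 2.6.12 upgrade.
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Streams materialized with this materializer create child actors under it...
  Source.maybe[Int].to(Sink.ignore).run()

  // ...and shutdown() aborts all of them, including streams that never
  // completed on their own, so their actors are freed.
  materializer.shutdown()
}
```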

The version that we are upgrading from still uses 2.5.x Akka and we don't have issues with memory there.

joni-jones commented 1 year ago

It seems the issue is in https://github.com/apache/openwhisk/blob/master/common/scala/src/main/scala/org/apache/openwhisk/http/PoolingRestClient.scala#L76: without the materializer.shutdown() call that was removed in the Akka 2.6.12 upgrade, it leaks memory. Also, OverflowStrategy.dropNew was deprecated in 2.6.11, and under the hood the queue implementation for that behavior changed from SourceQueueWithComplete to BoundedSourceQueueStage, which, without proper cleanup of the materialized resources, doesn't seem to free the memory.
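
For reference, a side-by-side of the two queue APIs mentioned above (a simplified sketch; the buffer size and element type are arbitrary, and this is not PoolingRestClient's actual wiring):

```scala
import akka.actor.ActorSystem
import akka.stream.{BoundedSourceQueue, OverflowStrategy}
import akka.stream.scaladsl.{Keep, Sink, Source, SourceQueueWithComplete}

object QueueApiComparison extends App {
  implicit val system: ActorSystem = ActorSystem("demo")

  // Old API: OverflowStrategy.dropNew (deprecated in 2.6.11) materializes a
  // SourceQueueWithComplete, whose offer() returns a Future[QueueOfferResult].
  val oldStyle: SourceQueueWithComplete[String] =
    Source.queue[String](128, OverflowStrategy.dropNew).toMat(Sink.ignore)(Keep.left).run()

  // New API: the plain bounded queue has drop-new semantics built in and
  // materializes a BoundedSourceQueue, whose offer() returns QueueOfferResult directly.
  val newStyle: BoundedSourceQueue[String] =
    Source.queue[String](128).toMat(Sink.ignore)(Keep.left).run()
}
```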

In our implementation, we use a wrapper on top of PoolingRestClient for HTTP communication between invokers and action pods instead of OpenWhisk's ApacheBlockingContainerClient.

I did a couple of different implementations, including:

  1. Using OverflowStrategy.dropHead to keep SourceQueueWithComplete instead of the new BoundedSourceQueueStage, plus extra logic on shutdown: no memory leaks were observed.
  2. Continuing to use OverflowStrategy.dropNew with no changes on shutdown: this seems to leak memory.
  3. Using the queue backed by BoundedSourceQueueStage but with proper cleanup on shutdown via a KillSwitch and queue.complete: this also seems to work fine, with no memory issues (see the sketch below).
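
A minimal sketch of that third approach (my own illustrative code under the assumptions above, not the actual patch; the class and member names are made up): the bounded queue, a KillSwitch, and the sink's completion future are all kept from materialization, and shutdown completes the queue and flips the kill switch so the materialized stream and its actors are torn down.

```scala
import scala.concurrent.{Future, Promise}
import scala.util.{Failure, Success}

import akka.Done
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.{KillSwitches, QueueOfferResult}
import akka.stream.scaladsl.{Keep, Sink, Source}

// Hypothetical wrapper, roughly in the shape of PoolingRestClient.
class QueueingHttpClient(host: String, port: Int, queueSize: Int)(implicit system: ActorSystem) {

  private val pool =
    Http().cachedHostConnectionPool[Promise[HttpResponse]](host, port)

  // Keep all three materialized values: the bounded queue, the kill switch,
  // and the sink's completion future.
  private val ((requestQueue, killSwitch), streamDone) =
    Source
      .queue[(HttpRequest, Promise[HttpResponse])](queueSize) // BoundedSourceQueue, drop-new semantics
      .viaMat(KillSwitches.single)(Keep.both)
      .via(pool)
      .toMat(Sink.foreach {
        case (Success(response), promise) => promise.success(response)
        case (Failure(error), promise)    => promise.failure(error)
      })(Keep.both)
      .run()

  def request(req: HttpRequest): Future[HttpResponse] = {
    val promise = Promise[HttpResponse]()
    requestQueue.offer(req -> promise) match {
      case QueueOfferResult.Enqueued => promise.future
      case other                     => Future.failed(new RuntimeException(s"Request not enqueued: $other"))
    }
  }

  // Proper cleanup: complete the queue, stop the stream, and wait for it to
  // finish, so no materializer child actors are left behind.
  def shutdown(): Future[Done] = {
    requestQueue.complete()
    killSwitch.shutdown()
    streamDone
  }
}
```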

He-Pin commented 1 year ago

@joni-jones Thanks for sharing the update.

joni-jones commented 1 year ago

It looks like I was able to fix the memory leak, and it has been stable in our production so far. I will be working on a PR shortly; I believe the leak happens due to improper resource cleanup in PoolingRestClient.