apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0
6.56k stars 1.17k forks source link

Memory leak in `akka.actor.LocalActorRef` #5442

Closed joni-jones closed 1 year ago

joni-jones commented 1 year ago

Description

Note: In our service, we use PoolingRestClient for communication between invokers and action pods and memory leak appears for long-running asynchronous invocations as it seems that objects allocated within streams are not clean properly after Akka materialization has been removed.

Related issue and scope

My changes affect the following components

Types of changes

Checklist:

dgrove-oss commented 1 year ago

I approved running the CI tests. Changes looked plausible to me, but I'm not an Akka expert.

joni-jones commented 1 year ago

I'm looking into failed tests due to NullPointerException on sinkCompletion.onComplete step.

joni-jones commented 1 year ago

It looks like the NullPointerException appears due to CouchDbRestClient redefining the execution context.

joni-jones commented 1 year ago

I changed the code to be able to propagate the right context.

codecov-commenter commented 1 year ago

Codecov Report

Merging #5442 (d084a4f) into master (20f7d98) will decrease coverage by 25.91%. Report is 1 commits behind head on master. The diff coverage is 75.00%.

:exclamation: Current head d084a4f differs from pull request most recent head 3fb2fc5. Consider uploading reports for the commit 3fb2fc5 to get more accurate results

@@             Coverage Diff             @@
##           master    #5442       +/-   ##
===========================================
- Coverage   76.82%   50.91%   -25.91%     
===========================================
  Files         241      241               
  Lines       14632    14630        -2     
  Branches      616      606       -10     
===========================================
- Hits        11241     7449     -3792     
- Misses       3391     7181     +3790     
Files Changed Coverage Δ
...whisk/core/containerpool/AkkaContainerClient.scala 44.23% <ø> (-26.93%) :arrow_down:
.../containerpool/logging/ElasticSearchLogStore.scala 0.00% <0.00%> (-92.86%) :arrow_down:
...ontainerpool/logging/ElasticSearchRestClient.scala 8.88% <ø> (-68.89%) :arrow_down:
.../org/apache/openwhisk/http/PoolingRestClient.scala 87.09% <80.00%> (-3.82%) :arrow_down:
...he/openwhisk/core/database/CouchDbRestClient.scala 70.83% <100.00%> (-16.67%) :arrow_down:

... and 149 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

bdoyle0182 commented 1 year ago

LGTM as well

falkzoll commented 1 year ago

Hi @joni-jones , maybe you or someone else here in this PR can help us to understand an issue in the https://github.com/apache/openwhisk-runtime-nodejs runtime... actually the scheduled openwhisk nodejs runtime builds fail (https://github.com/apache/openwhisk-runtime-nodejs/actions). They fail with a timeout (after 30s) in a certain testcase that tests the concurrent invocation capability (number of concurrent invokes/runs for a single action container) of this runtime (for nodejs:18 and nodejs:20).

runtime.actionContainers.NodeJs18ConcurrentTests > action-nodejs-v18 should allow running activations concurrently FAILED
    java.util.concurrent.TimeoutException
        at org.apache.openwhisk.core.containerpool.AkkaContainerClient$.$anonfun$executeRequest$1(AkkaContainerClient.scala:252)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Same for nodejs:20. This kind of concurrency is actually only supported in the nodejs runtimes. Other runtimes just support '__OW_ALLOW_CONCURRENT=false' and can only handle one invoke/run per action container at a time.

To run the tests, the github action of the nodejs runtime clones the latest available apache/openwhisk repository and uses its master as the base to run its tests (https://github.com/apache/openwhisk-runtime-nodejs/blob/master/.github/workflows/ci.yaml).

The actually failing testcase performs 128 (https://github.com/apache/openwhisk-runtime-nodejs/blob/c60a6676375d85878c658412162004848c19f965/tests/src/test/scala/runtime/actionContainers/NodeJsConcurrentTests.scala#L32) parallel action invokes/run requests into a single nodejs runtime action container. The scala test utilizes the AkkaContainerClient to open these 128 parallel connections (https://github.com/apache/openwhisk-runtime-nodejs/blob/c60a6676375d85878c658412162004848c19f965/tests/src/test/scala/runtime/actionContainers/NodeJsConcurrentTests.scala#L24). The nodejs test action invoked inside the action container (https://github.com/apache/openwhisk-runtime-nodejs/blob/c60a6676375d85878c658412162004848c19f965/tests/src/test/scala/runtime/actionContainers/NodeJsConcurrentTests.scala#L41) is coded to wait for 128 incoming invokes before it starts to complete all of them with a response (makes use of global variables). Means the first action run request to this action is held open and not answered before the 128th run reached the action code in this container. With this the test can be sure that this runtime can handle this number of concurrent action invokes being open at the same time. This test usually takes far less than 5 seconds in the github action while the timeout for this test is 30s.

Debugging it locally showed that with the current implementation in this PR the test seems not to reach the required 128 parallel connections anymore. It seems after a set of open connections is reached (far less than 128), no others are done anymore or maybe they arrive very, very slow. With this, the pending invokes in the action container do not return a response in time and the testcase fails after 30s. Reverting to the previous commit (0c27a650ab6073e131e5c74002465e93cf4d8621) resolves the issue and the concurrency test completes successful within the usual few seconds for nodejs:18 and nodejs:20.

Looking at the changes in this PR it does not look like a change is needed to adapt the nodejs runtime tests. Anyhow, with your broader akka background, do we need to modify something in the nodejs runtime tests to consume this latest apache/openwhisk core? Any hints are welcome :-).

style95 commented 1 year ago

@falkzoll Indeed.. Could you really resolve the issue by reverting the previous commit (https://github.com/apache/openwhisk/commit/0c27a650ab6073e131e5c74002465e93cf4d8621)? It's surprising because the userCPU config is disabled by default.

style95 commented 1 year ago

I think we should get into the CI environment and see what's going on. I think we can add an on-demand workflow along with Ngrok debugging. https://github.com/apache/openwhisk/blob/master/.github/workflows/0-on-demand.yaml#L23

falkzoll commented 1 year ago

@style95 thanks for your reply 👍 .... just to clarify... running the nodejs runtime tests locally with apache/openwhisk commit 6f11d48b216a01b7c6fa3342d2b8b972109106f7 (latest master right now) shows the error in the nodejs runtime concurrency tests.... going back one commit to commit 0c27a650ab6073e131e5c74002465e93cf4d8621 (which includes the CPU limits) works fine (no failures in these tests)... from that, it looks like the change was introduced with commit 6f11d48b216a01b7c6fa3342d2b8b972109106f7 and is not related to the changes for the CPU limits.

style95 commented 1 year ago

@falkzoll got it. I would take a look.

style95 commented 1 year ago

@falkzoll I created an issue: https://github.com/apache/openwhisk/issues/5452 It seems the underlying request queue is shutdown before sending all 128 concurrent requests.