Closed joni-jones closed 1 year ago
I approved running the CI tests. Changes looked plausible to me, but I'm not an Akka expert.
I'm looking into failed tests due to NullPointerException
on sinkCompletion.onComplete
step.
It looks like the NullPointerException
appears due to CouchDbRestClient
redefining the execution context.
I changed the code to be able to propagate the right context.
Merging #5442 (d084a4f) into master (20f7d98) will decrease coverage by
25.91%
. Report is 1 commits behind head on master. The diff coverage is75.00%
.:exclamation: Current head d084a4f differs from pull request most recent head 3fb2fc5. Consider uploading reports for the commit 3fb2fc5 to get more accurate results
@@ Coverage Diff @@
## master #5442 +/- ##
===========================================
- Coverage 76.82% 50.91% -25.91%
===========================================
Files 241 241
Lines 14632 14630 -2
Branches 616 606 -10
===========================================
- Hits 11241 7449 -3792
- Misses 3391 7181 +3790
Files Changed | Coverage Δ | |
---|---|---|
...whisk/core/containerpool/AkkaContainerClient.scala | 44.23% <ø> (-26.93%) |
:arrow_down: |
.../containerpool/logging/ElasticSearchLogStore.scala | 0.00% <0.00%> (-92.86%) |
:arrow_down: |
...ontainerpool/logging/ElasticSearchRestClient.scala | 8.88% <ø> (-68.89%) |
:arrow_down: |
.../org/apache/openwhisk/http/PoolingRestClient.scala | 87.09% <80.00%> (-3.82%) |
:arrow_down: |
...he/openwhisk/core/database/CouchDbRestClient.scala | 70.83% <100.00%> (-16.67%) |
:arrow_down: |
... and 149 files with indirect coverage changes
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
LGTM as well
Hi @joni-jones , maybe you or someone else here in this PR can help us to understand an issue in the https://github.com/apache/openwhisk-runtime-nodejs runtime... actually the scheduled openwhisk nodejs runtime builds fail (https://github.com/apache/openwhisk-runtime-nodejs/actions). They fail with a timeout (after 30s) in a certain testcase that tests the concurrent invocation capability (number of concurrent invokes/runs for a single action container) of this runtime (for nodejs:18 and nodejs:20).
runtime.actionContainers.NodeJs18ConcurrentTests > action-nodejs-v18 should allow running activations concurrently FAILED
java.util.concurrent.TimeoutException
at org.apache.openwhisk.core.containerpool.AkkaContainerClient$.$anonfun$executeRequest$1(AkkaContainerClient.scala:252)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Same for nodejs:20. This kind of concurrency is actually only supported in the nodejs runtimes. Other runtimes just support '__OW_ALLOW_CONCURRENT=false' and can only handle one invoke/run per action container at a time.
To run the tests, the github action of the nodejs runtime clones the latest available apache/openwhisk repository and uses its master as the base to run its tests (https://github.com/apache/openwhisk-runtime-nodejs/blob/master/.github/workflows/ci.yaml).
The actually failing testcase performs 128 (https://github.com/apache/openwhisk-runtime-nodejs/blob/c60a6676375d85878c658412162004848c19f965/tests/src/test/scala/runtime/actionContainers/NodeJsConcurrentTests.scala#L32) parallel action invokes/run requests into a single nodejs runtime action container. The scala test utilizes the AkkaContainerClient to open these 128 parallel connections (https://github.com/apache/openwhisk-runtime-nodejs/blob/c60a6676375d85878c658412162004848c19f965/tests/src/test/scala/runtime/actionContainers/NodeJsConcurrentTests.scala#L24). The nodejs test action invoked inside the action container (https://github.com/apache/openwhisk-runtime-nodejs/blob/c60a6676375d85878c658412162004848c19f965/tests/src/test/scala/runtime/actionContainers/NodeJsConcurrentTests.scala#L41) is coded to wait for 128 incoming invokes before it starts to complete all of them with a response (makes use of global variables). Means the first action run request to this action is held open and not answered before the 128th run reached the action code in this container. With this the test can be sure that this runtime can handle this number of concurrent action invokes being open at the same time. This test usually takes far less than 5 seconds in the github action while the timeout for this test is 30s.
Debugging it locally showed that with the current implementation in this PR the test seems not to reach the required 128 parallel connections anymore. It seems after a set of open connections is reached (far less than 128), no others are done anymore or maybe they arrive very, very slow. With this, the pending invokes in the action container do not return a response in time and the testcase fails after 30s. Reverting to the previous commit (0c27a650ab6073e131e5c74002465e93cf4d8621) resolves the issue and the concurrency test completes successful within the usual few seconds for nodejs:18 and nodejs:20.
Looking at the changes in this PR it does not look like a change is needed to adapt the nodejs runtime tests. Anyhow, with your broader akka background, do we need to modify something in the nodejs runtime tests to consume this latest apache/openwhisk core? Any hints are welcome :-).
@falkzoll
Indeed..
Could you really resolve the issue by reverting the previous commit (https://github.com/apache/openwhisk/commit/0c27a650ab6073e131e5c74002465e93cf4d8621)?
It's surprising because the userCPU
config is disabled by default.
I think we should get into the CI environment and see what's going on. I think we can add an on-demand workflow along with Ngrok debugging. https://github.com/apache/openwhisk/blob/master/.github/workflows/0-on-demand.yaml#L23
@style95 thanks for your reply 👍 .... just to clarify... running the nodejs runtime tests locally with apache/openwhisk commit 6f11d48b216a01b7c6fa3342d2b8b972109106f7 (latest master right now) shows the error in the nodejs runtime concurrency tests.... going back one commit to commit 0c27a650ab6073e131e5c74002465e93cf4d8621 (which includes the CPU limits) works fine (no failures in these tests)... from that, it looks like the change was introduced with commit 6f11d48b216a01b7c6fa3342d2b8b972109106f7 and is not related to the changes for the CPU limits.
@falkzoll got it. I would take a look.
@falkzoll I created an issue: https://github.com/apache/openwhisk/issues/5452 It seems the underlying request queue is shutdown before sending all 128 concurrent requests.
Description
Related issue and scope
My changes affect the following components
Types of changes
Checklist: