tam512 opened this issue 1 year ago
This issue appears to be caused by spec.probes.liveness using its default values, in particular timeoutSeconds: 2. The default liveness probe is:
httpGet:
  path: /health/live
  port: 9443
  scheme: HTTPS
initialDelaySeconds: 60
timeoutSeconds: 2
periodSeconds: 10
failureThreshold: 3
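These defaults can be confirmed on the generated workload; for example (assuming the operator created a StatefulSet, which the ordinal pod names microwebapp1-0/1/2 suggest):

% kubectl -n restore-amd-ns get statefulset microwebapp1 -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'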
Overriding the timeout in the WebSphereLibertyApplication spec:

spec:
  probes:
    liveness:
      timeoutSeconds: 5
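For context, a minimal sketch of where this override sits in the full custom resource (the applicationImage value is illustrative, and the apiVersion may vary with the operator version):

apiVersion: liberty.websphere.ibm.com/v1
kind: WebSphereLibertyApplication
metadata:
  name: microwebapp1
  namespace: restore-amd-ns
spec:
  applicationImage: <your-app-image>  # illustrative
  replicas: 3
  probes:
    liveness:
      timeoutSeconds: 5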
% kubectl -n restore-amd-ns get pods
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 8 (20m ago) 108m
microwebapp1-1 1/1 Running 0 107m
microwebapp1-2 1/1 Running 0 106m
Next I tried limits.cpu: 2 with the default liveness probe and am still seeing pods restart:
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 6 (56s ago) 13m
microwebapp1-1 1/1 Running 0 11m
microwebapp1-2 1/1 Running 0 11m
microwebapp1-0 0/1 CrashLoopBackOff 6 (1s ago) 14m
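For reference, that resource override in the CR looks like this (a sketch using the same spec.resources fields mentioned later in this thread):

spec:
  resources:
    limits:
      cpu: "2"
      memory: 512Mi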
Tested an app image without InstantOn checkpoints and deployed it with the original settings in the WebSphereLibertyApplication YAML (default liveness probe and limits.cpu: 1):
FROM stg.icr.io/cp/olc/open-liberty:beta-instanton-java17-openj9-ubi@sha256:1258654b82aadac51e8cf47247661fa1bfcb1b06d3a125be30caef21dd0da086
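For context, the rest of such a Dockerfile typically just copies the server configuration and application into the image; a minimal sketch (only the FROM line above is from the actual build, file names are illustrative):

COPY --chown=1001:0 server.xml /config/
COPY --chown=1001:0 microwebapp1.war /config/apps/
RUN configure.sh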
product = Open Liberty 23.0.0.4-beta (wlp-1.0.75.cl230320230319-1900)
java.version = 17.0.7
java.runtime = IBM Semeru Runtime Open Edition (17.0.7+5)
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 0 8m49s
microwebapp1-1 1/1 Running 4 (21s ago) 7m3s
microwebapp1-2 1/1 Running 0 6m48s
Then tested with the regular Open Liberty beta image:

FROM icr.io/appcafe/open-liberty:beta
product = Open Liberty 23.0.0.3-beta (wlp-1.0.74.cl230220230222-1257)
java.version = 17.0.5
java.runtime = IBM Semeru Runtime Open Edition (17.0.5+8)
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 1 (14m ago) 19m
microwebapp1-1 1/1 Running 0 15m
microwebapp1-2 1/1 Running 0 15m
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 3 (6m12s ago) 54m
microwebapp1-1 1/1 Running 0 51m
microwebapp1-2 1/1 Running 0 50m
So this problem also happens without InstantOn.
It's interesting that the liveness probe is constantly failing, since it should almost always return UP if the server is started (assuming there are no user-defined health checks). The fact that it is returning DOWN, or worse, timing out without giving any status, leads me to think something might be going on with the environment due to the heavy stress.
Did you try testing with a stand-alone WLO without any applications and no stress, to see if the liveness probe still restarts the pod?
One other test you can try is to increase initialDelaySeconds to 120 (2 mins) or higher, so it first starts probing the liveness endpoint after 2 minutes. I think there might be a race condition where it starts probing the liveness endpoint before the server has fully started.
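In the CR, that would look something like this (a sketch, using the same spec.probes.liveness block as above):

spec:
  probes:
    liveness:
      initialDelaySeconds: 120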
If the above two tests don't give us any useful information, I think it's best to enable the MP Health tracing and see what's going on. Can you please try enabling the following trace specification:
HEALTH=all
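One way to enable that trace specification, assuming you can rebuild the image with an updated server.xml, is Liberty's logging element (trace file name and sizes are illustrative):

<logging traceSpecification="*=info:HEALTH=all" traceFileName="trace.log" maxFileSize="20" maxFiles="10"/>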
I did some more testing to debug this issue and here is what I found:
I tried to reduce the load and monitor CPU on the pods. I started with 1 thread and doubled it (1, 2, 4, 8, 16). I saw that I could run with 16 threads OK and did not see pods restart. I also saw that the CPU usage on the pods didn't go over the CPU limit of 1024m. (Note that I have spec.resources.limits.cpu: 1024m, spec.resources.limits.memory: 512Mi, and spec.autoscaling.maxReplicas: 3.)
% watch kubectl top pod -n restore-amd-ns
NAME             CPU(cores)   MEMORY(bytes)
microwebapp1-0   818m         162Mi
microwebapp1-1   947m         162Mi
microwebapp1-2   807m         161Mi
- pods are up and running ok with 16 threads
% kubectl -n restore-amd-ns get pods -w
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 0 18h
microwebapp1-1 1/1 Running 0 18h
microwebapp1-2 1/1 Running 0 18h
Under heavier load, microwebapp1-0 hit the 1024m CPU limit:

NAME             CPU(cores)   MEMORY(bytes)
microwebapp1-0   1024m        137Mi
microwebapp1-1   803m         140Mi
microwebapp1-2   833m         144Mi
% kubectl -n restore-amd-ns get pods -w
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 0 18h
microwebapp1-1 1/1 Running 0 77s
microwebapp1-2 1/1 Running 0 75s
microwebapp1-0 1/1 Running 1 (1s ago) 18h
So with the liveness default values, the kubelet checks container liveness every 10 seconds with a 2-second timeout; when the liveness probe fails 3 consecutive times, the pod is restarted. That means roughly 30 seconds of failed probes are enough to trigger a restart.
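To confirm it really is the liveness probe (and not something else) killing the container, the probe failures show up as pod events, e.g.:

% kubectl -n restore-amd-ns describe pod microwebapp1-0

and look for "Liveness probe failed" entries in the Events section.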
Tested InstantOn with Open Liberty 23.0.0.4-beta and IBM Semeru Runtime Open Edition (17.0.7+5): deployed an application checkpoint image on Amazon EKS, ran a stress test, and saw that the application pods keep restarting with:
CWWKE0085I: The server defaultServer is stopping because the JVM is exiting.
Below is the full log.