tam512 opened this issue 1 year ago
This issue appears to be caused by spec.probes.liveness using its default values, in particular timeoutSeconds: 2. The default liveness probe is:
httpGet:
  path: /health/live
  port: 9443
  scheme: HTTPS
initialDelaySeconds: 60
timeoutSeconds: 2
periodSeconds: 10
failureThreshold: 3
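These defaults can be confirmed on the generated workload; for example (assuming the operator created a StatefulSet, which the ordinal pod names microwebapp1-0/1/2 suggest):

% kubectl -n restore-amd-ns get statefulset microwebapp1 -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'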
Overriding the timeout in the WebSphereLibertyApplication spec:

spec:
  probes:
    liveness:
      timeoutSeconds: 5
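For context, a minimal sketch of where this override sits in the full custom resource (the applicationImage value is illustrative, and the apiVersion may vary with the operator version):

apiVersion: liberty.websphere.ibm.com/v1
kind: WebSphereLibertyApplication
metadata:
  name: microwebapp1
  namespace: restore-amd-ns
spec:
  applicationImage: <your-app-image>  # illustrative
  replicas: 3
  probes:
    liveness:
      timeoutSeconds: 5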
% kubectl -n restore-amd-ns get pods
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 8 (20m ago) 108m
microwebapp1-1 1/1 Running 0 107m
microwebapp1-2 1/1 Running 0 106m
Next I tried limits.cpu: 2 with the default liveness probe and am still seeing pods restart:
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 6 (56s ago) 13m
microwebapp1-1 1/1 Running 0 11m
microwebapp1-2 1/1 Running 0 11m
microwebapp1-0 0/1 CrashLoopBackOff 6 (1s ago) 14m
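For reference, that resource override in the CR looks like this (a sketch using the same spec.resources fields mentioned later in this thread):

spec:
  resources:
    limits:
      cpu: "2"
      memory: 512Mi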
Tested an app image without InstantOn checkpoints and deployed it with the original settings in the WebSphereLibertyApplication YAML (default liveness probe and limits.cpu: 1):
FROM stg.icr.io/cp/olc/open-liberty:beta-instanton-java17-openj9-ubi@sha256:1258654b82aadac51e8cf47247661fa1bfcb1b06d3a125be30caef21dd0da086
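For context, the rest of such a Dockerfile typically just copies the server configuration and application into the image; a minimal sketch (only the FROM line above is from the actual build, file names are illustrative):

COPY --chown=1001:0 server.xml /config/
COPY --chown=1001:0 microwebapp1.war /config/apps/
RUN configure.sh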
product = Open Liberty 23.0.0.4-beta (wlp-1.0.75.cl230320230319-1900)
java.version = 17.0.7
java.runtime = IBM Semeru Runtime Open Edition (17.0.7+5)
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 0 8m49s
microwebapp1-1 1/1 Running 4 (21s ago) 7m3s
microwebapp1-2 1/1 Running 0 6m48s
Then tested with the regular Open Liberty beta image:

FROM icr.io/appcafe/open-liberty:beta
product = Open Liberty 23.0.0.3-beta (wlp-1.0.74.cl230220230222-1257)
java.version = 17.0.5
java.runtime = IBM Semeru Runtime Open Edition (17.0.5+8)
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 1 (14m ago) 19m
microwebapp1-1 1/1 Running 0 15m
microwebapp1-2 1/1 Running 0 15m
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 3 (6m12s ago) 54m
microwebapp1-1 1/1 Running 0 51m
microwebapp1-2 1/1 Running 0 50m
So this problem also happens without InstantOn.
It's interesting that the liveness probe is constantly failing, since it should almost always return UP if the server is started (assuming there are no user-defined health checks). The fact that it is returning DOWN, or worse, timing out without giving any status, leads me to think something might be going on with the environment due to the heavy stress.
Did you try testing with a stand-alone WLO without any applications and no stress, to see if the liveness probe still restarts the pod?
One other test you can try is to increase initialDelaySeconds to 120 (2 mins) or higher, so it first starts probing the liveness endpoint after 2 minutes. I think there might be a race condition where it starts probing the liveness endpoint before the server has fully started.
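In the CR, that would look something like this (a sketch, using the same spec.probes.liveness block as above):

spec:
  probes:
    liveness:
      initialDelaySeconds: 120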
If the above two tests don't give us any useful information, I think it's best to enable the MP Health tracing and see what's going on. Can you please try enabling the following trace specification:
HEALTH=all
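One way to enable that trace specification, assuming you can rebuild the image with an updated server.xml, is Liberty's logging element (trace file name and sizes are illustrative):

<logging traceSpecification="*=info:HEALTH=all" traceFileName="trace.log" maxFileSize="20" maxFiles="10"/>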
I did some more testing to debug this issue and here is what I found:
I tried to reduce the load and monitor CPU on the pods. I started with 1 thread and doubled it (1, 2, 4, 8, 16). I saw that I could run with 16 threads OK and did not see pods restart. I also saw that the CPU usage on the pods didn't go over the CPU limit of 1024m. (Note that I have spec.resources.limits.cpu: 1024m, spec.resources.limits.memory: 512Mi, and spec.autoscaling.maxReplicas: 3.)
% watch kubectl top pod -n restore-amd-ns
NAME             CPU(cores)   MEMORY(bytes)
microwebapp1-0   818m         162Mi
microwebapp1-1   947m         162Mi
microwebapp1-2   807m         161Mi
- pods are up and running ok with 16 threads
% kubectl -n restore-amd-ns get pods -w
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 0 18h
microwebapp1-1 1/1 Running 0 18h
microwebapp1-2 1/1 Running 0 18h
Under heavier load, microwebapp1-0 hit the 1024m CPU limit:

NAME             CPU(cores)   MEMORY(bytes)
microwebapp1-0   1024m        137Mi
microwebapp1-1   803m         140Mi
microwebapp1-2   833m         144Mi
% kubectl -n restore-amd-ns get pods -w
NAME READY STATUS RESTARTS AGE
microwebapp1-0 1/1 Running 0 18h
microwebapp1-1 1/1 Running 0 77s
microwebapp1-2 1/1 Running 0 75s
microwebapp1-0 1/1 Running 1 (1s ago) 18h
So with the liveness default values, the kubelet checks container liveness every 10 seconds with a 2-second timeout; when the liveness probe fails 3 consecutive times, the pod is restarted. That means roughly 30 seconds of failed probes are enough to trigger a restart.
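To confirm it really is the liveness probe (and not something else) killing the container, the probe failures show up as pod events, e.g.:

% kubectl -n restore-amd-ns describe pod microwebapp1-0

and look for "Liveness probe failed" entries in the Events section.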
Tested InstantOn with Open Liberty 23.0.0.4-beta and IBM Semeru Runtime Open Edition (17.0.7+5): deployed an application checkpoint image on Amazon EKS, ran a stress test, and saw that the application pods keep restarting with:
CWWKE0085I: The server defaultServer is stopping because the JVM is exiting.
Below is the full log.