OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0

OL23.0.0.4-beta: application pods keep restarting during stress test #24906

Open · tam512 opened 1 year ago

tam512 commented 1 year ago

Tested InstantOn with Open Liberty 23.0.0.4-beta and IBM Semeru Runtime Open Edition (17.0.7+5). Deployed an application checkpoint image on Amazon EKS, ran a stress test, and saw that the application pods keep restarting.

% kubectl -n restore-amd-ns get pods -w
NAME             READY   STATUS             RESTARTS      AGE
microwebapp1-0   1/1     Running            2 (24s ago)   18m
microwebapp1-1   1/1     Running            6 (33s ago)   14m
microwebapp1-2   0/1     CrashLoopBackOff   6 (80s ago)   13m

tam512 commented 1 year ago

This issue is caused by the default values of spec.probes.liveness, in particular timeoutSeconds
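A possible mitigation (a sketch only; it assumes the operator passes the standard Kubernetes probe fields through spec.probes.liveness as the comment above suggests, and the values shown are illustrative, not tested) is to loosen the liveness probe in the WebSphereLibertyApplication CR:

```yaml
# Hypothetical excerpt of the WebSphereLibertyApplication CR used in this
# test; only the probe stanza is shown, and the values are illustrative.
spec:
  probes:
    liveness:
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 10    # raised so heavy load doesn't time the probe out
      failureThreshold: 6   # tolerate a few slow responses before restarting
```

A longer timeoutSeconds gives a CPU-starved container a chance to answer the probe instead of being killed.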

tam512 commented 1 year ago

Tested the app image without InstantOn checkpoints and deployed with the original settings in the WebSphereLibertyApplication YAML (default liveness probe and limits.cpu: 1)

  1. Build app without checkpoint using FROM stg.icr.io/cp/olc/open-liberty:beta-instanton-java17-openj9-ubi@sha256:1258654b82aadac51e8cf47247661fa1bfcb1b06d3a125be30caef21dd0da086
    • versions show
      product = Open Liberty 23.0.0.4-beta (wlp-1.0.75.cl230320230319-1900)
      java.version = 17.0.7
      java.runtime = IBM Semeru Runtime Open Edition (17.0.7+5)
    • Still see pods restart
      NAME             READY   STATUS    RESTARTS      AGE
      microwebapp1-0   1/1     Running   0             8m49s
      microwebapp1-1   1/1     Running   4 (21s ago)   7m3s
      microwebapp1-2   1/1     Running   0             6m48s
  2. Rebuild app image without instantOn checkpoints using FROM icr.io/appcafe/open-liberty:beta
    • product version shows
      product = Open Liberty 23.0.0.3-beta (wlp-1.0.74.cl230220230222-1257)
      java.version = 17.0.5
      java.runtime = IBM Semeru Runtime Open Edition (17.0.5+8)
    • still see pods restart, but it seems to take longer for the pods to restart
    • after 20 minutes
      NAME             READY   STATUS    RESTARTS      AGE
      microwebapp1-0   1/1     Running   1 (14m ago)   19m
      microwebapp1-1   1/1     Running   0             15m
      microwebapp1-2   1/1     Running   0             15m
    • after 54 minutes
      NAME             READY   STATUS    RESTARTS        AGE
      microwebapp1-0   1/1     Running   3 (6m12s ago)   54m
      microwebapp1-1   1/1     Running   0               51m
      microwebapp1-2   1/1     Running   0               50m

      So this problem happens even without InstantOn

pgunapal commented 1 year ago

It's interesting that the liveness probe is constantly failing, since it should almost always return UP if the server is started (assuming there are no user-defined health checks). The fact that it is returning DOWN, or worse, timing out without giving any status, leads me to think something might be going on with the environment due to the heavy stress.

tam512 commented 1 year ago

I did some more testing to debug this issue and here is what I found:

  1. This problem only happens during stress/load test so point 1 above is not applicable
  2. I tested this with an InstantOn checkpoint image; the server starts very quickly (0.2 to 0.5 seconds), so I think the default initialDelaySeconds of 60 seconds is OK in this case.
  3. I tried to reduce the load and monitor CPU on the pods. I started with 1 thread and doubled it (1, 2, 4, 8, 16). I could run with 16 threads without seeing pods restart, and the CPU usage on the pods didn't go over the CPU limit of 1024m. (Note that I have spec.resources.limits.cpu: 1024m, spec.resources.limits.memory: 512Mi, and spec.autoscaling.maxReplicas: 3.)

    • snapshot of CPU usage on pods when running with 16 threads

      % watch kubectl top pod -n restore-amd-ns
      NAME             CPU(cores)   MEMORY(bytes)
      microwebapp1-0   818m         162Mi
      microwebapp1-1   947m         162Mi
      microwebapp1-2   807m         161Mi

    • pods are up and running OK with 16 threads

      % kubectl -n restore-amd-ns get pods -w
      NAME             READY   STATUS    RESTARTS   AGE
      microwebapp1-0   1/1     Running   0          18h
      microwebapp1-1   1/1     Running   0          18h
      microwebapp1-2   1/1     Running   0          18h

  4. When I started running with 20 threads, I saw that when the CPU usage on a pod reached 1024m, the pod was restarted
    • snapshot of CPU reaching the limit when running with 20 threads
      NAME             CPU(cores)   MEMORY(bytes) 
      microwebapp1-0   1024m        137Mi
      microwebapp1-1   803m         140Mi
      microwebapp1-2   833m         144Mi
    • pod restart
      % kubectl -n restore-amd-ns get pods -w
      NAME             READY   STATUS    RESTARTS   AGE
      microwebapp1-0   1/1     Running   0          18h
      microwebapp1-1   1/1     Running   0          77s
      microwebapp1-2   1/1     Running   0          75s
      microwebapp1-0   1/1     Running   1 (1s ago)   18h
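The restarts correlating with a pod hitting its 1024m CPU limit suggests CPU throttling: once the container is throttled, the probe endpoint may not respond within the timeout. A sketch of a resources stanza with more CPU headroom (the 1024m and 512Mi values come from the comment above; everything else is an assumption, not a tested recommendation):

```yaml
spec:
  resources:
    requests:
      cpu: 1024m       # assumed request, matching the old limit
      memory: 512Mi
    limits:
      cpu: 2048m       # doubled from 1024m so throttling kicks in later
      memory: 512Mi
```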

So with the liveness probe defaults, the kubelet checks container liveness every 10 seconds with a timeout of 2 seconds; when the liveness probe fails 3 times in a row, the pod is restarted.
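The timing above can be checked with a little arithmetic (the probe values are the defaults quoted in this thread, not read from a live cluster):

```python
# Probe defaults as described in this thread (assumptions, not queried
# from the cluster): check every 10s, 2s timeout, 3 failures to restart.
period_seconds = 10
timeout_seconds = 2
failure_threshold = 3

# Once the app stops answering within the timeout, the kubelet needs
# about failure_threshold consecutive probe periods before it restarts
# the container.
worst_case_restart_window = failure_threshold * period_seconds
print(worst_case_restart_window)  # → 30
```

So roughly 30 seconds of sustained probe failure under CPU starvation is enough to trigger a restart, which is consistent with how quickly the pods cycled under load.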