OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0

Restore failing for 23.0.0.10 OL with Java 11 on EKS cluster #26523

Open mtamboli opened 11 months ago

mtamboli commented 11 months ago

I am testing the latest 23.0.0.10 OL images, which are based on UBI 9 and Java 11. The checkpoint was taken on an AMD Ubuntu machine, and I am trying to restore it as part of deploying the application to an EKS cluster. Please let me know if you need any further information.

FYI @tjwatson @tam512 @rumanaHaque


kubectl config set-context --current --namespace=ebuy-amdub22-11olf
Context "arn:aws:eks:us-east-2:439159788015:cluster/mstEKScluster" modified.

[root@04643dbe8e4f ~]# kubectl get pods
NAME                   READY   STATUS             RESTARTS        AGE
ebuy-amdub22-11olf-0   0/1     CrashLoopBackOff   156 (51s ago)   13h

[root@04643dbe8e4f ~]# kubectl get olapp
NAME                 IMAGE                                                                                                                                                                     EXPOSED   RECONCILED   RESOURCESREADY   READY   AGE
ebuy-amdub22-11olf   docker-na-public.artifactory.swg-devops.com/hyc-wassvt-team-image-registry-docker-loc

[root@04643dbe8e4f ~]# kubectl logs ebuy-amdub22-11olf-0
Found mounted TLS certificates, generating keystore
Found mounted TLS CA certificate, adding to truststore

CWWKE0964E: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. The server did not launch because checkpoint restore recovery is disabled.
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Error (criu/sockets.c:210): sockets: Diag module missing (-2)
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
  1030: Warn  (criu/sockets.c:544): sockets:    socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
  1030: Warn  (criu/sockets.c:544): sockets:    socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
  1030: Warn  (criu/sockets.c:544): sockets:    socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
  1030: Warn  (criu/sockets.c:544): sockets:    socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
  1030: Warn  (criu/sockets.c:544): sockets:    socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
  1030: Warn  (criu/sockets.c:544): sockets:    socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
  1030: Warn  (criu/sockets.c:544): sockets:    socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
Error (criu/cr-restore.c:1514): 1030 killed by signal 11: Segmentation fault
Error (criu/cr-restore.c:2547): Restoring FAILED.

[root@04643dbe8e4f ~]# kubectl -n ebuy-amdub22-11olf exec ebuy-amdub22-11olf-0 -- cat /logs/checkpoint/restore.log
error: unable to upgrade connection: container not found ("app")

mtamboli commented 11 months ago

The problem also occurs on images based on Java 17. Here is a more detailed log: ebuy-amdrh90-17ol.log

mtamboli commented 11 months ago

Verified that this problem does not happen for OL images based on UBI 8.

leochr commented 11 months ago

Is there an equivalent issue with the Semeru/Java team for this? We just want to track progress to ensure it's resolved prior to the DCUT for the Liberty 24.0.0.1 release.

malincoln commented 11 months ago

Yes, it's https://github.com/ibmruntimes/Semeru-Runtimes/issues/62

malincoln commented 7 months ago

Java issue is now closed.