OpenLiberty / ci.docker

Eclipse Public License 1.0
42 stars 59 forks source link

Restore failed on OCP Power9 for Java 21 #540

Open abdulmateen-1 opened 1 month ago

abdulmateen-1 commented 1 month ago

Java Defect: https://github.com/eclipse-openj9/openj9/issues/19511

Hello team, we're on Liberty 24.0.0.5 and Java 21. We did a checkpoint on an EBC P9 VM and did a restore on an OCP P9 VM which failed failed with the error below.

Found mounted TLS certificates, generating keystore
CWWKE0964E: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. The server did not launch because checkpoint restore recovery is disabled.
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
1023: Warn (criu/sockets.c:544): sockets: socket has dumped SO_BUF_LOCK state but kernel doesn't support SO_BUF_LOCK
pie: 1023: Error (criu/pie/restorer.c:1676): Can't restore 0x78f367cf0000 mapping with 0xfffffffffffffff3
pie: 1023: Error (criu/pie/restorer.c:2102): Restorer fail 1023
Error (criu/cr-restore.c:2547): Restoring FAILED.

The lscpu from OCP Power9

Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           8
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

The result of uname -r on both machines

OCP
sh-4.4# uname -r
5.14.0-284.59.1.el9_2.ppc64le

P9VM
5.15.0-97-generic

To get more trace we set the CRIU logging level to 4 and I included the output of the log file. ebuy-instanton-1-597864cdcd-fk8lz-app.log

leochr commented 1 month ago

FYI, this is being discussed in the private instanton Slack channel with Java/Semeru team. Depending on the discussion, this might get moved elsewhere: https://ibm-cloud.slack.com/archives/C03MR7EC3NG/p1715103170419049

abdulmateen-1 commented 1 month ago

We tested different Test case scenarios to properly determine the problem is from

1. Take the checkpoint and restore on a P9 image if successful, perform a restore on an OCP-P9 cluster

The restore was successful on the P9 VM machine but failed on the OCP-P9 cluster.

(00.059541) pie: 1025:  mmap(0x7a1183240000 -> 0x7a1183250000, 0x3 0x12 398)
(00.059548) pie: 1025:  mmap(0x7a1183250000 -> 0x7a11832b0000, 0x7 0x12 399)
(00.059554) pie: 1025:  mmap(0x7a11832b0000 -> 0x7a11832c0000, 0x3 0x12 399)
(00.059560) pie: 1025:  mmap(0x7a11832c0000 -> 0x7a11832d0000, 0x3 0x12 399)
(00.059567) pie: 1025:  mmap(0x7a11832d0000 -> 0x7a11833a0000, 0x3 0x32 -1)
(00.059573) pie: 1025:  mmap(0x7a11833a0000 -> 0x7a11833b0000, 0x0 0x32 -1)
(00.059578) pie: 1025:  mmap(0x7a11833b0000 -> 0x7a11833f0000, 0x3 0x32 -1)
(00.059583) pie: 1025:  mmap(0x7a11833f0000 -> 0x7a1183400000, 0x1 0x11 400)
(00.059717) pie: 1025: Error (criu/pie/restorer.c:1676): Can't restore 0x7a11833f0000 mapping with 0xfffffffffffffff3
(00.059728) pie: 1025: Error (criu/pie/restorer.c:2102): Restorer fail 1025
(00.059965) Error (criu/cr-restore.c:2547): Restoring FAILED.

restore.log ebuy-instanton-1-76df794d95-bxf4l-app.log

2. Take the checkpoint and restore on a P10 image if sucessfuly, perform a restore on an OCP-P9 cluster

The checkpoint on the P10 machine initially failed with this error.

[ERROR   ] CWWKE0701E: bundle com.ibm.ws.tx.embeddable.jakarta:1.0.89.cl240620240507-2000 (455)[com.ibm.tx.jta.embeddable.impl.EmbeddableTMHelper(314)] : The shutdown method has thrown an exception java.lang.NullPointerException: Cannot invoke "com.ibm.tx.jta.impl.TxRecoveryAgentImpl.stop(boolean)" because "com.ibm.tx.jta.util.TxTMHelper._recoveryAgent" is null
    at com.ibm.tx.jta.util.TxTMHelper.shutdown(TxTMHelper.java:545)
    at [internal classes]

[AUDIT   ] CWWKE0036I: The server defaultServer stopped after 8.602 seconds.
CWWKE0962E: The server checkpoint request failed. The following output is from the CRIU /logs/checkpoint/checkpoint.log file that contains details on why the checkpoint failed.
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (compel/src/lib/infect.c:133): Unable to interrupt task: 1117 (Operation not permitted)
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Warn  (criu/libnetlink.c:84): Can't send request message
Error (criu/sockets.c:436): sockets: Sockets (family 1 proto 0) are not collected
Error (criu/sk-unix.c:351): unix: Unix socket 1351736 not found
Error (criu/cr-dump.c:1669): Dump files (pid: 1023) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.
CWWKE0963E: The server checkpoint request failed because netlink system calls were unsuccessful. If SELinux is enabled in enforcing mode, netlink system calls might be blocked by the SELinux "virt_sandbox_use_netlink" policy setting. Either disable SELinux or enable the netlink system calls with the "setsebool virt_sandbox_use_netlink 1" command.

Following the request by InstantOn team, On a regression scenario to taking checkpoint on a P10 image and restore: Try the getsebool virt_sandbox_use_netlink command, if that returns 0 do setsebool, virt_sandbox_use_netlink 1 and try the checkpoint again.

We ran the below command to get it working on a P10 machine.

getsebool virt_sandbox_use_netlink command, if that returns 0 
do setsebool virt_sandbox_use_netlink 1 and try the checkpoint again

this time the restore was successful on the P10 image, but the restore on OCP-P9 image failed but with a different error.

(00.143171) 1023 (native) is going to execute the syscall 91, required is 172
(00.143190) 1023 was trapped
(00.143200) 1023 (native) is going to execute the syscall 172, required is 172
(00.143225) 1023 was stopped
(00.143262) 1023 was trapped
(00.143274) 1023 (native) is going to execute the syscall 91, required is 91
(00.143335) 1023 was stopped
(00.143352) Run late stage hook from criu master for external devices
(00.143359) restore late stage hook for external plugin failed
(00.143367) Run late stage hook from criu master for external devices
(00.143374) restore late stage hook for external plugin failed
(00.143381) Running pre-resume scripts
(00.146642) Error (criu/cr-restore.c:2146): Unable to detach 1077: No such process
(00.146674) Error (criu/cr-restore.c:2520): Killing processes because of failure on restore.
The Network was unlocked so some data or a connection may have been lost.
(00.147040) Error (criu/cr-restore.c:2547): Restoring FAILED.

ebuy-instanton-1-76df794d95-zfl99-app.log

3. Change the application the checkpoint is performed on and restore on an OCP-P9 cluster.

This time we tried to deploy a different application, to see if the error had to do with the app. The restore was successful on the OCP-P9 cluster this time around. ebuy-instanton-1-76df794d95-bxf4l-app.log

mtamboli commented 1 month ago

Old checkpoint images which were taken using WLO 24.0.0.3 are getting restored fine on OCP -P 4.15 cluster for eBuy.

abdulmateen-1 commented 1 month ago

A new image was built using 24.0.0.3 driver. The restore was successful on the on-prem VM but failed on the OCP instance. ebuy-instanton-1-7b64849466-nqnn2-app.log

ymanton commented 1 month ago

On Power Linux platforms OpenJ9 loads a special file called /proc/powerpc/systemcfg. This is a virtual file provided by the kernel and it contains detailed info about the system. We use some of the info to provide faster time keeping functions. Loading of this file is optional; if we fail to load it we gracefully fall back.

On systems that have SELinux set to Enforcing mode access to this file is blocked. This can be seen in the audit log /var/log/audit/audit.log:

type=AVC msg=audit(1715833184.627:1597): avc:  denied  { map } for  pid=1316985 comm="java" path="/proc/powerpc/systemcfg" dev="proc" ino=4026531945 scontext=system_u:system_r:container_t:s0:c30,c433 tcontext=system_u:object_r:proc_t:s0 tclass=file permissive=0

This problem is happening because the checkpoint was taken on a system where access to this file was not blocked (Ubuntu 22.04, SELinux either not installed, disabled, or set to Permissive mode) and the restore was attempted on a system where access is blocked (OCP 4.15, SELinux set to Enforcing mode).

The older checkpoint image taken with the old 24.0.0.3 driver that can still be restored on OCP works because the JVM did not load the systemcfg file. That checkpoint was taken on a RHEL 9 system, not an Ubuntu system, and that system likely had SELinux set to Enforcing mode, which prevented us from loading systemcfg in the first place. That checkpoint can be restored on OCP without any problem because we don't try to reload systemcfg.

This is also why taking a new checkpoint on Ubuntu with the old 24.0.0.3 driver still fails. If we set SELinux to Enforcing mode on Ubuntu, or take the checkpoint on a RHEL 9 machine, it will likely restore fine on OCP. Alternatively, disabling SELinux or setting it to Permissive mode on OCP allows the Ubuntu checkpoint to restore (I tested this scenario).

My conclusion is that this is not a regression, just a combination of checkpoint host and restore host that was not previously tested. All of the previous builds should show the same problem.

Since the systemcfg file is not necessary we can probably just avoid loading it in the first place if -XX:+EnableCRIUSupport is used and provide a better experience.

abdulmateen-1 commented 1 month ago

After setting SELinux to permissive option, we did perform three scenarios to verify things worked properly

  1. The 24.0.0.5 liberty we opened the case for with checkpoint taken on a Ubuntu EBC Power9 VM. The restore was successfully completed.
  2. A new image built with 24.0.03 driver on a RHEL Power10 machine with netlink system enabled. The restore successfully completed
  3. An old image built with 24.0.0.3 that's being verified to restore successfully on an OCP Power cluster. We tried this image to verify if it's a regression issue. The image also restored successfully.

As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.

ymanton commented 1 month ago

As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.

If users use SELinux Enforcing on the checkpoint side they can also keep it on the restore side. On Ubuntu SELinux is typically not used or is used in Permissive mode, so in this case my advice would be to either set up SELinux Enforcing on Ubuntu, or build checkpoint images on RHEL, where SELinux Enforcing is the default.

We can improve this particular case in future JVM releases, but this kind of issue will probably come up again for other reasons.

mtamboli commented 1 month ago

Thanks @ymanton! We still need to figure out why restore is failing on OCP when the images are built on P10 machine which is RHEL and seems to SELInux enabled. Do we know why those are failing? For example, 24003 images built earlier vs build now. Do you think we are disabling SELInux somehow when we took checkpoint agin on 24003 on RHEL based P10 machine?

ymanton commented 4 weeks ago

The problem that's seen when we take a checkpoint on a P10 RHEL machine and fail to restore on OCP is actually a crash late in the restore process. The evidence can be seen in the kernel log:

[1207617.180242] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.180248] Common-Cleaner[3883760]: bad frame in handle_rt_signal64: 00000000e778bd07 nip 00007fff90059720 lr 00007fff90ce0464
[1207617.180251] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.185052] systemd-coredump[3883805]: Failed to get COMM: No such process

The above are Java threads that CRIU attempted to restore, but they hit errors in the kernel when CRIU attempted to run them and the process was terminated.

abdulmateen-1 commented 3 weeks ago

Thanks for the clarification @ymanton. Would we classify this as an application issue or a kernel issue?

ymanton commented 3 weeks ago

Unlikely to be a kernel bug, probably a problem in CRIU and the way we use it. We've seen problems like this before and fixed or worked around them in CRIU, the JVM, or Liberty. I expect something like that here, once I figure out exactly what's going wrong and why.