Open abdulmateen-1 opened 1 month ago
FYI, this is being discussed in the private instanton Slack channel with Java/Semeru team. Depending on the discussion, this might get moved elsewhere: https://ibm-cloud.slack.com/archives/C03MR7EC3NG/p1715103170419049
We tested different Test case scenarios to properly determine the problem is from
The restore was successful on the P9 VM machine but failed on the OCP-P9 cluster.
(00.059541) pie: 1025: mmap(0x7a1183240000 -> 0x7a1183250000, 0x3 0x12 398)
(00.059548) pie: 1025: mmap(0x7a1183250000 -> 0x7a11832b0000, 0x7 0x12 399)
(00.059554) pie: 1025: mmap(0x7a11832b0000 -> 0x7a11832c0000, 0x3 0x12 399)
(00.059560) pie: 1025: mmap(0x7a11832c0000 -> 0x7a11832d0000, 0x3 0x12 399)
(00.059567) pie: 1025: mmap(0x7a11832d0000 -> 0x7a11833a0000, 0x3 0x32 -1)
(00.059573) pie: 1025: mmap(0x7a11833a0000 -> 0x7a11833b0000, 0x0 0x32 -1)
(00.059578) pie: 1025: mmap(0x7a11833b0000 -> 0x7a11833f0000, 0x3 0x32 -1)
(00.059583) pie: 1025: mmap(0x7a11833f0000 -> 0x7a1183400000, 0x1 0x11 400)
(00.059717) pie: 1025: Error (criu/pie/restorer.c:1676): Can't restore 0x7a11833f0000 mapping with 0xfffffffffffffff3
(00.059728) pie: 1025: Error (criu/pie/restorer.c:2102): Restorer fail 1025
(00.059965) Error (criu/cr-restore.c:2547): Restoring FAILED.
restore.log
ebuy-instanton-1-76df794d95-bxf4l-app.log
The checkpoint on the P10 machine initially failed with this error.
[ERROR ] CWWKE0701E: bundle com.ibm.ws.tx.embeddable.jakarta:1.0.89.cl240620240507-2000 (455)[com.ibm.tx.jta.embeddable.impl.EmbeddableTMHelper(314)] : The shutdown method has thrown an exception java.lang.NullPointerException: Cannot invoke "com.ibm.tx.jta.impl.TxRecoveryAgentImpl.stop(boolean)" because "com.ibm.tx.jta.util.TxTMHelper._recoveryAgent" is null
at com.ibm.tx.jta.util.TxTMHelper.shutdown(TxTMHelper.java:545)
at [internal classes]
[AUDIT ] CWWKE0036I: The server defaultServer stopped after 8.602 seconds.
CWWKE0962E: The server checkpoint request failed. The following output is from the CRIU /logs/checkpoint/checkpoint.log file that contains details on why the checkpoint failed.
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn (compel/src/lib/infect.c:133): Unable to interrupt task: 1117 (Operation not permitted)
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Warn (criu/libnetlink.c:84): Can't send request message
Error (criu/sockets.c:436): sockets: Sockets (family 1 proto 0) are not collected
Error (criu/sk-unix.c:351): unix: Unix socket 1351736 not found
Error (criu/cr-dump.c:1669): Dump files (pid: 1023) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.
CWWKE0963E: The server checkpoint request failed because netlink system calls were unsuccessful. If SELinux is enabled in enforcing mode, netlink system calls might be blocked by the SELinux "virt_sandbox_use_netlink" policy setting. Either disable SELinux or enable the netlink system calls with the "setsebool virt_sandbox_use_netlink 1" command.
Following the request by InstantOn team, On a regression scenario to taking checkpoint on a P10 image and restore: Try the getsebool virt_sandbox_use_netlink command, if that returns 0 do setsebool, virt_sandbox_use_netlink 1 and try the checkpoint again.
We ran the below command to get it working on a P10 machine.
getsebool virt_sandbox_use_netlink command, if that returns 0
do setsebool virt_sandbox_use_netlink 1 and try the checkpoint again
this time the restore was successful on the P10 image, but the restore on OCP-P9 image failed but with a different error.
(00.143171) 1023 (native) is going to execute the syscall 91, required is 172
(00.143190) 1023 was trapped
(00.143200) 1023 (native) is going to execute the syscall 172, required is 172
(00.143225) 1023 was stopped
(00.143262) 1023 was trapped
(00.143274) 1023 (native) is going to execute the syscall 91, required is 91
(00.143335) 1023 was stopped
(00.143352) Run late stage hook from criu master for external devices
(00.143359) restore late stage hook for external plugin failed
(00.143367) Run late stage hook from criu master for external devices
(00.143374) restore late stage hook for external plugin failed
(00.143381) Running pre-resume scripts
(00.146642) Error (criu/cr-restore.c:2146): Unable to detach 1077: No such process
(00.146674) Error (criu/cr-restore.c:2520): Killing processes because of failure on restore.
The Network was unlocked so some data or a connection may have been lost.
(00.147040) Error (criu/cr-restore.c:2547): Restoring FAILED.
ebuy-instanton-1-76df794d95-zfl99-app.log
This time we tried to deploy a different application, to see if the error had to do with the app. The restore was successful on the OCP-P9 cluster this time around. ebuy-instanton-1-76df794d95-bxf4l-app.log
Old checkpoint images which were taken using WLO 24.0.0.3 are getting restored fine on OCP -P 4.15 cluster for eBuy.
A new image was built using 24.0.0.3 driver. The restore was successful on the on-prem VM but failed on the OCP instance. ebuy-instanton-1-7b64849466-nqnn2-app.log
On Power Linux platforms OpenJ9 loads a special file called /proc/powerpc/systemcfg
. This is a virtual file provided by the kernel and it contains detailed info about the system. We use some of the info to provide faster time keeping functions. Loading of this file is optional; if we fail to load it we gracefully fall back.
On systems that have SELinux set to Enforcing mode access to this file is blocked. This can be seen in the audit log /var/log/audit/audit.log
:
type=AVC msg=audit(1715833184.627:1597): avc: denied { map } for pid=1316985 comm="java" path="/proc/powerpc/systemcfg" dev="proc" ino=4026531945 scontext=system_u:system_r:container_t:s0:c30,c433 tcontext=system_u:object_r:proc_t:s0 tclass=file permissive=0
This problem is happening because the checkpoint was taken on a system where access to this file was not blocked (Ubuntu 22.04, SELinux either not installed, disabled, or set to Permissive mode) and the restore was attempted on a system where access is blocked (OCP 4.15, SELinux set to Enforcing mode).
The older checkpoint image taken with the old 24.0.0.3 driver that can still be restored on OCP works because the JVM did not load the systemcfg file. That checkpoint was taken on a RHEL 9 system, not an Ubuntu system, and that system likely had SELinux set to Enforcing mode, which prevented us from loading systemcfg in the first place. That checkpoint can be restored on OCP without any problem because we don't try to reload systemcfg.
This is also why taking a new checkpoint on Ubuntu with the old 24.0.0.3 driver still fails. If we set SELinux to Enforcing mode on Ubuntu, or take the checkpoint on a RHEL 9 machine, it will likely restore fine on OCP. Alternatively, disabling SELinux or setting it to Permissive mode on OCP allows the Ubuntu checkpoint to restore (I tested this scenario).
My conclusion is that this is not a regression, just a combination of checkpoint host and restore host that was not previously tested. All of the previous builds should show the same problem.
Since the systemcfg file is not necessary we can probably just avoid loading it in the first place if -XX:+EnableCRIUSupport
is used and provide a better experience.
After setting SELinux to permissive option, we did perform three scenarios to verify things worked properly
As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.
As customers wouldn't want to disable SELinux on their OCP cluster because that isn't best practice what would be the best practice here for a customer to do when building a checkpoint image.
If users use SELinux Enforcing on the checkpoint side they can also keep it on the restore side. On Ubuntu SELinux is typically not used or is used in Permissive mode, so in this case my advice would be to either set up SELinux Enforcing on Ubuntu, or build checkpoint images on RHEL, where SELinux Enforcing is the default.
We can improve this particular case in future JVM releases, but this kind of issue will probably come up again for other reasons.
Thanks @ymanton! We still need to figure out why restore is failing on OCP when the images are built on P10 machine which is RHEL and seems to SELInux enabled. Do we know why those are failing? For example, 24003 images built earlier vs build now. Do you think we are disabling SELInux somehow when we took checkpoint agin on 24003 on RHEL based P10 machine?
The problem that's seen when we take a checkpoint on a P10 RHEL machine and fail to restore on OCP is actually a crash late in the restore process. The evidence can be seen in the kernel log:
[1207617.180242] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.180248] Common-Cleaner[3883760]: bad frame in handle_rt_signal64: 00000000e778bd07 nip 00007fff90059720 lr 00007fff90ce0464
[1207617.180251] Attach API wait[3883768]: bad frame in handle_rt_signal64: 00000000f27d5730 nip 00007fff90a3506c lr 00007fff8bfca234
[1207617.185052] systemd-coredump[3883805]: Failed to get COMM: No such process
The above are Java threads that CRIU attempted to restore, but they hit errors in the kernel when CRIU attempted to run them and the process was terminated.
Thanks for the clarification @ymanton. Would we classify this as an application issue or a kernel issue?
Unlikely to be a kernel bug, probably a problem in CRIU and the way we use it. We've seen problems like this before and fixed or worked around them in CRIU, the JVM, or Liberty. I expect something like that here, once I figure out exactly what's going wrong and why.
Java Defect: https://github.com/eclipse-openj9/openj9/issues/19511
Hello team, we're on Liberty 24.0.0.5 and Java 21. We did a checkpoint on an EBC P9 VM and did a restore on an OCP P9 VM which failed failed with the error below.
The lscpu from OCP Power9
The result of uname -r on both machines
To get more trace we set the CRIU logging level to 4 and I included the output of the log file. ebuy-instanton-1-597864cdcd-fk8lz-app.log