Please provide the yaml you use to create your pod initially and also for restore.
I also need to see the CRIU log files, in particular restore.log. Using the CRIU configuration file you can write the log files somewhere else so that they are not deleted.
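For example (just a sketch; this assumes the container runtime reads /etc/criu/runc.conf and that /var/log/criu is an acceptable location on your nodes):

sudo mkdir -p /var/log/criu
cat <<'EOF' | sudo tee -a /etc/criu/runc.conf
# keep CRIU log files in a directory that is not cleaned up after a failed restore
work-dir /var/log/criu
EOF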
@adrianreber I get a log file with the following content:
(00.000000) Unable to get $HOME directory, local configuration file will not be used.
(00.000000) Parsing config file /etc/criu/runc.conf
(00.000100) Version: 3.17.1 (gitid 0)
(00.000114) Running on node1 Linux 3.10.0-1160.90.1.el7.x86_64 #1 SMP Thu May 4 15:21:22 UTC 2023 x86_64
(00.000117) Would overwrite RPC settings with values from /etc/criu/runc.conf
(00.000149) Loaded kdat cache from /run/criu.kdat
(00.000197) Hugetlb size 2 Mb is supported but cannot get dev's number
(00.000211) Hugetlb size 1024 Mb is supported but cannot get dev's number
(00.000621) Added ipc:/var/run/ipcns/ca89c2be-d87d-45e4-84b3-a858c1ac9bf0 join namespace
(00.000644) Added uts:/var/run/utsns/ca89c2be-d87d-45e4-84b3-a858c1ac9bf0 join namespace
(00.000684) Parsing config file /etc/criu/runc.conf
(00.000729) mnt-v2: Mounts-v2 requires MOVE_MOUNT_SET_GROUP support
(00.000735) Mount engine fallback to --mntns-compat-mode mode
(00.000755) rlimit: RLIMIT_NOFILE unlimited for self
(00.001041) Error (criu/lsm.c:411): selinux LSM specified but selinux not supported by kernel
It seems to be a problem with SELinux or the LSM. I still don't understand what's wrong, since I haven't been working with CRIU for very long.
> (00.001041) Error (criu/lsm.c:411): selinux LSM specified but selinux not supported by kernel
Has the destination host the same OS?
Probably not. You cannot migrate from a host with selinux to a host without selinux.
> Has the destination host the same OS?
Yes, my hosts are all CentOS 7, and the hosts that can't restore successfully can't even restore pods in place after a checkpoint (no cross-host restore involved).
> Probably not. You cannot migrate from a host with selinux to a host without selinux.
Since I have one host that can restore successfully, maybe I should compare the SELinux situation between the host that can restore a pod and the hosts that can't. But how? As far as I can tell, they are the same.
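For example, something like this could be run on both kinds of hosts and the output compared (a rough comparison, assuming the standard SELinux tools are installed on all nodes):

# run on a node that can restore and on one that can't, then diff the output
uname -r
sestatus
getenforce
# check that selinuxfs is mounted and readable
mount | grep selinux
cat /sys/fs/selinux/enforce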
@adrianreber Oh! I found that the kernel versions of the servers are not the same: the host with kernel-3.10.0-957.el7.x86_64 can restore successfully, while the hosts with kernel-3.10.0-1160.90.1.el7.x86_64 can't restore a pod. Is that the problem? Maybe kernel-3.10.0-1160.90.1.el7.x86_64 can't serve as the host kernel for restoring a pod? I will run further tests to see if this is the case.
Maybe you have selinux disabled on one host.
> Maybe you have selinux disabled on one host.
The SELinux status is the same on the hosts that can restore and on the ones that can't, as shown below, so the question isn't whether SELinux is disabled:
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: permissive
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 31
Well, CRIU thinks SELinux is disabled. So something must be different. Anyway, do not use CentOS 7. It is too old.
Hi @adrianreber! Thank you for your previous guidance! Following your advice, I changed my system. Now I am running my experiment on a k8s cluster composed of several Ubuntu 22.04.3 virtual machines! But the restore process still fails. Here are the logs I got: criu3.log. The problem I encountered seems different from MaxFuhrich's, which is a cgroupv2 problem. If you have time, could you please help me see how I should adjust the configuration of the virtual machines so that the restore operation can proceed correctly?
You get a segfault during restore which is usually a sign of problems with restartable sequences.
You need at least CRIU 3.17.
> You get a segfault during restore which is usually a sign of problems with restartable sequences.
> You need at least CRIU 3.17.
That's precisely the issue. I upgraded CRIU from version 3.16.1 to version 3.17.1, and everything is working fine now. Thank you very much!
It's worth noting that the Kubernetes logs previously advised me that the minimum required CRIU version for C/R operations was 3.16, which misled my version selection for this attempt.
> It's worth noting that the Kubernetes logs previously advised me that the minimum required CRIU version for C/R operations was 3.16, which misled my version selection for this attempt.
Well, the functionality to checkpoint/restore containers in Kubernetes requires 3.16. If your distribution uses restartable sequences you need 3.17. This was just a bug in your distribution. Ubuntu comes with a known broken CRIU version. There is a bug open for it somewhere but they ignore it.
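To verify what is actually installed and supported on each node, basic checks like these are usually enough (output differs per distribution):

# version that runc/CRI-O will actually use
criu --version
# let CRIU verify the kernel features it needs on this node
sudo criu check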
Please close this issue if your problem is solved.
What happened?
When I try to run a container from a .tar file produced by the checkpoint functionality that kubelet provides, it goes wrong.
The pod restore failed, and kubectl describe shows the following messages:
It seems that CRIU tries to restore the pod according to the container name 'm', but of course the container's path uses its ID instead of its name in the path mentioned above, because the container file names in the path mentioned in the messages are container IDs on the node. I think some settings may not be configured correctly, because I have one node that can restore a pod, but I don't know what the difference is between it and the other machines. It is necessary to add that I have one node, Node1, that can checkpoint and restore a container, while the other nodes can only checkpoint a container but can't restore one. Meanwhile, an image built from the .tar file of the nodes that can't restore a container can be restored on Node1. This tells me the checkpoint is correct on the other nodes but the restore is wrong. I don't know why Node1 can restore and why the other nodes can't. It's necessary to point out that my OS is CentOS 7, and when I try the newest Ubuntu, the same problem occurs.
What did you expect to happen?
I want to live-migrate a pod, but when I try to restore a pod I run into the problem above. As a result, I can only restore pods on one machine of my cluster.
How can we reproduce it (as minimally and precisely as possible)?
First, I call the kubelet checkpoint endpoint to get a checkpoint file:
curl -sk -X POST "https://localhost:10250/checkpoint/<namespace>/<pod_name>/<container_name>" \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  --cacert /etc/kubernetes/pki/ca.crt \
  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt
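The kubelet writes the resulting archive under /var/lib/kubelet/checkpoints/ by default (the file name pattern shown here is illustrative):

ls /var/lib/kubelet/checkpoints/
# e.g. checkpoint-<pod_name>_<namespace>-<container_name>-<timestamp>.tar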
Then I build an image based on the checkpoint.tar file and push it to the image registry, roughly as sketched below.
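A sketch of this step using buildah (the annotation name is the one CRI-O is documented to expect on checkpoint images, as far as I understand; all paths and names in <> are placeholders):

# build an image containing only the checkpoint archive
newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod_name>_<namespace>-<container_name>-<timestamp>.tar /
# tell CRI-O which container this checkpoint belongs to
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container_name> $newcontainer
buildah commit $newcontainer checkpoint-image:latest
buildah push localhost/checkpoint-image:latest <my_registry>/checkpoint-image:latest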
Finally, I deploy the image as a new pod. If I choose my Node1, the pod is restored successfully, but the other nodes fail with the messages above.

Anything else we need to know?
No response
CRI-O and Kubernetes version
OS version
Additional environment details (AWS, VirtualBox, physical, etc.)