Open DragonHunter274 opened 1 month ago
Looks like CRIU fails to checkpoint the container here. Can you post a few details about your system? OS, kernel version, container image and pod config used (is this just the nginx example?). Can you also try to increase the scaledown duration a bit, something like: zeropod.ctrox.dev/scaledown-duration: 30s. Just to give the pod a bit longer after startup before it tries to first checkpoint it.
OS is Ubuntu 22.04.4 LTS in a Proxmox LXC container, kernel is 5.15.107-2-pve, and the pod is the unmodified nginx example.
increasing the scaledown duration didn't help
I found some more debug output I missed last time
most notably (00.101208) Error (compel/src/lib/ptrace.c:27): suspending seccomp failed: Operation not permitted
criu check returns Error (criu/config.c:1031): Invalid value for --network-lock: skip
Not sure if that's relevant.
Thanks for the full CRIU log, that helps a lot.
OS is Ubuntu 22.04.4 LTS in a proxmox LXC container
So the k3s node is running inside an LXC container? That might complicate things here but I'm not sure.
Did you by any chance configure a seccomp default profile that enables seccomp for all containers?
Regardless, can you try explicitly disabling seccomp for the nginx pod?
spec:
  template:
    spec:
      containers:
      - image: nginx
        name: nginx
        ports:
        - containerPort: 80
        # add this
        securityContext:
          seccompProfile:
            type: Unconfined
criu check returns Error (criu/config.c:1031): Invalid value for --network-lock: skip. Not sure if that's relevant.
That is probably just because you ran an older version of criu from the OS which does not know about this option yet. You can run the check with the criu binary that zeropod installs like this:
LD_LIBRARY_PATH=/opt/zeropod/lib/ /opt/zeropod/bin/criu check
Yes, the k3s node is running inside an LXC container.
Setting seccomp to Unconfined still results in the same error message. criu check returns Looks good.
EDIT: criu check --all returns
Error (criu/cr-check.c:759): couldn't suspend seccomp: Operation not permitted
Error (criu/cr-check.c:802): Dumping seccomp filters not supported: Permission denied
Error (criu/tun.c:66): tun: Can't check tun support: No such file or directory
Warn (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
I'm pretty sure this is caused by LXC applying seccomp filters to all processes running within the container. CRIU does not have the ability to ignore seccomp filters during checkpoint/restore (https://github.com/checkpoint-restore/criu/issues/2143), so I'm afraid the only way to get zeropod running within that LXC container (barring other roadblocks) would be to simply disable seccomp. I have never really used/configured LXC before but it looks like disabling seccomp is not that straight-forward: https://lists.linuxcontainers.org/pipermail/lxc-users/2020-June/015265.html
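One way to check whether a seccomp filter is actually applied to a process is the Seccomp field in /proc/<pid>/status, where the kernel reports 0 (disabled), 1 (strict) or 2 (filter). Below is a minimal sketch of that check. The helper name seccomp_mode is made up for illustration and is not part of CRIU or zeropod.

```shell
#!/bin/sh
# Report the seccomp mode the kernel lists for a process.
# seccomp_mode is a hypothetical helper name, not part of CRIU or zeropod.
seccomp_mode() {
    # $1: path to a /proc/<pid>/status file (defaults to the calling process)
    status_file="${1:-/proc/self/status}"
    # Match only the "Seccomp:" line, not "Seccomp_filters:"
    mode=$(awk '/^Seccomp:/ {print $2}' "$status_file")
    case "$mode" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}
```

Running seccomp_mode /proc/<pid>/status against the nginx master process would show whether a filter is still attached. If it prints "filter" even with seccompProfile set to Unconfined, that would be consistent with the filter being inherited from the surrounding LXC container, since seccomp filters are passed on to child processes.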
I tried disabling seccomp by providing an empty blacklist as above, but it didn't change anything. I agree, though, it's probably LXC doing something weird here.
After fixing the zeropod deployment I am facing this issue:
The process exists though
root 4050438 0.0 0.0 11416 8028 ? Ss 19:08 0:00 nginx: master process nginx -g daemon off;