checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

crun criu checkpoint failed #1291

Closed eskuai closed 3 years ago

eskuai commented 3 years ago

Hello,

I am following the instructions from https://technology.amis.nl/2018/04/08/first-steps-with-docker-checkpoint-to-create-and-restore-snapshots-of-running-containers/ and got the following error:

[root@k8s-master ~]# docker checkpoint create  --leave-running=true cr checkpoint0
Error response from daemon: Cannot checkpoint container cr: /usr/local/bin/crun did not terminate sucessfully: unknown command checkpoint path= /run/containerd/io.containerd.runtime.v1.linux/moby/2703213fe2b0f727e5cfecad9de32f3d1e1340301aab43e5a1c1509d019cc57e/criu-dump.log: unknown

It logs:

cr: /usr/local/bin/crun did not terminate sucessfully

I don't know whether this is a crun problem.

My config:

crun version 0.15.1
commit: eb0145e5ad4d8207e84a327248af76663d4e50dd
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL

docker daemon.json:

{
    "data-root": "/data01/dockerimages",
    "insecure-registries": ["k8s-master:5000"],
    "max-concurrent-uploads": 10,
    "max-concurrent-downloads": 20,
    "runtimes": { "crun": { "path": "/usr/local/bin/crun", "runtimeArgs": [] } },
    "default-runtime": "crun",
    "experimental": true
} 

docker info shows:

[root@k8s-master ~]# docker info
Client:
 Debug Mode: false

Server:
 Containers: 14
  Running: 12
  Paused: 0
  Stopped: 2
 Images: 10
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: crun runc
 Default Runtime: crun
 Init Binary: /usr/libexec/docker/docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: eb0145e5ad4d8207e84a327248af76663d4e50dd
 init version: N/A (expected: fec3683b971d9c3ef73f284f176672c44b448662)
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.9.9
 Operating System: Red Hat Enterprise Linux 8.3 (Ootpa)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.36GiB
 Name: k8s-master
 ID: YEVT:VBCY:AL3C:TAOI:FUIP:3QA4:LVDF:XWCD:R2MA:JFT5:6RS3:O37V
 Docker Root Dir: /data01/dockerimages
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  k8s-master:5000
  127.0.0.0/8
 Live Restore Enabled: true

criu version:

[root@k8s-master ~]# criu --version
Version: 3.12

If I try a dump with criu directly:

[root@k8s-master ~]# criu dump -t 2703213fe2b0
Warn  (compel/src/lib/infect.c:127): Unable to interrupt task: 2703213 (No such process)
Error (compel/src/lib/infect.c:343): Unable to detach from 2703213: No such process
Error (criu/cr-dump.c:1742): Dumping FAILED.

and the container is running fine:

[root@k8s-master ~]# docker logs 2703213fe2b0
997
998
999
1000
1001
1002
1003
1004
1005
1006

Can I ask for help here, or is this a crun issue?

Thanks

adrianreber commented 3 years ago

It sounds like a crun problem. But since I wrote the crun CRIU integration, I would also be the one answering there.

However, I have never tested crun with Docker, only with Podman. crun also needs at least CRIU 3.15 to work correctly, and the version of crun you are using (0.15.1) does not have full checkpoint/restore support: https://github.com/containers/crun/blob/0.15.1/src/crun.c#L121

So your CRIU version is too old to work with crun, your crun version is too old and does not correctly support checkpoint/restore, and the crun CRIU integration has never been tested with Docker.
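The version requirement above can be checked before attempting a checkpoint. The following is a minimal sketch, not part of crun or CRIU: the helper names `version_at_least` and `parse_criu_version` are made up for illustration, the `Version: 3.12` line format is taken from the `criu --version` output in this thread, and the comparison assumes GNU `sort -V` is available.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; assumes GNU coreutils `sort -V`.

# version_at_least HAVE MIN: succeeds if HAVE >= MIN in version order.
version_at_least() {
    # The smaller of the two versions sorts first; HAVE is new enough
    # exactly when that smallest entry is MIN.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Reads `criu --version` style output on stdin and prints the version number,
# assuming a line of the form "Version: X.Y" as shown in this thread.
parse_criu_version() {
    sed -n 's/^Version: *//p' | head -n 1
}

# Example using the version reported in this issue.
have=$(printf 'Version: 3.12\n' | parse_criu_version)
if version_at_least "$have" 3.15; then
    echo "CRIU $have: new enough for crun"
else
    echo "CRIU $have: too old, crun needs at least 3.15"
fi
```

On a real system the sample line would be replaced by `criu --version | parse_criu_version`.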

There are lots of things you need to fix before you can try it out. Also, it looks like you are running RHEL 8.3 with a 5.9.9 kernel. You are mixing a lot of things that are not tested together. If you are using RHEL 8.3, use the RHEL kernel, Podman from RHEL, and runc instead of crun; it should then work with CRIU from RHEL 8.3 (which should also be newer than 3.12).

Closing, as this is not a CRIU error but just the wrong combination of many things.
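A side note on the `criu dump -t 2703213fe2b0` attempt earlier in the thread: `-t` expects a process ID, and the log shows CRIU acting on `2703213`, the leading digits of the container ID, which is why it reports "No such process". The sketch below only illustrates that distinction; `is_live_pid` is a hypothetical helper, and it assumes a Linux `/proc` filesystem.

```shell
#!/bin/sh
# Hypothetical helper for illustration; assumes Linux with /proc mounted.

# is_live_pid ARG: succeeds only if ARG is all digits and names a running process.
is_live_pid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit: not a PID
    esac
    [ -d "/proc/$1" ]
}

target="2703213fe2b0"   # the container ID prefix used in this thread
if is_live_pid "$target"; then
    echo "$target is a running PID"
else
    echo "$target is not a PID; look up the container's init PID first"
fi
```

With Docker, `docker inspect --format '{{.State.Pid}}' cr` prints the host PID of the container's main process. Dumping that process tree with plain `criu dump`, without going through the container runtime, may still fail for other reasons, so this explains the error message rather than offering a workaround.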