checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

docker checkpoint creation not working on AWS EBS but when EFS is mounted it works #1902

Open tomjohntaylor opened 2 years ago

tomjohntaylor commented 2 years ago

When I try to create Docker checkpoints on an AWS EC2 instance running Ubuntu 20.04 LTS with standard gp2 EBS storage, I get the error below. After moving the default Docker directory onto AWS EFS (NFS), checkpointing starts working.

~# docker run -d --name c1 alpine:latest sh -c "while true; do date; sleep 2; done"
7f869efa90477db1a90c3a44474d1613fe27c6a6ea4442d88d957d38e500d49b
~# docker checkpoint create c1 asd
Error response from daemon: Cannot checkpoint container c1: runc did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/moby/7f869efa90477db1a90c3a44474d1613fe27c6a6ea4442d88d957d38e500d49b/criu-dump.log: unknown
~# tail -17 /run/containerd/io.containerd.runtime.v2.task/moby/7f869efa90477db1a90c3a44474d1613fe27c6a6ea4442d88d957d38e500d49b/criu-dump.log
(00.005299) Dumping task (pid: 14170)
(00.005302) ========================================
(00.005304) Obtaining task stat ... 
(00.005327) 
(00.005329) Collecting mappings (pid: 14170)
(00.005331) ----------------------------------------
(00.005390) Found regular file mapping, OK
(00.005435) Error (criu/files-reg.c:1710): Can't lookup mount=679 for fd=-3 path=/bin/busybox
(00.005445) Error (criu/cr-dump.c:1524): Collect mappings (pid: 14170) failed with -1
(00.005473) Unlock network
(00.005475) Running network-unlock scripts
(00.005477)     RPC
(00.007890) Unfreezing tasks into 1
(00.007900)     Unseizing 14170 into 1
(00.007910)     Unseizing 14219 into 1
(00.007930) Error (criu/cr-dump.c:2053): Dumping FAILED.

~# mkdir /var/lib/docker_shared
~# mount -t efs -o _netdev,wsize=1048576000,tls,accesspoint=${EFS_ACCESS_POINT|} ${EFS_DNS}:/ /var/lib/docker_shared
~# echo "${EFS_DNS}:/ /var/lib/docker_shared _netdev,wsize=1048576000,tls,accesspoint=${EFS_ACCESS_POINT|} 0 0"  | cat >> /etc/fstab
~# mkdir /var/lib/docker_shared/$EC2_INSTANCE_ID
~# cp /lib/systemd/system/docker.service /etc/systemd/system/
~# sed -i "s/\ -H\ fd:\/\// -g \/var\/lib\/docker_shared\/$EC2_INSTANCE_ID/g"  /etc/systemd/system/docker.service
~# systemctl daemon-reload
~# systemctl restart docker.service

~# docker run -d --name c1 alpine:latest sh -c "while true; do date; sleep 2; done"
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df9b9388f04a: Pull complete 
Digest: sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454
Status: Downloaded newer image for alpine:latest
6367509770ff6503a6d9d9856a58eed78570ce34747d70ef1a67bf2418cebf5c
~# docker checkpoint create c1 asd
asd

~# df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/root      ext4       59G  3.5G   55G   7% /
devtmpfs       devtmpfs  3.8G     0  3.8G   0% /dev
tmpfs          tmpfs     3.8G     0  3.8G   0% /dev/shm
tmpfs          tmpfs     774M  980K  774M   1% /run
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs     3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/loop0     squashfs   26M   26M     0 100% /snap/amazon-ssm-agent/5656
/dev/loop1     squashfs   56M   56M     0 100% /snap/core18/2344
/dev/loop2     squashfs   62M   62M     0 100% /snap/core20/1434
/dev/loop3     squashfs   45M   45M     0 100% /snap/snapd/15534
/dev/loop4     squashfs   68M   68M     0 100% /snap/lxd/22753
tmpfs          tmpfs     774M     0  774M   0% /run/user/1000
127.0.0.1:/    nfs4      8.0E  9.0M  8.0E   1% /var/lib/checkpoints
127.0.0.1:/    nfs4      8.0E  9.0M  8.0E   1% /var/lib/docker_shared

I also tried a different EBS volume type (io1), and it did not fix the problem.

adrianreber commented 2 years ago

You seem to be hitting an overlayfs bug in the Ubuntu kernel. Upgrading to the latest kernel may fix it, but I am not sure. Other distributions should not be affected; the bug is Ubuntu-only.
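
One way to test the overlayfs theory is to compare which storage driver Docker selects in the failing (EBS) and working (EFS) setups. The sketch below assumes the `docker` CLI is on the PATH; `is_overlay_driver` is a hypothetical helper, not part of Docker or CRIU, and a difference in drivers would only be a hint, not proof of the cause.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given storage driver name is overlay-based.
is_overlay_driver() {
    case "$1" in
        overlay|overlay2) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the daemon only when the docker CLI is available.
if command -v docker >/dev/null 2>&1; then
    # `docker info --format '{{.Driver}}'` prints the active storage driver.
    driver=$(docker info --format '{{.Driver}}' 2>/dev/null || true)
    if is_overlay_driver "$driver"; then
        echo "storage driver: $driver (overlayfs in use)"
    else
        echo "storage driver: ${driver:-unknown} (not overlay-based)"
    fi
fi
```

Running this once with the Docker data directory on EBS and once on EFS shows whether the two setups actually use the same driver.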

avagin commented 2 years ago

What kernel do you use? Could you show /proc/pid/mountinfo from a container?

tomjohntaylor commented 2 years ago

5.13.0-1025-aws
5.13.0-1023-aws
cat: /proc/pid: No such file or directory

avagin commented 2 years ago

cat: /proc/pid: No such file or directory

docker exec NAME cat /proc/1/mountinfo

I think @adrianreber is right: this is a known issue in the Ubuntu kernel: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257
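
To relate the mountinfo output to the `Can't lookup mount=679` error in the dump log, one can check whether that mount ID appears at all in the container's mount table. This is a minimal sketch: `find_mount_id` is a hypothetical helper, and it relies only on the fact that the first field of each mountinfo line is the mount ID.

```shell
#!/bin/sh
# Hypothetical helper: print the mountinfo line whose mount ID (first field)
# equals $1, reading mountinfo text from the file given as $2.
# Exits non-zero when no line matches.
find_mount_id() {
    awk -v id="$1" '$1 == id { print; found = 1 } END { exit found ? 0 : 1 }' "$2"
}

# Example usage (container name c1 and mount ID 679 come from the report above):
#   docker exec c1 cat /proc/1/mountinfo > /tmp/c1.mountinfo
#   find_mount_id 679 /tmp/c1.mountinfo || echo "mount 679 is not in the container's mountinfo"
```

If the ID is absent, that would be consistent with CRIU being unable to resolve the mount for the mapped file, although it does not by itself confirm the kernel bug.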

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.
