checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

How to use the "--enable-external-masters" option in the Docker checkpoint feature (integrated with CRIU)? #2472

nwpuhkp closed this issue 1 month ago

nwpuhkp commented 1 month ago

Description

When I tried to create a checkpoint for a Docker container that is using a GPU, I encountered the error:

"Error response from daemon: Cannot checkpoint container xxxxx: nvidia-container-runtime did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/moby/a0a79717b8f17ca7c40357c7c16e38e93b2ae7d634c923f2e3d94286a6132abc/criu-dump.log: unknown".

According to the log file, it suggests trying the --enable-external-masters option, but this option cannot be used directly in the Docker checkpoint command. What should I do to enable Docker to create a checkpoint normally?

Steps to reproduce the issue:

The command I executed is: `docker checkpoint create xxxxxx checkpoint1 --leave-running=True`

Describe the results you received:

Passing the option directly to Docker fails. `docker checkpoint create autoware-hkp checkpoint1 --leave-running=True --enable-external-masters` prints `unknown flag: --enable-external-masters`, followed by `See 'docker checkpoint create --help'.`

Without the extra flag, the checkpoint fails and the CRIU dump log ends with:

Error (criu/mount.c:1088): mnt: Mount 2450 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
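To make the error above more concrete: the `master_id` and `shared_id` values correspond to the `master:N` and `shared:N` optional fields in `/proc/<pid>/mountinfo`. A mount is a slave of peer group N when it carries `master:N`; if no mount visible in the same mount namespace carries `shared:N`, the master lives outside the container. The following is a minimal Python sketch of that idea, not CRIU's actual check in `criu/mount.c`, which is more involved. The names `parse_mountinfo`, `external_masters`, and the sample data are hypothetical and only for illustration.

```python
from typing import List, NamedTuple, Optional


class Mount(NamedTuple):
    mount_id: int
    mountpoint: str
    shared_id: Optional[int]
    master_id: Optional[int]


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse the contents of a /proc/<pid>/mountinfo file."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        try:
            # Optional fields sit between field 6 and the "-" separator.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # blank or malformed line
        shared = master = None
        for tag in fields[6:sep]:
            key, _, value = tag.partition(":")
            if key == "shared":
                shared = int(value)
            elif key == "master":
                master = int(value)
        mounts.append(Mount(int(fields[0]), fields[4], shared, master))
    return mounts


def external_masters(mounts: List[Mount]) -> List[Mount]:
    """Return slave mounts whose master peer group is not visible here."""
    shared_ids = {m.shared_id for m in mounts if m.shared_id is not None}
    return [
        m for m in mounts
        if m.master_id is not None and m.master_id not in shared_ids
    ]


# Hypothetical sample, loosely modelled on the mount named in the error.
SAMPLE = """\
2375 2000 0:50 / / rw,relatime - overlay overlay rw
2380 2375 0:60 / /data rw shared:7 - tmpfs tmpfs rw
2381 2375 0:60 / /data2 rw master:7 - tmpfs tmpfs rw
2451 2375 0:5 /driver/nvidia/gpus/0000:01:00.0 /proc/driver/nvidia/gpus/0000:01:00.0 ro,nosuid master:15 - proc proc rw
"""

if __name__ == "__main__":
    for m in external_masters(parse_mountinfo(SAMPLE)):
        print(m.mount_id, m.mountpoint, "master:%d" % m.master_id)
```

To try this on a real container, the text of `/proc/<pid>/mountinfo` for the container's init process could be passed to `parse_mountinfo` instead of `SAMPLE`.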

CRIU logs and information:

CRIU full dump/restore logs:

```
(00.012402) mnt: <--
(00.012403) mnt: The mount 2449 is bind for 2450 (@./dev/nvidia-uvm-tools -> @./dev/nvidia0)
(00.012404) mnt: The mount 2448 is bind for 2450 (@./dev/nvidia-uvm -> @./dev/nvidia0)
(00.012405) mnt: The mount 2447 is bind for 2450 (@./dev/nvidiactl -> @./dev/nvidia0)
(00.012406) mnt: The mount 2444 is bind for 2445 (@./lib/firmware/nvidia/555.42.06/gsp_ga10x.bin -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012406) mnt: The mount 2440 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012407) mnt: The mount 2439 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012408) mnt: The mount 2438 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012408) mnt: The mount 2437 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012409) mnt: The mount 2436 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012410) mnt: The mount 2435 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvoptix.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012410) mnt: The mount 2434 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012411) mnt: The mount 2433 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012411) mnt: The mount 2432 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012412) mnt: The mount 2431 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-tls.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012413) mnt: The mount 2430 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012413) mnt: The mount 2429 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012414) mnt: The mount 2428 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012415) mnt: The mount 2427 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012415) mnt: The mount 2426 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012416) mnt: The mount 2425 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012418) mnt: The mount 2424 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012419) mnt: The mount 2423 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012419) mnt: The mount 2422 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012420) mnt: The mount 2421 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libcudadebugger.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012421) mnt: The mount 2420 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libcuda.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012421) mnt: The mount 2419 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012422) mnt: The mount 2418 is bind for 2445 (@./usr/lib/x86_64-linux-gnu/libnvidia-ml.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012422) mnt: The mount 2417 is bind for 2445 (@./usr/bin/nvidia-cuda-mps-server -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012423) mnt: The mount 2416 is bind for 2445 (@./usr/bin/nvidia-cuda-mps-control -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012424) mnt: The mount 2415 is bind for 2445 (@./usr/bin/nvidia-persistenced -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012424) mnt: The mount 2414 is bind for 2445 (@./usr/bin/nvidia-debugdump -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012425) mnt: The mount 2413 is bind for 2445 (@./usr/bin/nvidia-smi -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012426) mnt: The mount 2410 is bind for 2445 (@./usr/share/vulkan/implicit_layer.d/nvidia_layers.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012426) mnt: The mount 2409 is bind for 2445 (@./usr/share/vulkan/icd.d/nvidia_icd.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012427) mnt: The mount 2408 is bind for 2445 (@./usr/share/glvnd/egl_vendor.d/10_nvidia.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012427) mnt: The mount 2407 is bind for 2445 (@./usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012428) mnt: The mount 2406 is bind for 2445 (@./usr/share/X11/xorg.conf.d/10-nvidia.conf -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012429) mnt: The mount 2405 is bind for 2445 (@./lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012429) mnt: The mount 2404 is bind for 2445 (@./lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.555.42.06 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012430) mnt: The mount 2389 is bind for 2445 (@./lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012431) mnt: The mount 2388 is bind for 2445 (@./home/hkp/.Xauthority -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012431) mnt: The mount 2387 is bind for 2445 (@./home/autoware/autoware-contents -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012432) mnt: The mount 2386 is bind for 2445 (@./tmp/fuzzerdata -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012433) mnt: The mount 2385 is bind for 2445 (@./tmp/.X11-unix -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012433) mnt: The mount 2384 is bind for 2445 (@./etc/resolv.conf -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012434) mnt: The mount 2383 is bind for 2445 (@./etc/hosts -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012435) mnt: The mount 2382 is bind for 2445 (@./etc/hostname -> @./lib/firmware/nvidia/555.42.06/gsp_tu10x.bin)
(00.012435) mnt: The mount 2442 is bind for 2443 (@./usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 -> @./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129)
(00.012436) mnt: The mount 2441 is bind for 2443 (@./usr/lib/x86_64-linux-gnu/libcuda.so.410.129 -> @./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129)
(00.012438) mnt: The mount 2375 is bind for 2443 (@./ -> @./usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129)
(00.012445) mnt: Found /dev/nvidia0 mapping for ./dev/nvidia0 mountpoint
(00.012446) mnt: Found /dev/nvidia-uvm-tools mapping for ./dev/nvidia-uvm-tools mountpoint
(00.012448) mnt: Found /dev/nvidia-uvm mapping for ./dev/nvidia-uvm mountpoint
(00.012449) mnt: Found /dev/nvidiactl mapping for ./dev/nvidiactl mountpoint
(00.012477) mnt: Found /usr/share/vulkan/implicit_layer.d/nvidia_layers.json mapping for ./usr/share/vulkan/implicit_layer.d/nvidia_layers.json mountpoint
(00.012478) mnt: Found /usr/share/vulkan/icd.d/nvidia_icd.json mapping for ./usr/share/vulkan/icd.d/nvidia_icd.json mountpoint
(00.012480) mnt: Found /usr/share/glvnd/egl_vendor.d/10_nvidia.json mapping for ./usr/share/glvnd/egl_vendor.d/10_nvidia.json mountpoint
(00.012481) mnt: Found /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json mapping for ./usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json mountpoint
(00.012482) mnt: Found /usr/share/X11/xorg.conf.d/10-nvidia.conf mapping for ./usr/share/X11/xorg.conf.d/10-nvidia.conf mountpoint
(00.012483) mnt: Found /lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so mapping for ./lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so mountpoint
(00.012485) mnt: Found /lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.555.42.06 mapping for ./lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.555.42.06 mountpoint
(00.012486) mnt: Found /sys/fs/cgroup/cpuset mapping for ./sys/fs/cgroup/cpuset mountpoint
(00.012487) mnt: Found /sys/fs/cgroup/blkio mapping for ./sys/fs/cgroup/blkio mountpoint
(00.012489) mnt: Found /sys/fs/cgroup/rdma mapping for ./sys/fs/cgroup/rdma mountpoint
(00.012490) mnt: Found /sys/fs/cgroup/perf_event mapping for ./sys/fs/cgroup/perf_event mountpoint
(00.012491) mnt: Found /sys/fs/cgroup/pids mapping for ./sys/fs/cgroup/pids mountpoint
(00.012493) mnt: Found /sys/fs/cgroup/net_cls,net_prio mapping for ./sys/fs/cgroup/net_cls,net_prio mountpoint
(00.012494) mnt: Found /sys/fs/cgroup/devices mapping for ./sys/fs/cgroup/devices mountpoint
(00.012495) mnt: Found /sys/fs/cgroup/misc mapping for ./sys/fs/cgroup/misc mountpoint
(00.012496) mnt: Found /sys/fs/cgroup/cpu,cpuacct mapping for ./sys/fs/cgroup/cpu,cpuacct mountpoint
(00.012498) mnt: Found /sys/fs/cgroup/memory mapping for ./sys/fs/cgroup/memory mountpoint
(00.012499) mnt: Found /sys/fs/cgroup/freezer mapping for ./sys/fs/cgroup/freezer mountpoint
(00.012500) mnt: Found /sys/fs/cgroup/hugetlb mapping for ./sys/fs/cgroup/hugetlb mountpoint
(00.012501) mnt: Found /sys/fs/cgroup/systemd mapping for ./sys/fs/cgroup/systemd mountpoint
(00.012503) mnt: Found /lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 mapping for ./lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 mountpoint
(00.012505) mnt: Found /home/hkp/.Xauthority mapping for ./home/hkp/.Xauthority mountpoint
(00.012506) mnt: Found /home/autoware/autoware-contents mapping for ./home/autoware/autoware-contents mountpoint
(00.012507) mnt: Found /tmp/fuzzerdata mapping for ./tmp/fuzzerdata mountpoint
(00.012509) mnt: Found /tmp/.X11-unix mapping for ./tmp/.X11-unix mountpoint
(00.012510) mnt: Found /etc/resolv.conf mapping for ./etc/resolv.conf mountpoint
(00.012511) mnt: Found /etc/hosts mapping for ./etc/hosts mountpoint
(00.012513) mnt: Found /etc/hostname mapping for ./etc/hostname mountpoint
(00.012518) mnt: Inspecting sharing on 2451 shared_id 0 master_id 15 (@./proc/driver/nvidia/gpus/0000:01:00.0)
(00.012519) Error (criu/mount.c:1088): mnt: Mount 2451 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
```
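In a dump log this long, the one line that matters is the `Error (...)` entry at the end. As a small convenience, here is a hedged Python sketch that pulls such entries out of a `criu-dump.log`. It assumes the log file has one entry per line in the `(timestamp) Error (file:line): message` shape seen above; the function name `find_errors` and the two-line sample are illustrative only.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# (00.012519) Error (criu/mount.c:1088): mnt: Mount 2451 ... has unreachable sharing.
ERROR_RE = re.compile(
    r"^\((?P<ts>\d+\.\d+)\)\s+Error \((?P<src>[^)]+)\):\s*(?P<msg>.*)$"
)


def find_errors(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (timestamp, source location, message) for each Error line."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            errors.append((match["ts"], match["src"], match["msg"]))
    return errors


# Two lines copied from the log in this issue.
SAMPLE_LOG = """\
(00.012518) mnt: Inspecting sharing on 2451 shared_id 0 master_id 15 (@./proc/driver/nvidia/gpus/0000:01:00.0)
(00.012519) Error (criu/mount.c:1088): mnt: Mount 2451 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
"""

if __name__ == "__main__":
    for ts, src, msg in find_errors(SAMPLE_LOG):
        print(ts, src, msg)
```

For a real run, the contents of the `criu-dump.log` file named in the Docker error message could be read and passed to `find_errors`.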

Output of `criu --version`:

```
Version: 3.19
```

Output of `criu check --all`:

```
Can't check shutdown state of inet socket
Warn (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
```

Additional environment details:

Ubuntu 20.04; Docker version 24.0.5, build 24.0.5-0ubuntu1~20.04.1

rst0git commented 1 month ago

> Error (criu/mount.c:1088): mnt: Mount 2450 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

@nwpuhkp This is a known problem. Docker, containerd and CRI-O do not currently support checkpoint/restore with NVIDIA GPUs using the CUDA plugin for CRIU.

nwpuhkp commented 1 month ago

> Error (criu/mount.c:1088): mnt: Mount 2450 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 15 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

> @nwpuhkp This is a known problem. Docker, containerd and CRI-O do not currently support checkpoint/restore with NVIDIA GPUs using the CUDA plugin for CRIU.

Okay, I had not fully understood the scope of what CRIU supports. Thank you for your reply. I will look for other solutions.