checkpoint-restore / criu

amdgpu_plugin: Failed to dump (ret:-22) #2248

Open rst0git opened 11 months ago

rst0git commented 11 months ago

The following two errors occur when checkpointing GPU applications with the AMD GPU plugin for CRIU.

  1. When checkpointing a CRI-O container running the AlexNet CNN on an Ubuntu 20.04 system with an MI100, CRIU fails with:
(00.208098) amdgpu_plugin: Thread[0x5bb8] started
(00.208503) amdgpu_plugin: amdgpu-pages-252-5bb8.img:Opened file for write with size:33158160384
(02.766607) Error (criu/parasite-syscall.c:88): si_code=2 si_pid=1752711 si_status=9
(02.767880) Error (criu/parasite-syscall.c:93): 1752767 was killed by 9 unexpectedly: Killed

K8s YAML file: alexnet.yaml; full CRIU log file: criu.log; hardware configuration: lshw.txt

  2. When attempting to checkpoint CRI-O containers on a CentOS 9 system with two MI210 GPUs, CRIU fails with the following errors (the CRIU logs were generated with the patch shown further down, which prints the values of id_map->src, id_map->dest, and src_id):
    (00.171135) amdgpu_plugin: Number of CPUs:3 GPUs:1
    (00.171143) id_map->src: 9704; id_map->dest: 9704; src_id: 39309
    (00.171147) Error (amdgpu_plugin.c:322): amdgpu_plugin: maps_get_dest_gpu failed 0
    (00.171157) amdgpu_plugin: Dumped devices Failed (ret:-22)
    (00.171179) amdgpu_plugin: Process unpaused Ok (ret:0)
    (00.171243) Error (amdgpu_plugin.c:1456): amdgpu_plugin: Failed to dump (ret:-22)
    (00.171296) ----------------------------------------
    (00.171417) Error (criu/cr-dump.c:1669): Dump files (pid: 646845) failed with -1

    K8s YAML file: binomial-option.yaml; full CRIU log file: criu.log; hardware configuration: lshw.txt

--- a/plugins/amdgpu/amdgpu_plugin_topology.c
+++ b/plugins/amdgpu/amdgpu_plugin_topology.c
@@ -265,6 +265,7 @@ uint32_t maps_get_dest_gpu(const struct device_maps *maps, const uint32_t src_id
        struct id_map *id_map;

        list_for_each_entry(id_map, &maps->gpu_maps, listm) {
+               pr_debug("id_map->src: %d; id_map->dest: %d; src_id: %d\n", id_map->src, id_map->dest, src_id);
                if (id_map->src == src_id)
                        return id_map->dest;
        }

In both cases, we use Kubernetes v1.27.4, CRI-O v1.26.0, AMD GPU device plugin, and CRIU compiled from the criu-dev branch.
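
To make the second failure easier to reason about, here is a minimal standalone sketch of the kind of lookup the patched function performs. It is not the plugin's code: `struct gpu_id_pair` and `lookup_dest_gpu` are hypothetical names, the array stands in for the plugin's linked list, and returning 0 on a miss is an assumption based on the "maps_get_dest_gpu failed 0" log line.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one entry of the plugin's GPU ID map. */
struct gpu_id_pair {
	uint32_t src;  /* GPU ID recorded in the map */
	uint32_t dest; /* GPU ID it maps to */
};

/*
 * Return the destination ID mapped to src_id, or 0 when no entry matches.
 * The 0-on-miss convention is an assumption inferred from the log above.
 */
static uint32_t lookup_dest_gpu(const struct gpu_id_pair *pairs, size_t n,
				uint32_t src_id)
{
	for (size_t i = 0; i < n; i++) {
		if (pairs[i].src == src_id)
			return pairs[i].dest;
	}
	return 0;
}
```

With a single entry `{ 9704, 9704 }`, as in the log, looking up 39309 finds no match and yields 0, which would be consistent with the reported failure.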

github-actions[bot] commented 10 months ago

A friendly reminder that this issue had no activity for 30 days.

fdavid-amd commented 9 months ago

CRIU gets data about the system's GPUs from two places: the PROCESS_INFO CRIU ioctl and the /sys/class/kfd/kfd/topology sysfs entry. Somehow, on this system, the two sources disagree about both the number of devices and their IDs.
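
One way to check the sysfs side of this by hand is to list the IDs that the topology exposes and compare them with the src_id printed in the log. The sketch below is only an illustration, not plugin code: the helper names are hypothetical, it assumes each node directory under `/sys/class/kfd/kfd/topology/nodes` contains a `gpu_id` file holding one decimal number, and it assumes CPU-only nodes report 0.

```c
#include <dirent.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed location of the per-node topology directories. */
#define KFD_TOPOLOGY_NODES "/sys/class/kfd/kfd/topology/nodes"

/* Read one unsigned decimal ID from a file; 0 on success, -1 on error. */
static int read_gpu_id_file(const char *path, uint32_t *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%" SCNu32, out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Collect the non-zero gpu_id values found under nodes_dir.
 * Returns the number of IDs stored, or -1 if nodes_dir cannot be opened.
 */
static int collect_gpu_ids(const char *nodes_dir, uint32_t *ids, size_t max)
{
	DIR *d = opendir(nodes_dir);
	struct dirent *de;
	int n = 0;

	if (!d)
		return -1;
	while ((size_t)n < max && (de = readdir(d)) != NULL) {
		char path[512];
		uint32_t id;
		int len;

		if (de->d_name[0] == '.')
			continue; /* skip ".", ".." and hidden entries */
		len = snprintf(path, sizeof(path), "%s/%s/gpu_id",
			       nodes_dir, de->d_name);
		if (len < 0 || (size_t)len >= sizeof(path))
			continue; /* path did not fit; skip this node */
		/* Nodes reporting 0 are assumed to be CPU-only; skip them. */
		if (read_gpu_id_file(path, &id) == 0 && id != 0)
			ids[n++] = id;
	}
	closedir(d);
	return n;
}
```

Calling `collect_gpu_ids(KFD_TOPOLOGY_NODES, ids, 16)` inside the container and on the host, and comparing the results with the IDs in the CRIU log, may show which of the two sources is out of step.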