checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

amdgpu_plugin: Failed to dump (ret:-22) #2248

Open rst0git opened 1 year ago

rst0git commented 1 year ago

The following two errors occur when checkpointing GPU applications with the AMD GPU plugin for CRIU.

  1. When checkpointing a CRI-O container running an AlexNet CNN on an Ubuntu 20.04 system with an MI100, CRIU fails with:
(00.208098) amdgpu_plugin: Thread[0x5bb8] started
(00.208503) amdgpu_plugin: amdgpu-pages-252-5bb8.img:Opened file for write with size:33158160384
(02.766607) Error (criu/parasite-syscall.c:88): si_code=2 si_pid=1752711 si_status=9
(02.767880) Error (criu/parasite-syscall.c:93): 1752767 was killed by 9 unexpectedly: Killed

K8s YAML file: alexnet.yaml; full CRIU log file: criu.log; hardware configuration: lshw.txt

  2. When attempting to checkpoint CRI-O containers on a CentOS 9 system with two MI210 GPUs, CRIU fails as follows (the logs were generated with the patch shown after them, which prints the values of id_map->src, id_map->dest, and src_id):
    (00.171135) amdgpu_plugin: Number of CPUs:3 GPUs:1
    (00.171143) id_map->src: 9704; id_map->dest: 9704; src_id: 39309
    (00.171147) Error (amdgpu_plugin.c:322): amdgpu_plugin: maps_get_dest_gpu failed 0
    (00.171157) amdgpu_plugin: Dumped devices Failed (ret:-22)
    (00.171179) amdgpu_plugin: Process unpaused Ok (ret:0)
    (00.171243) Error (amdgpu_plugin.c:1456): amdgpu_plugin: Failed to dump (ret:-22)
    (00.171296) ----------------------------------------
    (00.171417) Error (criu/cr-dump.c:1669): Dump files (pid: 646845) failed with -1

    K8s YAML file: binomial-option.yaml; full CRIU log file: criu.log; hardware configuration: lshw.txt

--- a/plugins/amdgpu/amdgpu_plugin_topology.c
+++ b/plugins/amdgpu/amdgpu_plugin_topology.c
@@ -265,6 +265,7 @@ uint32_t maps_get_dest_gpu(const struct device_maps *maps, const uint32_t src_id
        struct id_map *id_map;

        list_for_each_entry(id_map, &maps->gpu_maps, listm) {
+               pr_debug("id_map->src: %d; id_map->dest: %d; src_id: %d\n", id_map->src, id_map->dest, src_id);
                if (id_map->src == src_id)
                        return id_map->dest;
        }

In both cases, we use Kubernetes v1.27.4, CRI-O v1.26.0, the AMD GPU device plugin, and CRIU compiled from the criu-dev branch.

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

fdavid-amd commented 1 year ago

CRIU gets data about the system's GPUs from two places: the PROCESS_INFO CRIU ioctl and the /sys/class/kfd/kfd/topology sysfs entry. Somehow, on this system, the two disagree about how many devices there are and what their IDs are.
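
To check the sysfs side of that comparison by hand, something like the following could be used. It is only a rough sketch, not code from CRIU: the helper names `read_sysfs_gpu_ids` and `gpu_id_known` are made up for illustration, and it assumes the usual KFD layout where each node directory under /sys/class/kfd/kfd/topology/nodes contains a `gpu_id` file (0 for CPU-only nodes). The ioctl side is not shown; the `src_id` to look up would come from the CRIU log.

```c
#include <assert.h>
#include <dirent.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Usual KFD topology location; another directory can be passed for testing. */
#define KFD_NODES_DIR "/sys/class/kfd/kfd/topology/nodes"

/*
 * Hypothetical helper (not part of CRIU): collect the non-zero gpu_id
 * values found in <nodes_dir>/<N>/gpu_id. CPU-only nodes report 0 and
 * are skipped. Returns the number of IDs stored in ids[], or -1 if
 * nodes_dir cannot be opened.
 */
int read_sysfs_gpu_ids(const char *nodes_dir, uint32_t *ids, int max_ids)
{
	DIR *dir;
	struct dirent *de;
	int count = 0;

	assert(ids != NULL && max_ids >= 0);

	dir = opendir(nodes_dir);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL && count < max_ids) {
		char path[4096];
		uint32_t id = 0;
		FILE *f;

		if (de->d_name[0] == '.')
			continue; /* skip "." and ".." */

		snprintf(path, sizeof(path), "%s/%s/gpu_id", nodes_dir, de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue; /* skip nodes whose gpu_id cannot be opened */
		if (fscanf(f, "%" SCNu32, &id) == 1 && id != 0)
			ids[count++] = id;
		fclose(f);
	}

	closedir(dir);
	return count;
}

/* Hypothetical helper: return 1 if src_id is among ids[0..count), else 0. */
int gpu_id_known(const uint32_t *ids, int count, uint32_t src_id)
{
	for (int i = 0; i < count; i++)
		if (ids[i] == src_id)
			return 1;
	return 0;
}
```

Calling `read_sysfs_gpu_ids(KFD_NODES_DIR, ids, max)` inside the container and then `gpu_id_known(ids, n, 39309)` with the `src_id` from the log above would show whether sysfs exposes the ID that the lookup in maps_get_dest_gpu is asked for.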