Open rst0git opened 1 year ago
CRIU gets data about the system's GPUs from two places: the PROCESS_INFO CRIU ioctl and the /sys/class/kfd/kfd/topology sysfs entry. Somehow, on this system, the two disagree about both the number of devices and their IDs.
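To check the sysfs side of this by hand, something like the following sketch may help. It is not part of CRIU or the AMD GPU plugin. It assumes that each topology node directory under `/sys/class/kfd/kfd/topology/nodes` contains a `gpu_id` file and that CPU-only nodes report `0` there. The function names and the `reported_ids` argument are hypothetical; the IDs from the ioctl side would have to be copied in manually, for example from the CRIU log.

```python
from pathlib import Path

# Assumed location of the per-node KFD topology entries in sysfs.
KFD_NODES = Path("/sys/class/kfd/kfd/topology/nodes")


def read_sysfs_gpu_ids(nodes_dir=KFD_NODES):
    """Return {node_index: gpu_id} for every topology node that looks like a GPU."""
    nodes_dir = Path(nodes_dir)
    gpu_ids = {}
    if not nodes_dir.is_dir():
        return gpu_ids  # no KFD topology on this machine
    for node in nodes_dir.iterdir():
        id_file = node / "gpu_id"
        if not node.name.isdigit() or not id_file.is_file():
            continue
        try:
            gpu_id = int(id_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry, skip it
        if gpu_id != 0:  # assumption: CPU-only nodes report gpu_id 0
            gpu_ids[int(node.name)] = gpu_id
    return gpu_ids


def compare_gpu_ids(sysfs_ids, reported_ids):
    """Compare sysfs gpu_ids against IDs taken from another source.

    reported_ids is a hypothetical input: an iterable of gpu_id values
    gathered elsewhere (e.g. read off the CRIU log by hand).
    """
    sysfs_set = set(sysfs_ids.values())
    reported_set = set(reported_ids)
    return {
        "only_in_sysfs": sorted(sysfs_set - reported_set),
        "only_in_reported": sorted(reported_set - sysfs_set),
        "count_matches": len(sysfs_set) == len(reported_set),
    }
```

Calling `compare_gpu_ids(read_sysfs_gpu_ids(), ids_from_log)` on the affected node would show which IDs appear in only one of the two sources and whether the device counts agree.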
The following two errors occur when checkpointing GPU applications with the AMD GPU plugin for CRIU.
- K8s yaml file: alexnet.yaml
- Full CRIU log file: criu.log
- Hardware configuration: lshw.txt
`id_map->src` and `src_id`)

- K8s yaml file: binomial-option.yaml
- Full CRIU log file: criu.log
- Hardware configuration: lshw.txt
In both cases, we use Kubernetes v1.27.4, CRI-O v1.26.0, the AMD GPU device plugin, and CRIU compiled from the `criu-dev` branch.