eth-cscs / sarus

OCI-compatible engine to deploy Linux containers on HPC environments.
https://sarus.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

AMD GPU support #20

Open haampie opened 3 years ago

haampie commented 3 years ago

Adds a hook for AMD GPUs, which currently just mounts /dev/dri and /dev/kfd as advocated by AMD.

The hook can be enabled through the following flag:

sarus run --amdgpu [container] [cmd]

It will just fail when /dev/dri or /dev/kfd does not exist or can't be mounted.
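
In terms of mounts, the flag is assumed to be roughly equivalent to passing the bind mounts explicitly, as done later in this thread:

$ sarus run \
  --mount=type=bind,src=/dev/kfd,dst=/dev/kfd \
  --mount=type=bind,src=/dev/dri,dst=/dev/dri \
  [container] [cmd]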

haampie commented 3 years ago

Hi @Madeeks, I haven't tested this with multiple GPUs, but in principle it should work. Every GPU should be listed as /dev/dri/card{n} for n = 0, 1, ..., and this PR mounts /dev/dri entirely.

I'll think about autodetection like we have for NVIDIA GPUs, but I don't immediately know what to check. AMD likes to install /opt/rocm/bin/hipconfig to check the version of the ROCm libs, but its presence doesn't imply there are actual GPUs available. Maybe the best approach is to check whether vendor data is available from /dev/dri/card* and/or /dev/kfd, as sketched below.
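
A minimal sketch of such a check, assuming the driver exposes the PCI vendor ID (0x1002 for AMD) under /sys/class/drm/card*/device/vendor; these sysfs paths are an assumption, not something the hook currently inspects:

# hypothetical: consider AMD GPUs present if the KFD node exists and
# at least one DRM card reports AMD's PCI vendor ID (0x1002)
amd_gpu_present() {
    [ -e /dev/kfd ] || return 1
    for vendor in /sys/class/drm/card*/device/vendor; do
        grep -qi '0x1002' "$vendor" 2>/dev/null && return 0
    done
    return 1
}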

haampie commented 3 years ago

OK, so rocm_agent_enumerator detects AMD GPUs by calling hsa_iterate_agents, which is available from a Spack package (https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/hsa-rocr-dev/package.py), but that depends on AMD's fork of LLVM :D so it's not a great dependency to just add to Sarus.

Another idea is to check whether rocminfo is in the PATH or /opt/rocm/bin/rocminfo exists, and if so execute it and grep the output for some string. That's a bit ugly, but probably the easiest.
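
Sketched out, using the "Device Type: GPU" lines from rocminfo's output (visible in the transcripts below) as the string to grep for:

# hypothetical: fall back to /opt/rocm/bin/rocminfo if it's not on the PATH
rocminfo_bin=$(command -v rocminfo || echo /opt/rocm/bin/rocminfo)
if [ -x "$rocminfo_bin" ] && "$rocminfo_bin" 2>/dev/null | grep -q 'Device Type:[[:space:]]*GPU'; then
    echo "at least one AMD GPU detected"
fi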

Madeeks commented 3 years ago

Let me elaborate a bit more on my question about the hook interface and device selection.

The CUDA runtime uses the CUDA_VISIBLE_DEVICES environment variable to determine the GPU devices applications have access to. The NVIDIA Container Toolkit uses NVIDIA_VISIBLE_DEVICES to determine which GPUs to mount inside the container. By checking for the presence of such variables, Sarus does not need an explicit CLI option to know if the host process is requesting GPU devices (and which ones).

I was wondering if there are analogous variables in the ROCm environment. A quick search brought me to the following issues: https://github.com/RadeonOpenCompute/ROCm/issues/841, https://github.com/RadeonOpenCompute/ROCm/issues/994. From what I understand, there are two variables covering similar roles: HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES. I don't have experience with ROCm, so in your opinion, can either of those be used to control hook activation? If so, which one is the most appropriate? And how do the numerical IDs in those variables relate to the /dev/dri/* files?

As an additional reference, the GRES plugin of Slurm sets CUDA_VISIBLE_DEVICES to the GPUs allocated by the workload manager. What's the mechanism implemented by Slurm (or other workload managers) to signal allocation of AMD GPUs?
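
For illustration, a hypothetical activation check for an AMD hook, assuming one of the two ROCm variables ends up playing the role that NVIDIA_VISIBLE_DEVICES plays for the NVIDIA hook:

# hypothetical: activate the hook only when the host process requests AMD GPUs
requested="${ROCR_VISIBLE_DEVICES:-${HIP_VISIBLE_DEVICES:-}}"
if [ -n "$requested" ]; then
    echo "AMD GPU devices requested: $requested"
fi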

haampie commented 3 years ago

Ah, Ault is configured such that by default you get all GPUs.

$ srun -p amdvega /bin/bash -c 'echo "ROCR_VISIBLE_DEVICES: $ROCR_VISIBLE_DEVICES"; /opt/rocm/bin/rocm_agent_enumerator; ls /dev/dri/card*'
ROCR_VISIBLE_DEVICES: 
gfx000
gfx906
gfx906
gfx906
/dev/dri/card0
/dev/dri/card1
/dev/dri/card2
/dev/dri/card3

$ srun -p amdvega --gres=gpu:1 /bin/bash -c 'echo "ROCR_VISIBLE_DEVICES: $ROCR_VISIBLE_DEVICES"; /opt/rocm/bin/rocm_agent_enumerator; ls /dev/dri/card*'
ROCR_VISIBLE_DEVICES: 0
gfx000
gfx906
/dev/dri/card0
/dev/dri/card1
/dev/dri/card2
/dev/dri/card3

$ srun -p amdvega --gres=gpu:3 /bin/bash -c 'echo "ROCR_VISIBLE_DEVICES: $ROCR_VISIBLE_DEVICES"; /opt/rocm/bin/rocm_agent_enumerator; ls /dev/dri/card*'
ROCR_VISIBLE_DEVICES: 0,1,2
gfx000
gfx906
gfx906
gfx906
/dev/dri/card0
/dev/dri/card1
/dev/dri/card2
/dev/dri/card3

$ srun -p amdvega --gres=gpu:2 /bin/bash -c '/opt/rocm/bin/rocminfo | grep GPU'
  Uuid:                    GPU-3f50506172fc1a63               
  Device Type:             GPU                                
  Uuid:                    GPU-3f4478c172fc1a63               
  Device Type:             GPU                                

$ srun -p amdvega --gres=gpu:2 /bin/bash -c '/opt/rocm/opencl/bin/clinfo | grep Number'
Number of platforms:                 1
Number of devices:               2

haampie commented 3 years ago

So, ROCR_VISIBLE_DEVICES is only set when --gres=gpu[:n] is provided. When it is set, I think it is handled at the software level by the ROCm stack, so we might not want to bother with the bookkeeping of mounting exactly those specific GPUs from /dev/dri, but rather leave that to ROCm. For instance:

$ ROCR_VISIBLE_DEVICES=1,2 sarus run -t \
  --mount=type=bind,src=/dev/kfd,dst=/dev/kfd \
  --mount=type=bind,src=/dev/dri,dst=/dev/dri \
  stabbles/sirius-rocm \
  /opt/spack/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rocminfo-4.0.0-lruzhymnjm4hez3jeuyf3kyhmjjloqyp/bin/rocm_agent_enumerator
gfx000
gfx906
gfx906

How about we just unconditionally mount /dev/kfd and /dev/dri when they exist?
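
A sketch of that behavior (mount_device is a hypothetical stand-in for whatever bind mount the hook performs):

# mount the AMD device nodes only when present on the host; otherwise do nothing
[ -e /dev/kfd ] && mount_device /dev/kfd
[ -d /dev/dri ] && mount_device /dev/dri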


Edit: in fact, I find it only confusing to mount just a few specific GPUs, because ROCR_VISIBLE_DEVICES=1,2 would then have to be unset or relabeled to ROCR_VISIBLE_DEVICES=0,1 inside the container:

$ ls /dev/dri/
by-path  card0  card1  card2  card3  renderD128  renderD129  renderD130

$ ROCR_VISIBLE_DEVICES=1,2 sarus run \
  --mount=type=bind,src=/dev/kfd,dst=/dev/kfd \
  --mount=type=bind,src=/dev/dri/renderD129,dst=/dev/dri/renderD129 \
  --mount=type=bind,src=/dev/dri/renderD130,dst=/dev/dri/renderD130 \
  stabbles/sirius-rocm /bin/bash -c '/opt/spack/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rocminfo-4.0.0-lruzhymnjm4hez3jeuyf3kyhmjjloqyp/bin/rocminfo'
... only shows 1 GPU, because ROCR_VISIBLE_DEVICES is still 1,2 while the GPUs are now labeled 0,1 ...

$ ROCR_VISIBLE_DEVICES=1,2 sarus run \
  --mount=type=bind,src=/dev/kfd,dst=/dev/kfd \
  --mount=type=bind,src=/dev/dri/renderD129,dst=/dev/dri/renderD129 \
  --mount=type=bind,src=/dev/dri/renderD130,dst=/dev/dri/renderD130 \
  stabbles/sirius-rocm /bin/bash -c 'unset ROCR_VISIBLE_DEVICES && /opt/spack/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rocminfo-4.0.0-lruzhymnjm4hez3jeuyf3kyhmjjloqyp/bin/rocminfo'
... correctly shows 2 GPUs ...
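
For completeness, the relabeling alternative would look roughly like this; the manual index remapping (1,2 on the host becomes 0,1 in the container) is exactly the bookkeeping that seems easy to get wrong:

$ ROCR_VISIBLE_DEVICES=1,2 sarus run \
  --mount=type=bind,src=/dev/kfd,dst=/dev/kfd \
  --mount=type=bind,src=/dev/dri/renderD129,dst=/dev/dri/renderD129 \
  --mount=type=bind,src=/dev/dri/renderD130,dst=/dev/dri/renderD130 \
  stabbles/sirius-rocm /bin/bash -c 'ROCR_VISIBLE_DEVICES=0,1 /opt/spack/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rocminfo-4.0.0-lruzhymnjm4hez3jeuyf3kyhmjjloqyp/bin/rocminfo'
... should also show 2 GPUs, assuming the remapped indices are correct ...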