ROCm / ROCm-docker

Dockerfiles for the various software layers defined in the ROCm software platform
MIT License

`rocminfo` fails in `rocm/rocm-terminal` #116

Open danpetreamd opened 8 months ago

danpetreamd commented 8 months ago

rocm-smi works fine.

The following was run on a 4x GPU System:

$ docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G rocm/rocm-terminal:latest

# rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
Failed to get user name to check for video group membership

# rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    37.0c           38.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
1    40.0c           39.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
2    41.0c           42.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
3    39.0c           35.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================

# groups
rocm-user sudo video

rocminfo works fine in rocm/dev-ubuntu-22.04 and rocm/pytorch:

$ docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G rocm/pytorch:latest

# rocminfo | grep MI100
  Marketing Name:          AMD Instinct MI100
  Marketing Name:          AMD Instinct MI100
  Marketing Name:          AMD Instinct MI100
  Marketing Name:          AMD Instinct MI100

# groups
root video

danpetreamd commented 8 months ago

Using the instructions in the README.md:

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video rocm/rocm-terminal
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

# rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
Failed to get user name to check for video group membership

danpetreamd commented 8 months ago

I'm wondering if this image is still in use and/or if we can deprecate it.

baryluk commented 6 months ago

Same here.

Also, one does not need to run sudo docker ...

As long as the user is in the docker group, plain docker ... works. Running docker with sudo should not be promoted like this (not that it matters too much).

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video rocm/rocm-terminal
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

rocm-user@015b5fcf64bf:~$ rocminfo 
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
Failed to get user name to check for video group membership
rocm-user@015b5fcf64bf:~$ logout
$ 

The reason is that video is the wrong group; it should be render:

$ ls -l /dev/kfd 
crw-rw---- 1 root render 243, 0 Dec 14 04:54 /dev/kfd
$
$ grep render /etc/group
render:x:993:user
$

For some reason, adding render by name does not work, though:

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --group-add render rocm/rocm-terminal
docker: Error response from daemon: Unable to find group render: no matching entries in group file.
ERRO[0000] error waiting for container: context canceled 

Probably because /etc/group in the container is different from the host's and has no render entry.
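A quick way to check that guess is to look the name up in a group(5)-format file. This is only a sketch: `lookup_gid` is a made-up helper name, and it assumes a POSIX shell with awk is available wherever it is run (on the host, or pasted into a shell inside the container).

```shell
#!/bin/sh
# Hypothetical helper: print the GID of group NAME from a group(5)-format file
# (name:password:gid:members). Defaults to /etc/group.
# Returns non-zero when the group has no entry in that file.
lookup_gid() {
    awk -F: -v name="$1" '
        $1 == name { print $3; found = 1 }
        END        { exit !found }
    ' "${2:-/etc/group}"
}

# Example (not run here):
#   lookup_gid render || echo "no render group in this /etc/group"
```

If the guess is right, `lookup_gid render` prints a GID on the host (993 in the listing above) and fails inside the container, which would match the "no matching entries in group file" error.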

Running docker run with --user=root is an option, which is not too bad (the file system, processes, etc. are still isolated and safe), but it would be nice to find a cleaner solution.

baryluk commented 6 months ago

This looks related - https://github.com/RadeonOpenCompute/ROCm-docker/issues/90

baryluk commented 4 months ago

Still broken when following the instructions in the current README.md.

If I pass the render GID by number, it complains but works:

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add $(getent group render | cut -d: -f3)  rocm/rocm-terminal
groups: cannot find name for group ID 993
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

rocm-user@3e3292ebfa5c:~$ 

And /dev/kfd works inside (i.e. rocminfo has no issues accessing it).
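A variation on the same workaround, sketched under a few assumptions: it reads the owning GID straight from the device nodes instead of looking up a group name, so it does not depend on the group being called render on the host. It assumes GNU coreutils `stat` (the `-c '%g'` format); `group_add_args` is a made-up helper name, not something from the README.

```shell
#!/bin/sh
# Hypothetical helper: print one "--group-add <gid> " per distinct owning
# group of the given paths. Assumes GNU stat (-c '%g' prints the numeric GID).
group_add_args() {
    for path in "$@"; do
        [ -e "$path" ] || continue      # skip nodes that do not exist on this host
        stat -c '%g' "$path"
    done | sort -un | while read -r gid; do
        printf '%s %s ' --group-add "$gid"
    done
}

# Usage (not run here); the unquoted substitution is split into separate arguments:
#   docker run -it --device=/dev/kfd --device=/dev/dri \
#       --security-opt seccomp=unconfined \
#       $(group_add_args /dev/kfd /dev/dri/renderD*) rocm/rocm-terminal
```

Since the GIDs are passed as numbers, the container's /etc/group does not need a matching entry; the "cannot find name for group ID" message from groups would still be expected.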