ROCm / ROCm-docker

Dockerfiles for the various software layers defined in the ROCm software platform
MIT License

Create `render` group for Ubuntu >= 20, as per ROCm documentation #90

Open romintomasetti opened 2 years ago

romintomasetti commented 2 years ago

Initial issue

As stated in https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation_new.html#setting-permissions-for-groups, for Ubuntu 20 and above, the user needs to be part of the render group.

Therefore, we need to create the render group in the docker image. The following would work:

RUN groupadd render

We might also want to update the documentation because the docker run command should contain --group-add render for Ubuntu 20 and above.

Update - 10th June 2022

I ran the following experiments. The user I'm logged in as on the host is a member of the render group. My user ID is 1002.

  1. docker run --rm --device=/dev/kfd rocm/dev-ubuntu-20.04:5.1 rocminfo

    works because it runs as root (with user ID 0 on the host), and /dev/kfd is owned by root:render:

    ll /dev/kfd
    crw-rw---- 1 root render 510, 0 Jun  9 04:11 /dev/kfd
  2. docker run --rm --user=1002 --device=/dev/kfd rocm/dev-ubuntu-20.04:5.1 rocminfo

    fails with: Unable to open /dev/kfd read-write: Permission denied.

  3. docker run --rm --user=1002 --group-add render --device=/dev/kfd rocm/dev-ubuntu-20.04:5.1 rocminfo

    fails because there is no render group inside rocm/dev-ubuntu-20.04:5.1, so Docker cannot resolve the group name.

  4. docker run --rm --user=1002 --group-add $(getent group render | cut -d':' -f 3) --device=/dev/kfd rocm/dev-ubuntu-20.04:5.1 rocminfo

    works again, because the group is passed by its numeric ID from the host instead of by name.
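The four experiments come down to one question: does the process's group list contain the GID that owns /dev/kfd? Below is a minimal sketch of that check, meant to be run inside the container. The function names gid_in_list and check_device_access are made up for illustration, and stat -c assumes GNU coreutils.

```shell
#!/bin/bash
# Return 0 if the first argument (a numeric GID) appears among the remaining arguments.
gid_in_list() {
    local want="$1" g
    shift
    for g in "$@"; do
        [ "$g" = "$want" ] && return 0
    done
    return 1
}

# Report whether the current process can reach a device node through its groups.
check_device_access() {
    local dev="${1:-/dev/kfd}" dev_gid
    [ -e "$dev" ] || { echo "$dev: not present"; return 2; }
    dev_gid="$(stat -c '%g' "$dev")"
    # id -G prints the numeric primary and supplementary GIDs of this process
    if [ "$(id -u)" -eq 0 ] || gid_in_list "$dev_gid" $(id -G); then
        echo "$dev: accessible (owning GID $dev_gid)"
    else
        echo "$dev: GID $dev_gid missing from groups: $(id -G)"
        return 1
    fi
}
```

Calling check_device_access with no argument inspects /dev/kfd. In experiment 2 it should report the missing GID, and in experiments 1 and 4 it should report the device as accessible.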

Therefore, I see 2 ways of fixing this.

  1. Add a render group in the Docker image with ID 109 by default. This would be a "build time" fix and would break as soon as the host render group ID is not 109. The group ID could be passed as an argument of the build (ARG) but the image would not be portable.

    FROM rocm/dev-ubuntu-20.04:5.1
    
    RUN  groupadd -g 109 render && useradd -g 109 -ms /bin/bash newuser
    USER newuser
  2. The "run time" fix is to pass --group-add $(getent group render | cut -d':' -f 3) to docker run.
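The run time fix can be wrapped in a small helper that fails loudly when the host has no render group, instead of passing an empty --group-add. This is only a sketch: group_gid and render_run_args are hypothetical names, and the lookup reads a group(5)-format file directly, so unlike getent it does not see groups from LDAP or similar sources.

```shell
#!/bin/bash
# Print the numeric GID of a group from a group(5)-format file (default: /etc/group).
# Exits non-zero when the group is not listed.
group_gid() {
    local name="$1" file="${2:-/etc/group}"
    awk -F: -v n="$name" '$1 == n { print $3; found = 1; exit } END { exit !found }' "$file"
}

# Print the docker run argument for the render group, or fail with a message.
render_run_args() {
    local gid
    gid="$(group_gid render "${1:-/etc/group}")" || {
        echo "no render group found" >&2
        return 1
    }
    printf -- '--group-add %s\n' "$gid"
}

# Example (not executed here):
#   docker run --rm --user="$(id -u)" $(render_run_args) --device=/dev/kfd \
#       rocm/dev-ubuntu-20.04:5.1 rocminfo
```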
sergejcodes commented 1 year ago

Had a similar issue when I was building a Docker image with ROCm support.

The Problem

A non-root user can't access the GPU resources and has to run commands with sudo to get GPU access.

Groups

A user inside the Docker container has to be a member of the video and render groups to access the GPU without sudo.

Solution

Use a Docker ENTRYPOINT to create the render group at container start with the host system's render group ID, and add the user to it.

Bash Script

Create an entrypoint.sh script, and add it during the build to the image. The script will create the render group with the host's group id and assign the user to the video and render groups.

#!/bin/bash

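# Create the render group with the GID passed in from the host
sudo groupadd --gid "$RENDER_GID" render
# Add the container user to both GPU-related groups
sudo usermod -aG render "$USERNAME"
sudo usermod -aG video "$USERNAME"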

exec "$@"

Dockerfile

Inside the Dockerfile we create a new user and copy the entrypoint.sh script to the image. A basic example:

FROM ubuntu

ENV USERNAME=rocm-user
ARG USER_UID=1000
ARG USER_GID=$USER_UID

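# sudo is not part of the ubuntu base image, but entrypoint.sh relies on it
RUN apt-get update \
    && apt-get install -y --no-install-recommends sudo \
    && rm -rf /var/lib/apt/lists/*

RUN groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME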

COPY entrypoint.sh /tmp
RUN chmod 755 /tmp/entrypoint.sh

USER $USERNAME

ENTRYPOINT ["/tmp/entrypoint.sh"]

CMD ["/bin/bash"]

Build the image:

docker build -t rocm-image .

Terminal

When starting the container, pass the RENDER_GID environment variable. Let's assume the Docker image is called rocm-image.

export RENDER_GID=$(getent group render | cut -d: -f3) && docker run -it --device=/dev/kfd --device=/dev/dri -e RENDER_GID --group-add $RENDER_GID rocm-image /bin/bash

VS Code Devcontainer

Just add the following to the .devcontainer/devcontainer.json file and you're good to go: a VS Code devcontainer with GPU access.

{
  "build": { "dockerfile": "./Dockerfile" },
  "overrideCommand": false,
  "initializeCommand": "echo \"RENDER_GID=$(getent group render | cut -d: -f3)\" > .devcontainer/devcontainer.env",
  "containerEnv": { "HSA_OVERRIDE_GFX_VERSION": "10.3.0" },
  "runArgs": [
    "--env-file=.devcontainer/devcontainer.env",
    "--device=/dev/kfd",
    "--device=/dev/dri"
  ]
}
pawkubik commented 3 weeks ago

On one of our machines, the GID of the render group on the host collided with the ssh group in the image, so the groupadd in the entrypoint script failed. It's best to use the numeric group ID instead of the group name in the following usermod call, so you still get an acceptable result in such a scenario.
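One way to sketch that suggestion: create the render group only when the host's GID is still free in the image, and always add the user by numeric GID. The names group_name_for_gid and join_render_gid are hypothetical. The sketch assumes, as the comment above implies, that usermod -aG accepts a numeric GID, and it does not handle an image that already has a render group under a different GID.

```shell
#!/bin/bash
# Print the name of the group that owns a GID in a group(5)-format file
# (default: /etc/group). Prints nothing when the GID is free.
group_name_for_gid() {
    local gid="$1" file="${2:-/etc/group}"
    awk -F: -v g="$gid" '$3 == g { print $1; exit }' "$file"
}

# Put a user in whichever group carries the host's render GID.
join_render_gid() {
    local gid="$1" user="$2"
    if [ -z "$(group_name_for_gid "$gid")" ]; then
        # GID is free in the image: create the render group as in the original script
        sudo groupadd --gid "$gid" render
    fi
    # Use the numeric GID rather than the name, so a colliding group (e.g. ssh) is reused
    sudo usermod -aG "$gid" "$user"
}

# In entrypoint.sh this would replace the groupadd and "usermod -aG render" lines:
#   join_render_gid "$RENDER_GID" "$USERNAME"
```

With a collision the user ends up in a group with a misleading name (ssh in the case above), but the numeric GID is what the kernel checks against /dev/kfd, so GPU access still works.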