ROCm / ROCm-docker

Dockerfiles for the various software layers defined in the ROCm software platform
MIT License

How to assign a single GPU to container? #49

Open x1y2z3456 opened 5 years ago

x1y2z3456 commented 5 years ago

Hi, I was wondering whether it is possible to assign a single AMD GPU to a container. I have tried the following command (trying to assign GPU 0 to the container):

docker run -it --network=host --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow bash

But inside the container, using the command rocm-smi still shows two AMD GPUs:

root@ryan-desktop:/root# rocm-smi

==================== ROCm System Management Interface ================
================================================================
GPU  Temp  AvgPwr  SCLK    MCLK    Fan     Perf    SCLK OD  MCLK OD
0    31c   N/A     300Mhz  300Mhz  23.92%  manual  0%       0%
1    29c   N/A     300Mhz  300Mhz  23.92%  auto    0%       0%
================================================================
==================== End of ROCm SMI Log ===========================

Linux distribution version:
ryan@ryan-desktop:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial

Docker version:
ryan@ryan-desktop:~$ docker --version
Docker version 17.03.2-ce, build f5ec1e2

Docker image: rocm/tensorflow

Kernel version:
ryan@ryan-desktop:~$ uname -a
Linux ryan-desktop 4.15.0-38-generic #41~16.04.1-Ubuntu SMP Wed Oct 10 20:16:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ROCm version:
ryan@ryan-desktop:~$ apt show rocm-libs -a
Package: rocm-libs
Version: 1.9.211
Priority: optional
Section: devel
Maintainer: Advanced Micro Devices Inc.
Installed-Size: 1024 B
Depends: rocfft, rocrand, hipblas, rocblas
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 772 B
APT-Sources: http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack

CPU information: model name : Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz

GPU information: RX 580 4G *2

Thanks for help anyway

expertcloudconsulting commented 5 years ago

Any luck on the solution here please?

twobombs commented 5 years ago

You filter the selected device at the OpenCL level inside the container. I'm on mobile so I can't look it up now, but there is a way. Also, k8s GPU filter support is evolving.

Edit: export GPU_DEVICE_ORDINAL=1
Source: https://github.com/codeplaysoftware/computecpp-sdk/issues/107
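
The variable can also be set when the container is started instead of being exported afterwards. A minimal sketch, assuming the rocm/tensorflow image from the original post and that the compute runtime inside the image honours GPU_DEVICE_ORDINAL; the helper name build_run_cmd is hypothetical, and it only prints the command so it can be inspected before running:

```shell
#!/bin/sh
# Hypothetical helper: print a docker run command that sets
# GPU_DEVICE_ORDINAL inside the container. It only echoes the command.
build_run_cmd() {
    ordinal="$1"                       # GPU index as the runtime sees it
    image="${2:-rocm/tensorflow}"      # image name taken from the original post
    echo "docker run -it --device=/dev/kfd --device=/dev/dri" \
         "--group-add video -e GPU_DEVICE_ORDINAL=${ordinal} ${image} bash"
}

build_run_cmd 1
```

This only changes what the compute runtime enumerates. The device nodes themselves are still passed through, so tools that talk to the driver directly may continue to list every GPU.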

K8s GPU selection support for AMD: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
Note: Rancher 2.2+ has elaborate support for these k8s clusters, making these and other tasks more user friendly.

x1y2z3456 commented 4 years ago

Assigning a GPU with "export GPU_DEVICE_ORDINAL=1" works, but only "virtually": I can still see both GPUs with the "rocm-smi" command. I've also tried the newer ROCm driver, version 3.0, and it still cannot separate the GPUs "physically". What I really want is this: when I start a new container with the following command

docker run -it --network=host --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow bash

then of course, under the /dev directory, only one GPU is shown:
ls /dev
card0 renderD128

but rocm-smi shows 2 GPUs:
rocm-smi
GPU  Temp   AvgPwr  SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
0    47.0c  9.0W    800Mhz  100Mhz  0.0%  auto  162.0W  2%     0%
1    49.0c  11.0W   800Mhz  100Mhz  0.0%  auto  162.0W  1%     0%
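
To compare what the container was actually given with what rocm-smi reports, the render nodes can be counted directly. A small sketch; list_render_nodes is a hypothetical helper name and not part of ROCm, and it assumes the nodes live under /dev/dri:

```shell
#!/bin/sh
# List the DRM render nodes present under a directory (default /dev/dri).
# The directory argument is only there so the function can be tried on a
# scratch directory; inside a container it would be called with no argument.
list_render_nodes() {
    dir="${1:-/dev/dri}"
    for node in "$dir"/renderD*; do
        # If the glob matched nothing, the pattern itself comes back; skip it.
        [ -e "$node" ] || continue
        basename "$node"
    done
}

# Number of render nodes visible in this environment.
list_render_nodes | wc -l
```

If this prints 1 while rocm-smi lists two GPUs, the extra GPU is being discovered through some path other than the /dev/dri nodes.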

thanks for reply anyway

paklui commented 3 years ago

With the recent ROCm 3.9, I see only 1 GPU reported by rocm-smi inside the docker container. Maybe something has been fixed since this was last reported.

docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri/card8 --device=/dev/dri/renderD135 --cap-add=SYS_RAWIO --device=/dev/mem --group-add video --network host rocm/dev-ubuntu-18.04
root@login:/# /opt/rocm/bin/rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK     Fan   Perf  PwrCap  VRAM%  GPU%
0    41.0c  32.0W   930Mhz  1000Mhz  0.0%  auto  225.0W    0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================
root@login:/# ls -al /dev/dri/
card8       renderD135
root@login:/# ls -al /dev/dri/*
crw-rw---- 1 root video 226,   8 Dec 12 00:32 /dev/dri/card8
crw-rw---- 1 root video 226, 135 Dec 12 00:32 /dev/dri/renderD135
root@login:/#
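
Note that in the listing above card8 is paired with renderD135, so the render node number cannot be derived from the card number by a fixed offset on that machine. A sketch of looking the pairing up on the host through sysfs, assuming the usual /sys/class/drm/cardN/device/drm/ layout; render_node_for_card is a hypothetical helper name:

```shell
#!/bin/sh
# Print the render node that belongs to the same device as a card node.
# sysfs_root defaults to /sys/class/drm; it is a parameter so the lookup can
# be exercised against a scratch directory.
render_node_for_card() {
    card="$1"                               # e.g. card8
    sysfs_root="${2:-/sys/class/drm}"
    for path in "$sysfs_root/$card"/device/drm/renderD*; do
        [ -e "$path" ] || continue          # unmatched glob: nothing to print
        basename "$path"
        return 0
    done
    return 1
}

# Build the two --device flags for one GPU, if the lookup succeeds.
card="card0"
if node="$(render_node_for_card "$card")"; then
    echo "--device=/dev/dri/$card --device=/dev/dri/$node"
else
    echo "no render node found for $card" >&2
fi
```

The printed flags can then be pasted into a docker run command like the one above in place of the hand-picked card and render node.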

GowthamKudupudi commented 1 year ago

Yes, with recent ROCm it shows only one GPU, but when I try to build libtorch inside the container, it gives an error, something like "readkfd permission denied" and "is not allowed".
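
One thing that might be worth ruling out (an assumption, not a confirmed diagnosis of the error above) is whether the user inside the container can open /dev/kfd at all, for example because the group that owns the node on the host was not passed with --group-add. A rough check; check_rw is a hypothetical helper name:

```shell
#!/bin/sh
# Report whether the current user can open a device node for read and write.
check_rw() {
    dev="$1"
    if [ ! -e "$dev" ]; then
        echo "missing"
    elif [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "ok"
    else
        echo "denied"
    fi
}

echo "/dev/kfd: $(check_rw /dev/kfd)"
# Groups of the current user, to compare with the owning group of /dev/kfd.
id -Gn
ls -l /dev/kfd 2>/dev/null || true
```

If the result is "denied" and the owning group shown by ls is not in the id output, adding that group to the docker run command would be the first thing to try.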