Closed hammad93 closed 1 year ago
Use `lshw -C display` to show the available GPUs. Using the Standard NC6 Promo on Azure, we get the following output:
```
  *-display
       description: VGA compatible controller
       product: Hyper-V virtual VGA
       vendor: Microsoft Corporation
       physical id: 8
       bus info: pci@0000:00:08.0
       version: 00
       width: 32 bits
       clock: 33MHz
       capabilities: vga_controller bus_master rom
       configuration: driver=hyperv_fb latency=0
       resources: irq:11 memory:f8000000-fbffffff memory:c0000-dffff
  *-display UNCLAIMED
       description: 3D controller
       product: GK210GL [Tesla K80]
       vendor: NVIDIA Corporation
       physical id: 1
       bus info: pci@0001:00:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list
       configuration: latency=0
       resources: iomemory:100-ff iomemory:140-13f memory:41000000-41ffffff memory:1000000000-13ffffffff memory:1400000000-1401ffffff
```
Based on this answer, https://askubuntu.com/questions/1344129/what-does-display-unclaimed-mean-in-response-to-sudo-lshw-c-video, the `UNCLAIMED` flag means no driver is bound to the device. This is likely because the standard Azure VM image (Ubuntu 20.04) does not ship with the NVIDIA driver installed.
After installing the helper package with `apt install ubuntu-drivers-common`, we can list the appropriate drivers for the K80 using `ubuntu-drivers devices`, which produces the output below. We can then install the driver with `apt install nvidia-driver-470`, since it is marked "recommended". Afterwards, the `nvidia-smi` command, which was not available before, works as a sanity check for the GPU.
```
WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
ERROR:root:could not open aplay -l
Traceback (most recent call last):
  File "/usr/share/ubuntu-drivers-common/detect/sl-modem.py", line 35, in detect
    aplay = subprocess.Popen(
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'aplay'
== /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0 ==
modalias : pci:v000010DEd0000102Dsv000010DEsd0000106Cbc03sc02i00
vendor   : NVIDIA Corporation
model    : GK210GL [Tesla K80]
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-460-server - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
```
Reference https://phoenixnap.com/kb/install-nvidia-drivers-ubuntu#ftoc-heading-8
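The driver-installation steps above can be summarized as the following shell sketch (a sketch assuming Ubuntu 20.04 on the Azure NC6 image; the "recommended" driver version reported by `ubuntu-drivers` may differ on other images):

```shell
# Identify the GPU and check whether a driver is bound to it
sudo lshw -C display

# Install the helper that maps detected hardware to packaged drivers
sudo apt-get update
sudo apt-get install -y ubuntu-drivers-common

# List candidate drivers for the detected NVIDIA device
ubuntu-drivers devices

# Install the driver marked "recommended" in the list above
sudo apt-get install -y nvidia-driver-470

# Sanity check: should now report the Tesla K80
nvidia-smi
```

A reboot may be required before `nvidia-smi` can talk to the driver, depending on the kernel module state.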
According to this documentation, https://www.tensorflow.org/install/docker#gpu_support, we need to add a `--gpus all` flag to the `docker run` command, resulting in `docker run --gpus all -it -p 8888:8888 huraim`. However, the current build is not set up for this and first needs to follow https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker. This is clear from the error received when running the GPU-enabled `docker run` command:
```
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
```
After running the following commands from the installation guide above, we were able to launch the GPU-enabled command `docker run --gpus all -it -p 8888:8888 huraim` and run `nvidia-smi` inside the container to check the status of the GPU.
```shell
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
```
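After installing `nvidia-docker2`, the Docker daemon needs a restart before the NVIDIA runtime is available. A quick end-to-end check might look like the sketch below (the CUDA image tag is an assumption; any public `nvidia/cuda` base image matching the host driver version should work):

```shell
# Restart Docker so it picks up the NVIDIA container runtime
sudo systemctl restart docker

# Run nvidia-smi inside a throwaway container; if the Tesla K80 is
# listed, --gpus all is working end to end
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```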
To close the issue, we need to make sure we can run training with GPU acceleration.
Okay, tested: training steps that previously took 2 to 3 seconds now finish well under a second with the GPU, roughly a 3x to 20x reduction in training time. Documented in a recent README.md commit.
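For confirming GPU acceleration from the training framework's side (not just `nvidia-smi`), a hypothetical check like this could be run against the image, assuming it ships TensorFlow as the linked Docker docs suggest and that the image's entrypoint allows overriding the command:

```shell
# Inside the container: list GPUs visible to TensorFlow.
# Expect one entry for the Tesla K80; an empty list means
# training would silently fall back to CPU.
docker run --rm --gpus all huraim \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```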
It seems like the Docker container does not use GPU-accelerated training on the Azure instance. This issue documents the problem, proposes solutions, hopefully solves it, and if not, provides alternatives.