apatel726 / HurricaneDissertation

Docker container does not use GPU #52

Closed hammad93 closed 1 year ago

hammad93 commented 2 years ago

It seems that the Docker container does not use GPU-accelerated training on the Azure instance. This issue documents the problem and proposes solutions that will hopefully fix it; if they do not, it lists alternatives.

hammad93 commented 2 years ago

Use lshw -C display to list the available display devices, including the GPU. On the Standard NC6 Promo instance on Azure, the output is:

  *-display                 
       description: VGA compatible controller
       product: Hyper-V virtual VGA
       vendor: Microsoft Corporation
       physical id: 8
       bus info: pci@0000:00:08.0
       version: 00
       width: 32 bits
       clock: 33MHz
       capabilities: vga_controller bus_master rom
       configuration: driver=hyperv_fb latency=0
       resources: irq:11 memory:f8000000-fbffffff memory:c0000-dffff
  *-display UNCLAIMED
       description: 3D controller
       product: GK210GL [Tesla K80]
       vendor: NVIDIA Corporation
       physical id: 1
       bus info: pci@0001:00:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list
       configuration: latency=0
       resources: iomemory:100-ff iomemory:140-13f memory:41000000-41ffffff memory:1000000000-13ffffffff memory:1400000000-1401ffffff

Based on this answer, https://askubuntu.com/questions/1344129/what-does-display-unclaimed-mean-in-response-to-sudo-lshw-c-video , "UNCLAIMED" seems to mean that no driver is installed for the device. This could be because the Azure VM was created from the standard Linux (Ubuntu 20.04) image.
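The check above can be automated. The following is a minimal sketch, not part of the repository: the function name unclaimed_displays and the trimmed LSHW_SAMPLE text are illustrative assumptions, and in practice the text would come from running lshw -C display on the host.

```python
def unclaimed_displays(lshw_text):
    """Return product names of display devices that lshw marks UNCLAIMED."""
    products = []
    in_unclaimed = False
    for line in lshw_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*-"):
            # A new device section starts; remember whether it is an
            # unclaimed display device (i.e. no driver is bound to it).
            in_unclaimed = (
                stripped.startswith("*-display") and "UNCLAIMED" in stripped
            )
        elif in_unclaimed and stripped.startswith("product:"):
            products.append(stripped.split(":", 1)[1].strip())
    return products


# Trimmed copy of the lshw output shown above (hypothetical sample input).
LSHW_SAMPLE = """\
  *-display
       description: VGA compatible controller
       product: Hyper-V virtual VGA
       configuration: driver=hyperv_fb latency=0
  *-display UNCLAIMED
       description: 3D controller
       product: GK210GL [Tesla K80]
       configuration: latency=0
"""

print(unclaimed_displays(LSHW_SAMPLE))  # ['GK210GL [Tesla K80]']
```

An empty list would suggest that every display device already has a driver bound to it.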

hammad93 commented 2 years ago

After installing the package with apt install ubuntu-drivers-common, we can list the appropriate drivers for the K80 using ubuntu-drivers devices, which gives the output below. We then installed the driver with apt install nvidia-driver-470 because it is marked "recommended". After that, the nvidia-smi command, which was not available before, worked as a sanity check of the GPU.

WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
ERROR:root:could not open aplay -l
Traceback (most recent call last):
  File "/usr/share/ubuntu-drivers-common/detect/sl-modem.py", line 35, in detect
    aplay = subprocess.Popen(
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'aplay'
== /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0 ==
modalias : pci:v000010DEd0000102Dsv000010DEsd0000106Cbc03sc02i00
vendor   : NVIDIA Corporation
model    : GK210GL [Tesla K80]
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-460-server - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

Reference https://phoenixnap.com/kb/install-nvidia-drivers-ubuntu#ftoc-heading-8
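Picking the recommended driver out of the ubuntu-drivers devices output can also be scripted. This is a rough sketch under the assumption that the output keeps the "driver : name - distro ... recommended" layout shown above; recommended_driver and DRIVERS_SAMPLE are hypothetical names used only for illustration.

```python
def recommended_driver(devices_text):
    """Return the driver package flagged 'recommended', or None if absent."""
    for line in devices_text.splitlines():
        # Only "driver   : ..." lines describe installable packages;
        # warnings, tracebacks, and the vendor/model lines are skipped.
        if not line.startswith("driver"):
            continue
        _, _, rest = line.partition(":")
        fields = rest.split()
        if fields and "recommended" in fields[1:]:
            return fields[0]
    return None


# Trimmed copy of the output shown above (hypothetical sample input).
DRIVERS_SAMPLE = """\
WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
vendor   : NVIDIA Corporation
model    : GK210GL [Tesla K80]
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
"""

print(recommended_driver(DRIVERS_SAMPLE))  # nvidia-driver-470
```

The returned name is the package that would be passed to apt install.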

hammad93 commented 2 years ago

According to the TensorFlow Docker documentation, https://www.tensorflow.org/install/docker#gpu_support , we need to add the --gpus all flag to the docker run command, resulting in docker run --gpus all -it -p 8888:8888 huraim . However, the current build is not set up for this and first needs to follow https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker . This becomes clear when running the GPU docker run command and receiving the error,

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

hammad93 commented 2 years ago

After running the following commands from the installation guide linked above, we were able to launch the container with docker run --gpus all -it -p 8888:8888 huraim and run nvidia-smi to check the status of the GPU.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2

To close the issue, we need to make sure we can run training with GPU acceleration.
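One way to confirm the GPU is visible before timing a training run is a small check run inside the container. This is a sketch only: it assumes Python 3.7+ and, for the second check, a TensorFlow 2.x install; the helper names nvidia_smi_ok and tensorflow_gpus are hypothetical and not part of the project.

```python
import shutil
import subprocess


def nvidia_smi_ok():
    """True if nvidia-smi is on PATH and exits with status 0."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0


def tensorflow_gpus():
    """Names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf  # assumption: TensorFlow 2.x is installed
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    print("nvidia-smi works:", nvidia_smi_ok())
    print("TensorFlow GPUs:", tensorflow_gpus())
```

If nvidia-smi works but TensorFlow reports an empty list, the driver and container runtime are probably fine and the problem is more likely in the TensorFlow or CUDA setup inside the image.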

hammad93 commented 1 year ago

Okay, tested: before it took about 2 to 3 seconds, and with the GPU it is well under a second, a 3x to 20x decrease in training time. Documented in a recent README.md commit.