SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

Issue with Parallel GPU Computing #68

Closed: shantanudev closed this issue 7 years ago

shantanudev commented 7 years ago

Hi Sean,

I was wondering if you have faced an issue where not all of the GPUs are being utilized, as is evident in the output below. It also will not let me use a larger batch size even though I have more GPUs.

+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   82C    P0   113W / 149W |  10815MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   47C    P0    73W / 149W |    208MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   61C    P0    57W / 149W |    208MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   51C    P0    71W / 149W |    208MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   63C    P0    57W / 149W |    208MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   49C    P0    70W / 149W |    208MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   64C    P0    57W / 149W |    208MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   49C    P0    71W / 149W |    208MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      6318    C   /home/ec2-user/src/torch/install/bin/luajit  10757MiB |
|    1      6318    C   /home/ec2-user/src/torch/install/bin/luajit    149MiB |
|    2      6318    C   /home/ec2-user/src/torch/install/bin/luajit    149MiB |
|    3      6318    C   /home/ec2-user/src/torch/install/bin/luajit    149MiB |
|    4      6318    C   /home/ec2-user/src/torch/install/bin/luajit    149MiB |
|    5      6318    C   /home/ec2-user/src/torch/install/bin/luajit    149MiB |
|    6      6318    C   /home/ec2-user/src/torch/install/bin/luajit    149MiB |
|    7      6318    C   /home/ec2-user/src/torch/install/bin/luajit    149MiB |
+-----------------------------------------------------------------------------+
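
A minimal sanity check, assuming the standard cutorch package, is to confirm that Torch itself can see all eight devices:

require 'cutorch'
-- Should print 8 on this machine if all K80s are visible to Torch.
print('CUDA devices visible to Torch: ' .. cutorch.getDeviceCount())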

SeanNaren commented 7 years ago

I haven't got a multi-GPU node to test this on, but have you set the -nGPU flag correctly like below?

th Train.lua -nGPU 8
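
For reference, the usual Torch idiom behind a flag like this (a sketch only; the actual wiring in the script may differ) is to wrap the network in nn.DataParallelTable so each mini-batch is split across the listed GPUs:

require 'cunn'

-- Sketch: replicate `model` across GPUs 1..nGPU, splitting each
-- mini-batch along dimension 1 (the batch dimension).
local function makeDataParallel(model, nGPU)
   if nGPU > 1 then
      local gpus = torch.range(1, nGPU):totable()
      local dpt = nn.DataParallelTable(1)
      dpt:add(model, gpus)
      return dpt:cuda()
   end
   return model:cuda()
end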

shantanudev commented 7 years ago

@SeanNaren Yes, I have done this. Basically it limits me to a batch size of about 30 even though I have 8 GPUs.
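
If the batch were genuinely being split across devices, eight GPUs at roughly 30 samples each should allow a total batch of around 240, so a cap near 30 suggests everything is staying on one card. A quick check, assuming the network Train.lua builds is exposed as model (a hypothetical name here), is:

-- Expect 'nn.DataParallelTable' here once -nGPU > 1 has taken effect.
print(torch.type(model))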

SeanNaren commented 7 years ago

I just ran this on our internal AWS K80 server and it worked fine:

[screenshot: nvidia-smi output from the AWS K80 server, 2016-11-17 17:05, showing activity across all GPUs]

The server was already running another job; however, all GPUs were used when I ran th Train.lua -nGPU. Are you using the latest branch?

shantanudev commented 7 years ago

@SeanNaren Hmm, let me do some investigation on my end. I will let you know.
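
One thing I can check while investigating, a diagnostic sketch using standard cutorch calls, is per-device memory from inside Torch while training runs:

require 'cutorch'
-- Print free vs. total memory on every visible device.
for dev = 1, cutorch.getDeviceCount() do
   local freeBytes, totalBytes = cutorch.getMemoryUsage(dev)
   print(('GPU %d: %.0f MiB free of %.0f MiB'):format(
      dev, freeBytes / 2^20, totalBytes / 2^20))
end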