Training.py not running on GPU

imflash217 commented 5 years ago

Hi @f90 , I have a GPU in my paperspace machine. I am trying to run Training.py using the following command: python Training.py with cfg.full_44KHz. But the script runs only on CPU, not on GPU. I have just forked this repo and executing the Training.py script. How can I make sure that it runs on GPU. I am a Pytorch guy, and trying to refactor the codebase for Pytorch. Please help. Thak you

f90 commented 5 years ago

Can you verify that you have tensorflow-gpu pip package installed and not tensorflow (which is the CPU-only version)? That would explain the problem since it would fall back to CPU in that case.

Also since this is a cloud service you are running it on, check whether the machine shows a GPU when running nvidia-smi in the same environment as the one you are running Training.py, maybe the GPU is not properly attached to the instance?

imflash217 commented 5 years ago

I have both tensorflow and tensorflow-gpu pip package installed. nvidia-smi shows the attached GPU to the same instance.

I am able to successfully run GPU enabled Pytorch scripts from the same environment on the same GPU.

I am trying to remove tensorflow and install only tensorflow-gpu now.

f90 commented 5 years ago

Yes you should remove tensorflow if you have tensorflow-gpu installed already. Otherwise you can also insert a small code snippet into Training.py to list the available GPU devices and see whether it's properly detected there.

imflash217 commented 5 years ago

Hi @f90 Now the script runs on GPU but it throws the following ERROR after filling the shuffle buffer:

INFO - Waveunet Training - Running command 'run'
INFO - Waveunet Training - Started
SCRIPT START
EPOCH: 0
Dataset ready!
Training...
Sep_Vars: 10265550
Num of variables65
2019-07-25 05:10:09.872823: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-25 05:10:10.286584: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-25 05:10:10.286914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Quadro P4000 major: 6 minor: 1 memoryClockRate(GHz): 1.48
pciBusID: 0000:00:05.0
totalMemory: 7.92GiB freeMemory: 7.83GiB
2019-07-25 05:10:10.286964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-07-25 05:10:10.640890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-25 05:10:10.640952: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2019-07-25 05:10:10.640968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2019-07-25 05:10:10.641194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7566 MB memory) -> physical GPU (device: 0, name: Quadro P4000, pci bus id: 0000:00:05.0, compute capability: 6.1)
2019-07-25 05:10:27.643833: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 2054 of 4000
2019-07-25 05:10:35.917445: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
2019-07-25 05:10:36.175698: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-07-25 05:10:36.175820: E tensorflow/stream_executor/cuda/cuda_dnn.cc:463] possibly insufficient driver version: 384.183.0
2019-07-25 05:10:36.175842: E tensorflow/stream_executor/cuda/cuda_dnn.cc:427] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2019-07-25 05:10:36.175859: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
Aborted (core dumped)

Any insights on this.

f90 commented 5 years ago

This seems like an NVIDIA issue, not related to my code.

Whats your NVIDIA Driver, CUDA and CuDNN versions? Looks like cuDNN version might not be the right one. Tensorflow 1.8 requires CUDA 9, and I have installed cuDNN 7.5.0 alongside and it works. I also noticed your driver seems quite old according to the logs, maybe you can update that?

imflash217 commented 5 years ago

Updating the Nvidia driver to version-396+ solved the issue. Thanks @f90

f90 / Wave-U-Net

Training.py not running on GPU #33