Closed: jax79sg closed this issue 4 years ago
Hi, I managed to figure out what the issue could be and applied some workarounds in the code.
Potential problem: RTX cards tend to have this issue; it lies in TensorFlow rather than in the application code. As of this comment, it has not been fixed in the latest TF 1.14, CUDA 10.1, and cuDNN 7.6.
Workaround: ensure the following two GPU options are set in TF sessions (a minimal sketch follows):
gpu_options.allow_growth=True
gpu_options.per_process_gpu_memory_fraction = x #x is a fraction that needs to be tuned on individual cards
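A minimal sketch of setting both options on a TF 1.x session in a standalone script; the fraction 0.8 is just an example value to experiment with, not a recommendation:

```python
import tensorflow as tf  # TF 1.x API

# Build a session config with both workaround options set.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # allocate GPU memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.8  # example value; tune per card

with tf.Session(config=config) as sess:
    # ... build and run the graph as usual ...
    pass
```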
For the Jasper code, I've implemented a stopgap script to deal with it.
sed -i.bak '579i\ tf_config.gpu_options.per_process_gpu_memory_fraction = 0.8\' /home/workspace/OpenSeq2Seq/open_seq2seq/models/model.py
sed -i.bak '35i\ sess_config.gpu_options.per_process_gpu_memory_fraction = 0.8\' /home/workspace/OpenSeq2Seq/open_seq2seq/utils/funcs.py
sed -i.bak '229i\ sess_config.gpu_options.per_process_gpu_memory_fraction = 0.8\' /home/workspace/OpenSeq2Seq/open_seq2seq/utils/funcs.py
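Note that these sed calls splice the memory-fraction line into OpenSeq2Seq at fixed line numbers, so they will break if the upstream files change. As a rough illustration, the patched region of open_seq2seq/utils/funcs.py should end up looking something like this; only the per_process_gpu_memory_fraction line comes from the sed command above, and the surrounding lines are assumed context rather than the exact upstream code:

```python
# Hypothetical sketch of the patched region in open_seq2seq/utils/funcs.py.
sess_config = tf.ConfigProto(allow_soft_placement=True)   # assumed existing config
sess_config.gpu_options.allow_growth = True               # assumed existing option
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.8  # line inserted by sed
```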
Hi,
I am trying to run Jasper training on LibriSpeech but encountered the following issue.
The code hangs at the end of the following stack trace, with GPU memory allocated but no activity. I haven't been able to resolve it by changing versions of CUDA and cuDNN. Is there a specific version of TF, CUDA, and cuDNN required? Or am I missing something?