NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Unable to Fire Openseq2seq Jasper training #496

Closed · pratapaprasanna closed this 4 years ago

pratapaprasanna commented 4 years ago

Hi all,

I have been trying to start a Jasper training run with OpenSeq2Seq, but the training does not start.

$ uname -a
Linux shaktimaan 4.18.0-24-generic #25~18.04.1-Ubuntu SMP Thu Jun 20 11:13:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I have installed the drivers freshly, yet the training does not start at all.

It has been stuck here for almost 12 hours:

Colocation members, user-requested devices, and framework assigned devices, if any:
  ForwardPass/fully_connected_ctc_decoder/fully_connected/bias/Initializer/zeros (Const)
  ForwardPass/fully_connected_ctc_decoder/fully_connected/bias (VariableV2) /device:GPU:0
  ForwardPass/fully_connected_ctc_decoder/fully_connected/bias/Assign (Assign) /device:GPU:0
  ForwardPass/fully_connected_ctc_decoder/fully_connected/bias/read (Identity) /device:GPU:0
  Loss_Optimization/gradients/AddN (AddN) /device:GPU:0
  Loss_Optimization/FP32-master-copy/IsVariableInitialized_109 (IsVariableInitialized) /device:GPU:0
  Loss_Optimization/FP32-master-copy/cond_109/read/Switch (RefSwitch) /device:GPU:0
  Loss_Optimization/FP32-master-copy/cond_109/Switch_1 (Switch)
  Loss_Optimization/FP32-master-copy/ForwardPass/fully_connected_ctc_decoder/fully_connected/bias/IsVariableInitialized (IsVariableInitialized) /device:GPU:0
  Loss_Optimization/FP32-master-copy/ForwardPass/fully_connected_ctc_decoder/fully_connected/bias/cond/read/Switch (RefSwitch) /device:GPU:0
  Loss_Optimization/FP32-master-copy/ForwardPass/fully_connected_ctc_decoder/fully_connected/bias/cond/Switch_1 (Switch)
  Loss_Optimization/FP32-master-copy/cond_109/read/Switch_Loss_Optimization/FP32-master-copy/ForwardPass/fully_connected_ctc_decoder/fully_connected/bias (Switch)
  Loss_Optimization/cond_1/Assign_109/Switch (RefSwitch) /device:GPU:0
  Loss_Optimization/cond_1/Assign_109 (Assign) /device:GPU:0
  save/Assign (Assign) /device:GPU:0
  save_1/Assign (Assign) /device:GPU:0
  report_uninitialized_variables/IsVariableInitialized_541 (IsVariableInitialized) /device:GPU:0
  report_uninitialized_variables_1/IsVariableInitialized_541 (IsVariableInitialized) /device:GPU:0
  save_2/Assign_550 (Assign) /device:GPU:0

2019-09-05 20:24:08.016954: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING:tensorflow:From /home/vz/miniconda3/envs/gp_0_1/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.

*** Running evaluation on a validation set:

Can anyone help me understand the issue?

When I run other trainings I can see that the GPU is being utilized, but I do not know why it is not working here with TensorFlow or OpenSeq2Seq.
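
As a sanity check (standard TensorFlow 1.x APIs, nothing OpenSeq2Seq-specific), the following confirms whether TensorFlow itself can see the GPUs at all:

$ python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
$ python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"

If /device:GPU:0 does not appear in the first listing, the problem sits below TensorFlow (driver or CUDA level) rather than in OpenSeq2Seq.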

I followed all the steps in the installation instructions.

Thank you.

Environment

$ pip freeze | grep tensorflow
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
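
For reference, tensorflow-gpu 1.14 is built against CUDA 10.0, so a quick cross-check of the driver and toolkit versions can rule out a mismatch (standard commands; the version.txt path assumes a default CUDA install location):

$ nvidia-smi
$ cat /usr/local/cuda/version.txt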
pratapaprasanna commented 4 years ago

Hi all,

It seems the issue was with my NVLink.

If your training is taking too long to start, please check that the NVLink links are up and not down.

The status can be queried with the following command:

$ nvidia-smi nvlink --status
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-797d7153-ea28-d678-dc38-859b914d6dd7)
     Link 0: 25.781 GB/s
     Link 1: 25.781 GB/s
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-8807c553-7571-582d-c2ee-02993527b0a6)
     Link 0: 25.781 GB/s
     Link 1: 25.781 GB/s
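
A link that is down does not report a bandwidth figure like the ones above. One possible workaround while the hardware issue is sorted out (assuming the multi-GPU run communicates through NCCL, as Horovod-based setups do) is to disable peer-to-peer transfers so traffic falls back to host memory; NCCL_P2P_DISABLE is a standard NCCL environment variable, and run.py with --config_file and --mode is the usual OpenSeq2Seq entry point:

$ NCCL_P2P_DISABLE=1 python run.py --config_file=<your_config> --mode=train_eval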

Thanks