NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Model training stops after 1 step for Speech to Text Jasper model #539

Open · conqueror7 opened this issue 4 years ago

conqueror7 commented 4 years ago

Hi, I am trying to train a speech-to-text model on my own dataset, using the checkpoint of the Jasper DR 10x5 model as the starting point. The Jasper model reference link is: https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html#decoders-ref

I created my dataset and, using the config file from the link above, launched training with:

python run.py --mode=train --config_file=example_configs/speech2text/jasper10x5_LibriSpeech_nvgrad_masks.py --enable_logs --continue_learning

In the config file I changed the training params: dataset_files now points to my CSV dataset, and I set num_gpus=1, num_epochs=4, batch_size_per_gpu=32. The code runs without errors, but training stops after the 1st step. I am not able to figure out what is triggering sess.should_stop() in the train function in open_seq2seq/utils/funcs.py.
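For reference, here is a minimal sketch of the config overrides I am describing (the CSV path is a placeholder for my actual dataset; key names follow the jasper10x5 example config shipped with OpenSeq2Seq, but please double-check against your copy):

```python
# Sketch of the edited sections only, not the full config file.
base_params = {
    # ... other parameters kept from the jasper10x5 example config ...
    "num_gpus": 1,
    "num_epochs": 4,
    "batch_size_per_gpu": 32,
}

train_params = {
    "data_layer_params": {
        # Placeholder path: points to my custom CSV manifest.
        "dataset_files": ["data/my_dataset/train.csv"],
        # ... other data layer parameters ...
    },
}
```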

This leaves the training incomplete. I have around 95K files and batch_size_per_gpu is set to 32, so a single epoch alone should take roughly 2,900 steps. Can you explain why the session stops after the 1st step?
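As a sanity check on the expected step count (plain arithmetic, not OpenSeq2Seq code):

```python
# Rough estimate of the number of training steps, assuming ~95K files.
num_files = 95_000
batch_size_per_gpu = 32
num_epochs = 4

steps_per_epoch = num_files // batch_size_per_gpu   # ~2968
total_steps = steps_per_epoch * num_epochs          # ~11875
print(steps_per_epoch, total_steps)
```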

Configuration used:
- Python version: 3.6.10
- TensorFlow version: 1.14.0
- OpenSeq2Seq commit ID: 61204b212cfe5c9ceda2be816b9052e9caf021a9
- Model: Jasper DR 10x5
- Model reference link: https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html#decoders-ref
- GPU: 1x P100
- CUDA version: V10.0.130

aayushkubb commented 3 years ago

Can you share the log? One likely cause: if you trained with --continue_learning and the global step restored from the checkpoint is already at (or beyond) the last step implied by your num_epochs and dataset size, training stops right after the first step.

The other possible reason is something in your config. If you post the trace, I may be able to help you further.
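A quick way to check the first hypothesis is to read the global step stored in the checkpoint you restore from. This is only a sketch: it assumes the variable is saved under TensorFlow's default name global_step, and the checkpoint directory is a placeholder you should replace with your actual logdir:

```python
import tensorflow as tf

# Placeholder: replace with the logdir / checkpoint directory you restore from.
ckpt_dir = "jasper_log_folder"

ckpt_path = tf.train.latest_checkpoint(ckpt_dir)
reader = tf.train.load_checkpoint(ckpt_path)

# If this value is already at or beyond the target step for your new dataset
# and num_epochs, the training loop's stop condition (sess.should_stop())
# will trigger almost immediately, which matches the behaviour you describe.
print("global_step in checkpoint:", reader.get_tensor("global_step"))
```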