NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

jasper10x5_LibriSpeech_nvgrad training time #425

Closed riyijiye closed 5 years ago

riyijiye commented 5 years ago

For jasper10x5_LibriSpeech_nvgrad, I noticed that some training details are described in the ticket below: https://github.com/NVIDIA/OpenSeq2Seq/issues/415

Can anyone share the training time information as well (exactly following the jasper10x5_LibriSpeech_nvgrad.py example config)?

thanks!

vsl9 commented 5 years ago

We are not publishing training time benchmarks. Since training time depends heavily on numerous hardware and software factors (GPU, RAM, I/O bandwidth; TensorFlow, Horovod, CUDA, OS, driver versions, etc.), it requires significant effort to measure and report it in a consistent manner. A better venue for such benchmarks might be MLPerf.org.

riyijiye commented 5 years ago

thanks!

germany-zhu commented 5 years ago

I have been training it for a whole month and it still hasn't finished... I changed the batch size from 32 to 8 and don't use Horovod. I installed OpenSeq2Seq following the general installation instructions and am training with 4 GeForce GTX 1080 Ti GPUs. Has anyone else run into the problem that the training time is too long?
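For reference, the changes described above map onto the config roughly like this (a minimal sketch using the usual OpenSeq2Seq `base_params` keys; the exact values in jasper10x5_LibriSpeech_nvgrad.py may differ):

```python
# Sketch of the overrides described above, assuming the standard
# OpenSeq2Seq base_params keys; not a verbatim copy of the example config.
base_params = {
    "use_horovod": False,       # training without Horovod
    "num_gpus": 4,              # 4x GTX 1080 Ti instead of 8x V100
    "batch_size_per_gpu": 8,    # reduced from 32 to fit 11 GB of GPU memory
    "num_epochs": 400,          # the example config's default
    # ... model, encoder, decoder, and data layer params left unchanged
}
```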

borisgin commented 5 years ago

How many epochs? What is the time per iteration? We used a DGX-1 with 8x V100, and we trained with Horovod and mixed precision, which is ~2.5x faster than float32.
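For context, mixed precision in OpenSeq2Seq is enabled in the config itself; a minimal sketch, assuming the `dtype`/`loss_scaling` keys from the mixed-precision docs, looks like:

```python
# Sketch: turning on mixed-precision training in an OpenSeq2Seq config.
# Assumes the documented "dtype" and "loss_scaling" base_params keys.
base_params = {
    "dtype": "mixed",           # float16 compute with float32 master weights
    "loss_scaling": "Backoff",  # automatic loss scaling to avoid fp16 underflow
    # ... other params as in jasper10x5_LibriSpeech_nvgrad.py
}
```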

blisc commented 5 years ago

I would recommend lowering num_epochs to 50, or maybe even 40; I expect the WER difference will only be about 1-2%. I would also either remove speed perturbation or precompute the perturbed files and store them as wav. 400 epochs with a lower batch size, fewer GPUs, and an older GPU architecture will slow training down by an order of magnitude.
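If you go the precompute route, a minimal offline sketch (using librosa/soundfile; the paths and perturbation factors below are just illustrative, not what the config uses) could be:

```python
# Sketch: precompute speed-perturbed copies of the training wavs offline,
# so the augmentation is not recomputed on the fly every epoch.
# File paths and perturbation factors are hypothetical examples.
import librosa
import soundfile as sf

def write_speed_perturbed(in_wav, out_wav, factor):
    """Write a copy of in_wav played at `factor` times the original speed
    (0.9 = slower, 1.1 = faster)."""
    y, sr = librosa.load(in_wav, sr=None)  # keep the original sample rate
    y_perturbed = librosa.resample(y, orig_sr=sr, target_sr=int(sr / factor))
    sf.write(out_wav, y_perturbed, sr)     # reinterpret at the original rate

for factor in (0.9, 1.1):
    write_speed_perturbed("train/sample.wav", f"train/sample_sp{factor}.wav", factor)
```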

germany-zhu commented 5 years ago

My time per step is about 2.5 seconds, which I think is reasonable. Thanks for your explanations; now I understand why it is so slow. I will switch to GTX 2080 GPUs for faster training, because I want to reach Jasper's best performance in a reasonable time. Another question: how can I build TensorFlow from source with a custom CTC decoder operation when using the NVIDIA TensorFlow Docker container?

blisc commented 5 years ago

If you are looking to build our current CTC decoder with language model, it should already be included in the NVIDIA TensorFlow container.

We should have an update to the CTC decoder with language model soon that does not require rebuilding TensorFlow.
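For anyone wiring the LM decoder into their config in the meantime, the decoder params in the Jasper eval configs look roughly like this (key names and paths below are assumptions based on the OpenSeq2Seq docs; double-check them against the current example configs):

```python
# Sketch of eval-time decoder params with the language model enabled.
# Key names and file paths are assumptions, not a verbatim copy;
# check the ctc_decoder_with_lm docs / example configs for the real ones.
base_params = {
    "decoder_params": {
        "use_language_model": True,
        "beam_width": 512,
        "alpha": 2.0,   # language model weight
        "beta": 1.0,    # word insertion bonus
        "decoder_library_path": "ctc_decoder_with_lm/libctc_decoder_with_kenlm.so",
        "lm_path": "language_model/4-gram.binary",
        "trie_path": "language_model/trie.binary",
        "alphabet_config_path": "open_seq2seq/test_utils/toy_speech_data/vocab.txt",
    },
}
```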

xw1324832579 commented 5 years ago

@blisc Hello, I have run Jasper on the same four machines as @germany-zhu, and after 3 epochs the WER is about 37%. Is that normal? I changed nothing but the learning rate, which I set to 0.05. Also, I didn't use Docker. Does that matter? Can Docker speed up the training process, or should I switch to faster machines to train Jasper?

borisgin commented 5 years ago

1) We used a DGX-1 server with 8 V100s to train large Jasper models, and it takes around a week. On a desktop with 4 cards this can take a lot of time. If you want to train a model from scratch, I would recommend training a smaller model first, for example with 5 blocks and 4 layers per block (see the sketch below).
2) Training with Horovod and Docker is significantly faster.
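As a rough illustration of what a smaller model means in the config, the encoder is defined as a list of convolutional blocks where "repeat" controls the layers per block; a 5-block, 4-layers-per-block sketch (channel counts and kernel sizes are illustrative, and the dict keys follow the Jasper example configs, so verify them there) might look like:

```python
# Sketch of a smaller Jasper-style encoder: 5 blocks with 4 sub-layers each.
# Channel counts and kernel sizes are illustrative, not a tuned configuration.
convnet_layers = []
for num_channels, kernel_size in [(256, 11), (384, 13), (512, 17), (640, 21), (768, 25)]:
    convnet_layers.append({
        "type": "conv1d",
        "repeat": 4,                   # 4 layers per block instead of 5
        "kernel_size": [kernel_size],
        "stride": [1],
        "num_channels": num_channels,
        "padding": "SAME",
        "dilation": [1],
        "dropout_keep_prob": 0.8,
    })
```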

Shujian2015 commented 4 years ago

We used 8x V100 on GCP. Each epoch takes about 22 minutes with the default settings: https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/jasper10x5_LibriSpeech_nvgrad_masks.py
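For a rough sense of total wall-clock time, that per-epoch figure translates as follows (simple arithmetic against the config's 400-epoch default and the 50 epochs suggested above):

```python
# Back-of-the-envelope training time from the ~22 min/epoch figure above.
minutes_per_epoch = 22
for num_epochs in (400, 50):  # config default vs. the suggested shorter run
    hours = num_epochs * minutes_per_epoch / 60
    print(f"{num_epochs} epochs: ~{hours:.0f} h (~{hours / 24:.1f} days)")
# 400 epochs: ~147 h (~6.1 days); 50 epochs: ~18 h (~0.8 days)
```

That lines up with the "around a week" figure reported earlier for 8x V100.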

borisgin commented 4 years ago

Which version of the V100 did you use, 16 GB or 32 GB? And did you use the NVIDIA TensorFlow Docker image, or was it "native" TensorFlow?

Shujian2015 commented 4 years ago

The 16 GB version and the latest NVIDIA TensorFlow Docker image.

feroult commented 4 years ago

@Shujian2015 we're getting OOM errors with the default config using Docker + V100 16 GB, Horovod enabled.

Have you experienced OOMs?

Shujian2015 commented 4 years ago

@feroult, the default settings work well for me; I didn't experience any OOM issues. Although for Fisher-swbd, I had to reduce the batch size to avoid OOM.