SeanNaren / deepspeech.pytorch

Speech Recognition using DeepSpeech2.

Model convergence curves on LibriSpeech #415

Closed: snakers4 closed this issue 4 years ago

snakers4 commented 5 years ago

Will be posting some of our CNN-based model runs based on our fork

Our main aim is to find a way to train models faster on conventional hardware (1080Ti)

snakers4 commented 5 years ago

LS baseline test

Train and test_clean: CER around 9% out of the box
test_other: CER around 23-24%

V="cnn_256_7_librispeech_baseline_d02_aug02_spect02";
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
  train.py \
  --hidden-layers 6 \
  --aug-prob-8khz 0.0 --dropout 0.2 \
  --data-parallel --noise-prob 0.2 \
  --aug-prob-spect 0.2 \
  --visdom --id "$V" \
  --noise-dir "../data/augs/*.wav" \
  --learning-rate 5e-3 --momentum 0.5 \
  --rnn-type cnn --cnn-width 256 --hidden-size 2048 \
  --epochs 50 --cuda --batch-size 400 --val-batch-size 400  \
  --cache-dir ../data/cache --augment \
  --train-manifest ../data/manifests/libri_train_manifest_fx.csv \
  --train-val-manifest ../data/manifests/libri_test_clean_manifest_fx.csv \
  --val-manifest ../data/manifests/libri_test_other_manifest_fx.csv \
  --learning-anneal 1.0 --checkpoint-anneal 1.01 \
  --checkpoint --save-folder ../models/en/$V \
  --window hann --labels-path "labels.json" \
  --checkpoint-per-samples 250000 --num-workers 10

[convergence curve screenshot]

snakers4 commented 5 years ago

Wider CNN network (3x wider convolutions)

After 2 days and 38 epochs (greedy search):

I believe a CER of ~3.5-4% is achievable after 70 epochs on the clean part of the dataset. test_other is falling behind (more augmentations, more expressiveness, GLU? - see the sketch below).
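(For reference, the kind of GLU block meant above would look roughly like the sketch below; this is an illustrative wav2letter-style block, not code from the fork.)

```python
# Illustrative sketch of a gated linear unit (GLU) 1D conv block, the kind of
# thing "GLU?" refers to above (wav2letter-style); not code from the fork.
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=7, dropout=0.05):
        super().__init__()
        # The conv emits 2*out_ch channels; nn.GLU halves them by gating.
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.glu = nn.GLU(dim=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, channels, time)
        return self.dropout(self.glu(self.conv(x)))

block = GLUConv1d(161, 256)
out = block(torch.randn(4, 161, 300))  # -> torch.Size([4, 256, 300])
```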

# fatter LR model
V="cnn_768_7_librispeech_baseline_d005_aug015_spect015";
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
  train.py \
  --log-dir '../runs' \
  --hidden-layers 6 \
  --aug-prob-8khz 0.0 --dropout 0.05 \
  --data-parallel --noise-prob 0.15 \
  --aug-prob-spect 0.15 \
  --tensorboard --id "$V" \
  --noise-dir "../data/augs/*.wav" \
  --learning-rate 5e-3 --momentum 0.5 \
  --rnn-type cnn --cnn-width 768 --hidden-size 2048 \
  --epochs 50 --cuda --batch-size 200 --val-batch-size 200  \
  --cache-dir ../data/cache --augment \
  --train-manifest ../data/manifests/libri_train_manifest_fx.csv \
  --train-val-manifest ../data/manifests/libri_test_clean_manifest_fx.csv \
  --val-manifest ../data/manifests/libri_test_other_manifest_fx.csv \
  --learning-anneal 1.01 --checkpoint-anneal 1.00 \
  --checkpoint --save-folder ../models/en/$V \
  --window hann --labels-path "labels.json" \
  --checkpoint-per-samples 250000 --num-workers 10

[convergence curve screenshot]

Will be comparing with a GRU model shortly on the same pipeline. Preliminary hunch: the GRU will be better, but 3x slower.

snakers4 commented 5 years ago

What is interesting so far: one epoch with batch size 200 (50x4) for the CNN takes around the same time as one epoch with batch size 80 (20x4) for the GRU.

snakers4 commented 5 years ago

Comparing CNNs and RNNs (just took the posted settings for a GRU model):

The only remaining questions are: (i) will the GRU converge better / to a better minimum on the test_other set, and (ii) can we use a smaller batch size or a higher LR with CNNs (e.g. 25 instead of 50) to speed up convergence even more?

snakers4 commented 5 years ago

A GRU network on librispeech

Gave it 1+ days to converge. The only issue is that with our pipeline the performance on test_other is lagging by several pp. Training for 30-50 more epochs would probably help, but nevertheless:

TLDR: it looks like you can use an even larger CNN than the RNN, and it will train 2x faster with the same performance.

[GRU convergence curve screenshot]

Compare to the CNN. Note that at the same epoch the performance is the same; if we normalize by wall-clock time, the CNN is far ahead.

Dotted lines indicate the positions.

[CNN convergence curve screenshot]

So my advice to the community is to try the CNNs from our fork as a drop-in replacement for an RNN and post some results )

snakers4 commented 5 years ago

Worrying things

tugstugi commented 5 years ago

@snakers4 I never understood why the speech recognition guys always try/like mostly 1D convolutions. If you treat the spectrogram as a pure image, the speech recognition problem becomes the same as an OCR task: both try to predict a sequence of characters from an image.

As a fun project, I have tried an OCR-CRNN network with only 8M parameters on the LibriSpeech 960h dataset. In 15 hours on a single GPU you can get a WER of 14% on dev-clean. Here is a demo Colab notebook: https://colab.research.google.com/github/tugstugi/dl-colab-notebooks/blob/master/notebooks/CRNNSpeech2Text.ipynb

So instead of 1D conv, you should try 2D conv with small kernel sizes, like resnet/resnext/seresnext+LSTM. You can also apply normal image augmentations on the spectrogram. You can also drop the last RNN, but the OCR benchmarks show CNN+attention or CNN+RNN is better than pure CNN.
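(A minimal sketch of the "spectrogram as image" CRNN idea, purely for illustration; the layer sizes are placeholders and this is not the code from the Colab notebook above.)

```python
# Minimal sketch of a 2D-conv CRNN over a mel spectrogram (illustrative only,
# not the code from the Colab notebook linked above).
import torch
import torch.nn as nn

class SpectrogramCRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=29, rnn_hidden=256):
        super().__init__()
        # Small 2D conv backbone; a resnet/resnext-style stack could go here.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        # The frequency axis (reduced 4x by the strides) is folded into features.
        self.rnn = nn.LSTM(64 * (n_mels // 4), rnn_hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, spec):  # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                              # (batch, C, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time', feats)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)               # per-frame char log-probs

model = SpectrogramCRNN()
log_probs = model(torch.randn(2, 1, 64, 400))
# transpose to (time, batch, classes) before feeding nn.CTCLoss
```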

snakers4 commented 5 years ago

@buriy Would also be interested in your opinion on this.

@tugstugi

Many thanks for your idea, it looks very interesting and promising! In reality we are working on the Russian language, and we decided to test some of our models on an "easy" dataset like LS. We will definitely try your model!

I never understood why the speech recognition guys always try/like mostly 1D convolutions.

I cannot really speak for the "speech recognition guys", because I am not one of them, but in my case the logic that got us there was roughly the following:

mostly 1D convolutions.

I would also guess that this probably has something to do with MFCC features being widely used in speech. They are much more 1D than 2D.
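(A toy example of what the 1D-vs-2D framing means in practice; the shapes below are just for illustration.)

```python
# Toy illustration of the 1D-vs-2D framing: MFCC-like features are usually
# treated as channels of a 1D signal, while a spectrogram can be treated as
# a single-channel 2D image.
import torch
import torch.nn as nn

mfcc = torch.randn(1, 13, 300)      # (batch, n_mfcc, time)
spec = torch.randn(1, 1, 161, 300)  # (batch, 1, freq_bins, time)

conv1d = nn.Conv1d(13, 64, kernel_size=11, padding=5)  # slides along time only
conv2d = nn.Conv2d(1, 64, kernel_size=3, padding=1)    # slides along freq and time

print(conv1d(mfcc).shape)  # torch.Size([1, 64, 300])
print(conv2d(spec).shape)  # torch.Size([1, 64, 161, 300])
```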

You can also apply normal image augmentations on the spectrogram

Btw, we already do that =) We now do spectrogram masking, and will probably do some stretching (but how is that different from just changing the speed of the audio?). What specific augmentations would you suggest from your experience? What works best?
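(For reference, the spectrogram masking we mean is roughly the following; an illustrative sketch, not the actual augmentation code from our pipeline.)

```python
# Rough sketch of spectrogram masking (SpecAugment-style); illustrative only,
# not the exact augmentation code from our pipeline.
import numpy as np

def mask_spectrogram(spec, n_freq_masks=1, n_time_masks=1,
                     max_freq_width=8, max_time_width=20, fill_value=0.0):
    """spec: array of shape (freq_bins, time_steps); returns a masked copy."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        width = np.random.randint(0, max_freq_width + 1)
        start = np.random.randint(0, max(1, n_freq - width))
        spec[start:start + width, :] = fill_value  # zero out a band of frequencies
    for _ in range(n_time_masks):
        width = np.random.randint(0, max_time_width + 1)
        start = np.random.randint(0, max(1, n_time - width))
        spec[:, start:start + width] = fill_value  # zero out a stretch of time
    return spec

augmented = mask_spectrogram(np.random.rand(161, 400))
```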

only 8M parameters on the LibriSpeech 960h dataset. You can get in 15 hours on a single GPU a WER of 14% dev-clean.

This seems very promising compared to this

WER 14% ~ CER 5-7%, which is really good with 8M params!
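(For context, CER and WER here are the usual normalized edit-distance metrics; a quick reference sketch, not the repo's own decoder/metric code.)

```python
# Reference implementation of CER/WER via Levenshtein distance (illustrative;
# the repo computes these metrics in its own test/validation code).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return levenshtein(list(ref), list(hyp)) / max(1, len(ref))

def wer(ref, hyp):
    return levenshtein(ref.split(), hyp.split()) / max(1, len(ref.split()))

print(wer("the cat sat", "the cat sit"), cer("the cat sat", "the cat sit"))
```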

So instead of 1D conv, you should try 2D conv with small kernel sizes like resnet/resnext/seresnext+LSTM. You can also apply normal image augmentations on the spectrogram. You can also drop the last RNN, but the OCR benchmarks show CNN+attention or CNN+RNN is better than pure CNN.

I guess this would be a good start, right?

tugstugi commented 5 years ago
snakers4 commented 5 years ago

@vadimkantorov @buriy

Some benchmarks

Did some more benchmarks: network_bench.xlsx

So far, the short conclusions are the following:

Also, I did not turn off DataParallel (DP) when using 1 GPU. In my experience this has never caused a problem before.
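(To illustrate the point: wrapping the model in DataParallel with a single visible GPU just runs everything on that device, roughly as below; a sketch, not the exact logic in train.py.)

```python
# Sketch of why leaving nn.DataParallel on with a single GPU is harmless:
# with one visible device it simply runs the module on that device.
import torch
import torch.nn as nn

model = nn.Linear(161, 29).cuda()   # stand-in for the acoustic model
model = nn.DataParallel(model)      # works with 1 GPU or several
out = model(torch.randn(8, 161).cuda())
print(out.shape)                    # torch.Size([8, 29])
```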

@buriy Could you also post your findings / conclusions from comparing the different versions of the DS library on toy sets, as some form of table? I used my fork, but it looks like you compared different versions of DS and torchaudio out-of-the-box on small / toy datasets.

snakers4 commented 5 years ago

We have experimented with our dataset and models extensively and have come to the following conclusions, which we will confirm soon by fully training several models (we have done some preliminary tests for the majority of the below ideas).

TLDR

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.