SeanNaren / deepspeech.pytorch

Speech Recognition using DeepSpeech2.

Model convergence curves on LibriSpeech #415

Closed: snakers4 closed this issue 4 years ago

snakers4 commented 5 years ago

Will be posting some of our CNN-based model runs based on our fork

Our main aim is to find a way to train models faster on conventional hardware (1080Ti)

snakers4 commented 5 years ago

LS baseline test

Train and test_clean: CER around 9% out of the box
test_other: CER around 23-24%

V="cnn_256_7_librispeech_baseline_d02_aug02_spect02";
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
  train.py \
  --hidden-layers 6 \
  --aug-prob-8khz 0.0 --dropout 0.2 \
  --data-parallel --noise-prob 0.2 \
  --aug-prob-spect 0.2 \
  --visdom --id "$V" \
  --noise-dir "../data/augs/*.wav" \
  --learning-rate 5e-3 --momentum 0.5 \
  --rnn-type cnn --cnn-width 256 --hidden-size 2048 \
  --epochs 50 --cuda --batch-size 400 --val-batch-size 400  \
  --cache-dir ../data/cache --augment \
  --train-manifest ../data/manifests/libri_train_manifest_fx.csv \
  --train-val-manifest ../data/manifests/libri_test_clean_manifest_fx.csv \
  --val-manifest ../data/manifests/libri_test_other_manifest_fx.csv \
  --learning-anneal 1.0 --checkpoint-anneal 1.01 \
  --checkpoint --save-folder ../models/en/$V \
  --window hann --labels-path "labels.json" \
  --checkpoint-per-samples 250000 --num-workers 10

[convergence curve screenshot]

snakers4 commented 5 years ago

Wider CNN network (3x wider convolutions)

After 2 days and 38 epochs (greedy search):

I believe a CER of ~3.5-4% is achievable after 70 epochs on the clean part of the dataset. test_other is falling behind (more augmentations, more expressiveness, GLU? - see the sketch below).
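(For reference, the kind of GLU block meant above would look roughly like the sketch below; this is an illustrative wav2letter-style block, not code from the fork.)

```python
# Illustrative sketch of a gated linear unit (GLU) 1D conv block, the kind of
# thing "GLU?" refers to above (wav2letter-style); not code from the fork.
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=7, dropout=0.05):
        super().__init__()
        # The conv emits 2*out_ch channels; nn.GLU halves them by gating.
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.glu = nn.GLU(dim=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, channels, time)
        return self.dropout(self.glu(self.conv(x)))

block = GLUConv1d(161, 256)
out = block(torch.randn(4, 161, 300))  # -> torch.Size([4, 256, 300])
```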

# fatter LR model
V="cnn_768_7_librispeech_baseline_d005_aug015_spect015";
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
  train.py \
  --log-dir '../runs' \
  --hidden-layers 6 \
  --aug-prob-8khz 0.0 --dropout 0.05 \
  --data-parallel --noise-prob 0.15 \
  --aug-prob-spect 0.15 \
  --tensorboard --id "$V" \
  --noise-dir "../data/augs/*.wav" \
  --learning-rate 5e-3 --momentum 0.5 \
  --rnn-type cnn --cnn-width 768 --hidden-size 2048 \
  --epochs 50 --cuda --batch-size 200 --val-batch-size 200  \
  --cache-dir ../data/cache --augment \
  --train-manifest ../data/manifests/libri_train_manifest_fx.csv \
  --train-val-manifest ../data/manifests/libri_test_clean_manifest_fx.csv \
  --val-manifest ../data/manifests/libri_test_other_manifest_fx.csv \
  --learning-anneal 1.01 --checkpoint-anneal 1.00 \
  --checkpoint --save-folder ../models/en/$V \
  --window hann --labels-path "labels.json" \
  --checkpoint-per-samples 250000 --num-workers 10

[convergence curve screenshot]

Will be comparing with a GRU model shortly on the same pipeline. Preliminary hunch: the GRU will be better, but 3x slower.

snakers4 commented 5 years ago

What is interesting so far: one epoch with batch size 200 (50x4) for the CNN takes around the same time as one epoch with batch size 80 (20x4) for the GRU.

snakers4 commented 5 years ago

Comparing CNNs and RNNs (just took the posted settings for a GRU model):

The only remaining questions are: (i) will the GRU converge better / to a better minimum on the test_other set, and (ii) can we use a smaller batch size or a higher LR with CNNs (e.g. 25 instead of 50) to speed up convergence even more?

snakers4 commented 5 years ago

A GRU network on librispeech

Gave it 1+ days to converge. The only issue is that with our pipeline the performance on test_other is lagging by several pp. Training for 30-50 more epochs would probably help, but nevertheless:

TLDR: it looks like you can use an even larger CNN than the RNN, and it will train 2x faster with the same performance.

[GRU convergence curve screenshot]

Compare to the CNN. Note that at the same epoch the performance is the same; if we normalize by wall-clock time, the CNN is far ahead.

Dotted lines indicate the positions.

[CNN convergence curve screenshot]

So my advice to the community is to try the CNNs from our fork as a drop-in replacement for an RNN and post some results )

snakers4 commented 5 years ago

Worrying things

tugstugi commented 5 years ago

@snakers4 I never understood why the speech recognition guys always try/like mostly 1D convolutions. If you treat the spectrogram as a pure image, the speech recognition problem becomes the same as an OCR task: both try to predict a sequence of characters from an image.

As a fun project, I have tried an OCR-CRNN network with only 8M parameters on the LibriSpeech 960h dataset. In 15 hours on a single GPU you can get a WER of 14% on dev-clean. Here is a demo Colab notebook: https://colab.research.google.com/github/tugstugi/dl-colab-notebooks/blob/master/notebooks/CRNNSpeech2Text.ipynb

So instead of 1D conv, you should try 2D conv with small kernel sizes, like resnet/resnext/seresnext+LSTM. You can also apply normal image augmentations on the spectrogram. You can also drop the last RNN, but the OCR benchmarks show CNN+attention or CNN+RNN is better than pure CNN.
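(A minimal sketch of the "spectrogram as image" CRNN idea, purely for illustration; the layer sizes are placeholders and this is not the code from the Colab notebook above.)

```python
# Minimal sketch of a 2D-conv CRNN over a mel spectrogram (illustrative only,
# not the code from the Colab notebook linked above).
import torch
import torch.nn as nn

class SpectrogramCRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=29, rnn_hidden=256):
        super().__init__()
        # Small 2D conv backbone; a resnet/resnext-style stack could go here.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        # The frequency axis (reduced 4x by the strides) is folded into features.
        self.rnn = nn.LSTM(64 * (n_mels // 4), rnn_hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, spec):  # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                              # (batch, C, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time', feats)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)               # per-frame char log-probs

model = SpectrogramCRNN()
log_probs = model(torch.randn(2, 1, 64, 400))
# transpose to (time, batch, classes) before feeding nn.CTCLoss
```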

snakers4 commented 5 years ago

@buriy Would also be interested in your opinion on this.

@tugstugi

Many thanks for your idea, it looks very interesting and promising! In reality we are working on the Russian language, and we decided to test some of our models on an "easy" dataset like LS. We will definitely try your model!

I never understood why the speech recognition guys always try/like mostly 1D convolutions.

I cannot really speak for the "speech recognition guys", because I am not one of them, but in my case the logic that got us there was roughly the following:

mostly 1D convolutions.

I would also guess that this probably has something to do with MFCC features being widely used in speech. They are much more 1D than 2D.
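(A toy example of what the 1D-vs-2D framing means in practice; the shapes below are just for illustration.)

```python
# Toy illustration of the 1D-vs-2D framing: MFCC-like features are usually
# treated as channels of a 1D signal, while a spectrogram can be treated as
# a single-channel 2D image.
import torch
import torch.nn as nn

mfcc = torch.randn(1, 13, 300)      # (batch, n_mfcc, time)
spec = torch.randn(1, 1, 161, 300)  # (batch, 1, freq_bins, time)

conv1d = nn.Conv1d(13, 64, kernel_size=11, padding=5)  # slides along time only
conv2d = nn.Conv2d(1, 64, kernel_size=3, padding=1)    # slides along freq and time

print(conv1d(mfcc).shape)  # torch.Size([1, 64, 300])
print(conv2d(spec).shape)  # torch.Size([1, 64, 161, 300])
```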

You can also apply normal image augmentations on the spectrogram

Btw, we already do that =) We now do spectrogram masking, and will probably do some stretching (but how is that different from just changing the speed of the audio?). What specific augmentations would you suggest from your experience? What works best?
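(For reference, the spectrogram masking we mean is roughly the following; an illustrative sketch, not the actual augmentation code from our pipeline.)

```python
# Rough sketch of spectrogram masking (SpecAugment-style); illustrative only,
# not the exact augmentation code from our pipeline.
import numpy as np

def mask_spectrogram(spec, n_freq_masks=1, n_time_masks=1,
                     max_freq_width=8, max_time_width=20, fill_value=0.0):
    """spec: array of shape (freq_bins, time_steps); returns a masked copy."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        width = np.random.randint(0, max_freq_width + 1)
        start = np.random.randint(0, max(1, n_freq - width))
        spec[start:start + width, :] = fill_value  # zero out a band of frequencies
    for _ in range(n_time_masks):
        width = np.random.randint(0, max_time_width + 1)
        start = np.random.randint(0, max(1, n_time - width))
        spec[:, start:start + width] = fill_value  # zero out a stretch of time
    return spec

augmented = mask_spectrogram(np.random.rand(161, 400))
```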

only 8M parameters on the LibriSpeech 960h dataset. You can get in 15 hours on a single GPU a WER of 14% dev-clean.

This seems very promising compared to this

WER 14% ~ CER 5-7%, which is really good with 8M params!
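(For context, CER and WER here are the usual normalized edit-distance metrics; a quick reference sketch, not the repo's own decoder/metric code.)

```python
# Reference implementation of CER/WER via Levenshtein distance (illustrative;
# the repo computes these metrics in its own test/validation code).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return levenshtein(list(ref), list(hyp)) / max(1, len(ref))

def wer(ref, hyp):
    return levenshtein(ref.split(), hyp.split()) / max(1, len(ref.split()))

print(wer("the cat sat", "the cat sit"), cer("the cat sat", "the cat sit"))
```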

So instead of 1D conv, you should try 2D conv with small kernel sizes like resnet/resnext/seresnext+LSTM. You can also apply normal image augmentations on the spectrogram. You can also drop the last RNN, but the OCR benchmarks show CNN+attention or CNN+RNN is better than pure CNN.

I guess this would be a good start, right?

tugstugi commented 5 years ago
snakers4 commented 5 years ago

@vadimkantorov @buriy

Some benchmarks

Did some more benchmarks: network_bench.xlsx

So far, the short conclusions are the following:

Also, I did not turn off DataParallel (DP) when using 1 GPU. In my experience this has never caused a problem before.
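(To illustrate the point: wrapping the model in DataParallel with a single visible GPU just runs everything on that device, roughly as below; a sketch, not the exact logic in train.py.)

```python
# Sketch of why leaving nn.DataParallel on with a single GPU is harmless:
# with one visible device it simply runs the module on that device.
import torch
import torch.nn as nn

model = nn.Linear(161, 29).cuda()   # stand-in for the acoustic model
model = nn.DataParallel(model)      # works with 1 GPU or several
out = model(torch.randn(8, 161).cuda())
print(out.shape)                    # torch.Size([8, 29])
```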

@buriy Could you also post your findings / conclusions from comparing the different versions of the DS library on toy sets, as some form of table? I used my fork, but it looks like you compared different versions of DS and torchaudio out-of-the-box on small / toy datasets.

snakers4 commented 5 years ago

We have experimented with our dataset and models extensively and have come to the following conclusions, which we will confirm soon by fully training several models (we have done some preliminary tests for the majority of the below ideas).

TLDR

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.