Train and test_clean - CER around 9% out of the box; test_other - CER around 23-24%.
V="cnn_256_7_librispeech_baseline_d02_aug02_spect02";
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
train.py \
--hidden-layers 6 \
--aug-prob-8khz 0.0 --dropout 0.2 \
--data-parallel --noise-prob 0.2 \
--aug-prob-spect 0.2 \
--visdom --id "$V" \
--noise-dir "../data/augs/*.wav" \
--learning-rate 5e-3 --momentum 0.5 \
--rnn-type cnn --cnn-width 256 --hidden-size 2048 \
--epochs 50 --cuda --batch-size 400 --val-batch-size 400 \
--cache-dir ../data/cache --augment \
--train-manifest ../data/manifests/libri_train_manifest_fx.csv \
--train-val-manifest ../data/manifests/libri_test_clean_manifest_fx.csv \
--val-manifest ../data/manifests/libri_test_other_manifest_fx.csv \
--learning-anneal 1.0 --checkpoint-anneal 1.01 \
--checkpoint --save-folder ../models/en/$V \
--window hann --labels-path "labels.json" \
--checkpoint-per-samples 250000 --num-workers 10 \
After 2 days and 38 epochs (greedy search):
I believe a CER of ~3.5-4% is achievable after 70 epochs on the clean part of the dataset. Other is falling behind (more augs, more expressiveness, GLU?).
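For reference, a GLU over a 1D convolution (the kind of block we are considering for extra expressiveness) can be sketched as below. This is a generic PyTorch illustration, not the exact code in our fork; the channel sizes are made up.

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    """1D convolution followed by a gated linear unit (Dauphin et al., 2016).
    The conv emits 2x the target channels; one half gates the other."""
    def __init__(self, in_ch, out_ch, kernel_size, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                     # x: (batch, in_ch, time)
        a, b = self.conv(x).chunk(2, dim=1)   # split channels into value/gate
        return self.dropout(a * torch.sigmoid(b))

# e.g. a width-256 block over 161-bin spectrogram frames
layer = GLUConv1d(161, 256, kernel_size=7)
out = layer(torch.randn(4, 161, 100))        # -> (4, 256, 100)
```

The learned gates decide per time step how much signal to pass through, which is where the extra expressiveness would come from.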
```bash
# fatter LR model
V="cnn_768_7_librispeech_baseline_d005_aug015_spect015";
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
train.py \
--log-dir '../runs' \
--hidden-layers 6 \
--aug-prob-8khz 0.0 --dropout 0.05 \
--data-parallel --noise-prob 0.15 \
--aug-prob-spect 0.15 \
--tensorboard --id "$V" \
--noise-dir "../data/augs/*.wav" \
--learning-rate 5e-3 --momentum 0.5 \
--rnn-type cnn --cnn-width 768 --hidden-size 2048 \
--epochs 50 --cuda --batch-size 200 --val-batch-size 200 \
--cache-dir ../data/cache --augment \
--train-manifest ../data/manifests/libri_train_manifest_fx.csv \
--train-val-manifest ../data/manifests/libri_test_clean_manifest_fx.csv \
--val-manifest ../data/manifests/libri_test_other_manifest_fx.csv \
--learning-anneal 1.01 --checkpoint-anneal 1.00 \
--checkpoint --save-folder ../models/en/$V \
--window hann --labels-path "labels.json" \
--checkpoint-per-samples 250000 --num-workers 10
```
Will be comparing with a GRU model shortly on the same pipeline. Preliminary hunch - the GRU will be better, but 3x slower.
So far, what is interesting: one epoch with batch size 200 (50x4) for the CNN takes around the same time as batch size 80 (20x4) for the GRU.
The only questions remaining are: (i) will the GRU converge better / to a better minimum on the test_other set; (ii) can we use a smaller batch size or a higher LR with CNNs (e.g. 25 instead of 50) to speed up convergence even more?
Gave it 1+ days to converge. The only issue is that with our pipeline the performance on test_other is now lagging by several pp. Training for 30-50 more epochs would probably help, but nevertheless.
TLDR - it looks like you can use an even larger CNN than an RNN, and it will train 2x faster with the same performance.
GRU
Compare to the CNN. Note that at the same epoch the performance is the same; if we normalize by time, the CNN is far ahead.
Dotted lines indicate the positions.
So my advice to the community would be to try the CNNs from our fork as a drop-in replacement for an RNN, and probably post some results.
Worrying things

I would expect test_other to perform 2-3pp better when training. I understand that the model should fit the data for 2-3x more time, but whatever. The point is not to overfit on an "easy" dataset.

@snakers4 I never understood why the speech recognition guys always try/like mostly 1D convolutions. If you treat the spectrogram as a pure image, the speech recognition problem is the same as an OCR task. Both try to predict a sequence of characters from an image.
As a fun project, I tried an OCR CRNN network with only 8M parameters on the LibriSpeech 960h dataset. In 15 hours on a single GPU you can get a WER of 14% on dev-clean. Here is a demo Colab notebook: https://colab.research.google.com/github/tugstugi/dl-colab-notebooks/blob/master/notebooks/CRNNSpeech2Text.ipynb
So instead of 1D convs, you should try 2D convs with small kernel sizes, like resnet/resnext/seresnext + LSTM. You can also apply normal image augmentations to the spectrogram. You can also drop the last RNN, but the OCR benchmarks show that CNN+attention or CNN+RNN is better than a pure CNN.
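A minimal sketch of what such a 2D-conv CRNN could look like (illustrative PyTorch, not the exact network from the notebook; all sizes are made up):

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Treat the spectrogram as an image: small-kernel 2D convs downsample
    it, then a BiLSTM reads the result as a sequence for CTC decoding."""
    def __init__(self, n_mels=64, n_classes=29, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.rnn = nn.LSTM(64 * (n_mels // 4), hidden,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                    # (batch, 64, n_mels/4, time/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # frames as sequence
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)     # transpose to (T, N, C) for nn.CTCLoss

model = CRNN()
log_probs = model(torch.randn(2, 1, 64, 400))  # -> (2, 100, 29)
```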
@buriy I would also be interested in your opinion on this.
@tugstugi
Many thanks for your idea; it looks very interesting and promising! In reality we are working on the Russian language, and we decided to test some of our models on an "easy" dataset like LS. We will definitely try your model!
> I never understood why the speech recognition guys always try/like mostly 1D convolutions.
I cannot really speak for the "speech recognition guys", because I am not one of them, but in my case the logic that got us there was roughly the following:
> mostly 1D convolutions.
I would also guess that this probably has something to do with MFCC features being widely used in speech. They are much more 1D than 2D.
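To illustrate the point with a toy example (torchaudio here is just for demonstration, this is not our pipeline): MFCC gives only a few dozen coefficients per frame, so there is almost no "height" for a 2D kernel to exploit, and a 1D conv over time that treats the coefficients as channels is the natural fit.

```python
import torch
import torchaudio

# 1 second of fake 16 kHz audio -> MFCC frames of shape (n_mfcc, time)
wave = torch.randn(1, 16000)
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)(wave)
print(mfcc.shape)  # torch.Size([1, 40, 81])

# 1D conv over the time axis, treating the 40 coefficients as channels
conv = torch.nn.Conv1d(40, 256, kernel_size=7, padding=3)
out = conv(mfcc)   # (1, 256, 81)
```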
> You can also apply normal image augmentations on the spectrogram
Btw, we already do that =) Now we do spectrogram masking, and will probably do some stretching (but how is it different from just changing the speed of the audio?). What specific augs would you suggest from your experience? What works best?
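For reference, the masking we do is along the lines of the sketch below (generic SpecAugment-style masking, not our exact code; the mask sizes are arbitrary and assume the spectrogram is larger than the masks):

```python
import torch

def spec_mask(spec, max_freq=15, max_time=30, n_masks=2):
    """Zero out random frequency bands and time spans of a (freq, time)
    spectrogram, SpecAugment-style."""
    spec = spec.clone()
    n_freq, n_time = spec.shape
    for _ in range(n_masks):
        f = torch.randint(0, max_freq + 1, (1,)).item()
        f0 = torch.randint(0, n_freq - max_freq, (1,)).item()
        spec[f0:f0 + f, :] = 0.0                # mask a frequency band
        t = torch.randint(0, max_time + 1, (1,)).item()
        t0 = torch.randint(0, n_time - max_time, (1,)).item()
        spec[:, t0:t0 + t] = 0.0                # mask a time span
    return spec

augmented = spec_mask(torch.randn(161, 300))
```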
> only 8M parameters on the LibriSpeech 960h dataset. You can get in 15 hours on a single GPU a WER of 14% dev-clean.
This seems very promising compared to this. WER 14 roughly corresponds to CER 5-7, which is really good with 8M params!
> So instead of 1D conv, you should try 2D conv with small kernel sizes like resnet/resnext/seresnext+LSTM. You can also apply normal image augmentations on the spectrogram. You can also drop the last RNN, but the OCR benchmarks show CNN+attention or CNN+RNN is better than pure CNN.
I guess this would be a good start, right?
@vadimkantorov @buriy
Did some more benches: network_bench.xlsx
So far, the short conclusions are the following:
Also, I did not turn off DP (DataParallel) when using 1 GPU; in my experience this has never caused a problem.
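For reference, the usual guard looks something like this (a generic sketch, not the actual code in the fork):

```python
import torch
import torch.nn as nn

model = nn.Conv1d(161, 256, kernel_size=7)   # stand-in for the real network

if torch.cuda.is_available():
    model = model.cuda()
    # DataParallel with a single visible GPU is effectively a no-op apart
    # from some scatter/gather overhead, so leaving it on is usually safe;
    # skipping the wrap on 1 GPU just avoids that overhead.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
```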
@buriy Could you also post your findings / conclusions from comparing the different versions of the DS library on toy sets, as some form of table? I used my fork, but it looks like you compared different versions of DS and torchaudio out-of-the-box on small / toy datasets.
We have experimented extensively with our dataset and models and have come to the following conclusions, which we will confirm soon by fully training several models (we have done preliminary tests for the majority of the ideas below).
TLDR
- Will be posting some of our CNN-based model runs based on our fork
- Our main aim is to find a way to train models faster on conventional hardware (1080Ti)