flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

No TER improvement during training #385

Closed dmzubr closed 5 years ago

dmzubr commented 5 years ago

Hey there! Thanks a lot to all contributors of w2l for the great ASR system! I would like to get some tips from other w2l developers, and I hope to benefit from your experience and wisdom :)

Well, I have 3 sources of annotated audio data:

  1. A 90 h corpus of good-quality audio recorded at a sound studio (audio-books).
  2. 40 h of audio from different speakers recorded with headset microphones (and different mic models).
  3. 30 h of audio recorded in public places (pharmacies) on a stationary mic (Stelberry M-100). This data includes some bad-quality audio chunks; I had to listen to them more than 3 times in headphones to finally make them out.
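
(For reference, each of these corpora goes into w2l as a .lst list file, one sample per line: sample ID, audio path, duration in milliseconds, transcript. The IDs, paths, and transcripts below are made up for illustration:)

    pharmacy_0001 /data/pharmacy/rec_0001.wav 4520 two packs of aspirin please
    pharmacy_0002 /data/pharmacy/rec_0002.wav 3310 do you have this in stock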

Using network architecture from LibriSpeech recipe.

Got the following results from training: sources 1) and 2) train fine, but the problem appears when I try to train on the corpus from source 3) (30 h): TER shows no improvement at all.

In total, I've tried several runs with different combinations of lr, momentum, and gradnorm; the behavior and results are almost the same every single time.

Machine config in short: Ubuntu 18.04, GeForce GTX 1080 Ti on the 415.27 driver. The w2l installation is based on the Docker image, and all w2l and flashlight tests passed successfully. The successful trainings (on sources 1) and 2)) used the same hyper-parameters, both on this machine and on an Azure Tesla K80 x4.

My motivation for using source 3) was to adapt the model to the specific acoustic environment of a public place and to that mic model. Taking this into consideration, is my approach a reasonable one here? What do you think? Any tips on that? By the way, cleaning and denoising of the recordings have also been tried.

Do these results generally mean that the third data source is simply not of good enough quality to train on at all? I assumed that even in this case there would be some minor improvement in TER from epoch to epoch, but there is none.

I would really appreciate any tips and your thoughts on that, guys. Any minor advice would help a lot!

lunixbochs commented 5 years ago

I've been doing quite a bit of training for months using the librispeech conv glu recipe. You're using that one, right, not the one from the tutorial?

My general methodology for training on extremely large datasets is:

  1. Make a minimal train set (which I've been calling a dev set) that is only ~10% of each train set.
  2. Train on only this dev set with itersave=true until my TER starts to converge - if the TER gets into the 70-80% range, immediately stop training (if I overshoot, copy the 70-80% iter save model back to 001_model_last.bin before continuing training).
  3. Add in the full training set, and continue training.

Note: I mix in all of my datasets from the very beginning, and just use the subset to make it start to converge earlier, as I've had issues converging quickly on very large datasets.
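
For concreteness, here is that workflow sketched as shell commands. Train's train/continue modes and the --itersave flag are wav2letter's; the paths, run names, snapshot numbering, and the exact way of pointing step 3 at the full list are illustrative, so check them against your version:

    # 1) Subsample ~10% of the full train list into a small starter set
    shuf -n "$(( $(wc -l < lists/train-full.lst) / 10 ))" \
        lists/train-full.lst > lists/train-small.lst

    # 2) Train on the subset, snapshotting each epoch (--itersave=true)
    wav2letter/build/Train train --flagsfile train.cfg \
        --train=lists/train-small.lst --itersave=true \
        --rundir=runs --runname=mymodel

    # If TER overshoots the 70-80% window, restore the snapshot you want
    # before continuing (snapshot file names here are illustrative):
    #   cp runs/mymodel/001_model_iter_010.bin runs/mymodel/001_model_last.bin

    # 3) Switch to the full list and continue from the last saved model
    #    (if your version's continue mode ignores flag overrides, edit the
    #    config stored in the run directory instead)
    wav2letter/build/Train continue runs/mymodel --train=lists/train-full.lst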

dmzubr commented 5 years ago

> I've been doing quite a bit of training for months using the librispeech conv glu recipe. You're using that one, right, not the one from the tutorial?

Yes, I was using the architecture from the conv glu recipe.

> Note: I mix in all of my datasets from the very beginning, and just use the subset to make it start to converge earlier, as I've had issues converging quickly on very large datasets.

I checked this approach too, but had no success even with it.

I ran another experiment: checking whether training converges with other architectures on my data. Most of them did converge, so I decided to try creating a compromise arch and run some tests with it.
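
For reference, a trimmed-down sketch of what such a compromise arch might look like, in the .arch token convention the conv glu recipe uses (WN = weight norm, C = 1-D convolution with input channels, output channels, kernel width, stride, padding; GLU halves the given dimension; DO = dropout; RO = reorder; L = linear). The layer count and widths are illustrative, not my actual file:

    WN 3 C NFEAT 400 13 1 170
    GLU 2
    DO 0.2
    WN 3 C 200 440 14 1 0
    GLU 2
    DO 0.2
    RO 2 0 3 1
    WN 0 L 220 1000
    GLU 0
    DO 0.25
    WN 0 L 500 NLABEL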

I will write here when I get some results.

Anyway, thanks for sharing your experience and good luck with your experiments!

adamchant commented 4 years ago

Hey @lunixbochs, I was running the tutorial experiment all this while but was not getting satisfactory results, and I wanted to jump to more advanced recipes, so I started running the librispeech conv glu recipe. I have a few doubts, though. I ran it on the 100-hour LibriSpeech set, and for one of the epochs the stats came out as:

    I0406 13:29:30.461580 9837 Train.cpp:340] epoch: 1 | nupdates: 2 | lr: 0.000150 | lrcriterion: 0.000002 | runtime: 00:00:10 | bch(ms): 5432.09 | smp(ms): 95.11 | fwd(ms): 1899.35 | crit-fwd(ms): 92.46 | bwd(ms): 1853.59 | optim(ms): 1110.92 | loss: 26.30999 | train-TER: 99.19 | train-WER: 97.60 | dev-clean-loss: 26.33119 | dev-clean-TER: 99.09 | dev-clean-WER: 98.37 | avg-isz: 742 | avg-tsz: 125 | max-tsz: 202 | hrs: 0.02 | thrpt(sec/sec): 5.46

I'm a little skeptical because it says hrs: 0.02, but the .lst file I am using is 'train-clean-100.lst'. Also, the runtime isn't consistent: it isn't actually taking 10 seconds to run the epoch. Can you tell me if it's training normally, or is there some issue? Thanks

lunixbochs commented 4 years ago

That’s called linseg. Ignore it. Wait a few epochs before judging it fully. Also you should know that only training on 100h won’t develop a very good model.

adamchant commented 4 years ago

Yup, I am running this only to test whether it's working fine; I have another data set... So I can ignore the hrs: 0.02 value, right? I am only worried that it's for some reason considering only 0.02 hours of the data. Thanks, I will train it on the actual data and see how it fares...

lunixbochs commented 4 years ago

Linseg is a single step that runs on a small amount of data. Ignore it.
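
For context, that step comes from the --linseg flag in the ASG training config; a minimal train.cfg excerpt (the value of 1 is inferred from the "(for first 1 updates)" lines in your logs, so check your own config):

    # train.cfg excerpt; values inferred from the logs above
    --criterion=asg
    --linseg=1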

adamchant commented 4 years ago

Hey @lunixbochs, how much time would you estimate one epoch to take? I'm running on 700 hours of data, on a single 16 GB Tesla T4 GPU. It's been almost a day and a half and it's still in its first epoch. Train logs are:

    I0407 19:53:27.880614 17358 Train.cpp:250] [Network Params: 208863942]
    I0407 19:53:27.880620 17358 Train.cpp:251] [Criterion] AutoSegmentationCriterion
    I0407 19:53:28.021339 17358 Train.cpp:259] [Network Optimizer] SGD (momentum=0.8)
    I0407 19:53:28.021375 17358 Train.cpp:260] [Criterion Optimizer] SGD
    I0407 19:53:28.021605 17358 Train.cpp:274] [Criterion] LinearSegmentationCriterion (for first 1 updates)
    I0407 19:53:28.030750 17358 Train.cpp:287] [Network Optimizer] SGD (momentum=0.8) (for first 1 updates)
    I0407 19:53:28.030774 17358 Train.cpp:290] [Criterion Optimizer] SGD (for first 1 updates)
    I0407 19:53:30.832384 17358 W2lListFilesDataset.cpp:141] 527562 files found.
    I0407 19:53:30.842236 17358 Utils.cpp:102] Filtered 0/527562 samples
    I0407 19:53:30.890646 17358 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 131891
    I0407 19:53:31.209738 17358 W2lListFilesDataset.cpp:141] 57482 files found.
    I0407 19:53:31.210892 17358 Utils.cpp:102] Filtered 0/57482 samples
    I0407 19:53:31.215188 17358 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 14371
    I0407 19:53:31.228219 17358 Train.cpp:557] Shuffling trainset
    I0407 19:53:31.245193 17358 Train.cpp:564] Epoch 1 started!
    I0407 20:54:27.380899 17358 Train.cpp:342] epoch: 1 | nupdates: 2 | lr: 0.000150 | lrcriterion: 0.000002 | runtime: 00:00:08 | bch(ms): 4241.80 | smp(ms): 58.08 | fwd(ms): 1566.25 | crit-fwd(ms): 67.73 | bwd(ms): 1484.28 | optim(ms): 790.92 | loss: 21.01307 | train-TER: 98.41 | train-WER: 99.00 | /home/developer/speech_corpus/english_asr_corpus/lists/test_dummy_wo_bab.lst-loss: 19.99951 | /home/developer/speech_corpus/english_asr_corpus/lists/test_dummy_wo_bab.lst-TER: 98.31 | /home/developer/speech_corpus/english_asr_corpus/lists/test_dummy_wo_bab.lst-WER: 97.58 | avg-isz: 569 | avg-tsz: 070 | max-tsz: 080 | hrs: 0.01 | thrpt(sec/sec): 5.37
    I0407 20:54:31.913743 17358 Train.cpp:700] Finished LinSeg
    I0407 20:54:31.914067 17358 Train.cpp:557] Shuffling trainset
    I0407 20:54:31.924686 17358 Train.cpp:564] Epoch 2 started!

And it is stuck here... I had another doubt: in the default train.cfg given, there's no --iter flag, so how many epochs is it running? Thanks

Update: It finished one epoch in two days...
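
(A rough cross-check against the numbers in the log above: the train set builds 131891 batches per epoch, and two days is about 172800 seconds, i.e. roughly 172800 / 131891 ≈ 1.3 s per batch, so a two-day epoch on a single 16 GB T4 with a ~209M-parameter model (Network Params: 208863942) is at least internally consistent. On --iter: as far as I know it caps the number of training epochs/iterations and defaults to a very large value, so with no --iter in train.cfg training effectively runs until stopped; that is from memory, so check your version's flag help for the exact default.)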