flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

unstable training #584

Open mohamad-hasan-sohan-ajini opened 4 years ago

mohamad-hasan-sohan-ajini commented 4 years ago

Hi

I used the Mozilla Common Voice dataset (the whole validated Persian set, about 211 hours) to train the SOTA models. I used the Librispeech SOTA config files almost as-is (with small changes: no word pieces, the surround flag used instead, and distributed training disabled) to train ResNet and TDS models with the CTC loss. I also reduced the ResNet channel counts and the TDS hidden sizes, since the dataset is about 1/5 the size of Librispeech, and I removed data augmentation.

The training goes well for a while, and then the loss/TER/WER becomes unstable and shoots up:

(screenshot: training curves for both models)

The ResNet model (orange curve) becomes unstable after 5 days of training, and the TDS model (blue curve) after about 1 day. Is there a known cause for this? Is the loss or the layer norm computationally unstable? And how can I avoid this kind of instability?

The TDS model arch:

V -1 NFEAT 1 0
C2 1 10 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.1 2400
TDS 10 21 80 0.1 2400
C2 10 14 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
C2 14 18 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.2 3600
TDS 18 21 80 0.2 3600
TDS 18 21 80 0.25 3600
TDS 18 21 80 0.25 3600
TDS 18 21 80 0.25 3600
TDS 18 21 80 0.25 3600
V 0 1440 1 0
RO 1 0 3 2
L 1440 NLABEL

and config file parameters:

--batchsize=4
--lr=0.3
--momentum=0.5
--maxgradnorm=1
--onorm=target
--sqnorm=true
--mfsc=true
--nthread=10
--criterion=ctc
--memstepsize=8338608
#--wordseparator=_
#--usewordpiece=true
--surround=|
--filterbanks=80
--gamma=0.5
#--enable_distributed=true
--iter=1500
--stepsize=200
--framesizems=30
--framestridems=10
--seed=2
--reportiters=1000
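
For intuition, the --lr/--gamma/--stepsize trio reads like a step-decay schedule. A minimal sketch of the resulting learning rate, assuming gamma multiplies the LR every stepsize epochs (an assumption to verify against the wav2letter flag documentation):

# Step-decay schedule implied by --lr=0.3, --gamma=0.5, --stepsize=200,
# assuming the LR is multiplied by gamma every `stepsize` epochs (check
# this reading against the wav2letter flag docs).
base_lr, gamma, stepsize = 0.3, 0.5, 200

def lr_at(epoch: int) -> float:
    """Learning rate in effect at a given epoch under step decay."""
    return base_lr * gamma ** (epoch // stepsize)

for epoch in (0, 199, 200, 400, 600):
    print(f"epoch {epoch:4d}: lr = {lr_at(epoch):.4f}")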

Regards.

jkkj1630 commented 4 years ago

I got the same issue. Tuning the learning rate seems to help, but it is still not resolved.

vineelpratap commented 4 years ago

Hi, we have not seen this issue lately. As a sanity check, could you run an experiment where you filter out audio samples shorter than 1 second and targets shorter than 5 tokens?
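
A minimal sketch of such a filter, assuming the standard wav2letter list format of one sample per line (id, audio path, duration in milliseconds, transcription); adjust the column layout, units, and token counting (whitespace tokens vs. characters) to your own .lst files:

# Minimal sketch: drop samples shorter than 1 s or with targets shorter
# than 5 tokens from a wav2letter-style list file. Assumes one sample per
# line in the form "id path duration transcription..." with duration in
# milliseconds; verify the column layout and units against your .lst files.
import sys

MIN_DURATION_MS = 1000
MIN_TARGET_LEN = 5

def keep(line: str) -> bool:
    parts = line.split()
    if len(parts) < 4:
        return False
    duration_ms = float(parts[2])
    target_len = len(parts[3:])  # token count; use character count for letter targets
    return duration_ms >= MIN_DURATION_MS and target_len >= MIN_TARGET_LEN

if __name__ == "__main__":
    # Usage: python filter_list.py train.lst > train_filtered.lst
    with open(sys.argv[1]) as f:
        for line in f:
            if keep(line):
                sys.stdout.write(line)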

vineelpratap commented 4 years ago

Also, could you let us know the input and target size distributions (min, max, avg, stddev)?
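
A minimal sketch for computing those statistics, under the same list-format assumptions as the filter above (duration in milliseconds in the third column, whitespace-tokenized transcription after it):

# Minimal sketch: report min/max/avg/stddev of input durations (seconds)
# and target lengths from a wav2letter-style list file.
import statistics
import sys

def summarize(values, name):
    print(f"{name}: min={min(values):.3f} max={max(values):.3f} "
          f"avg={statistics.mean(values):.3f} std={statistics.pstdev(values):.3f}")

if __name__ == "__main__":
    durations, target_lens = [], []
    with open(sys.argv[1]) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue
            durations.append(float(parts[2]) / 1000.0)  # ms -> seconds
            target_lens.append(len(parts[3:]))           # tokens
    summarize(durations, "input duration (s)")
    summarize(target_lens, "target length (tokens)")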

jkkj1630 commented 4 years ago

My wav samples are 5-30 seconds long.

mohamad-hasan-sohan-ajini commented 4 years ago

Also, could you let us know the input and target size distributions (min, max, avg, stddev)?

Ah, I forgot to filter the Common Voice data (as I did for our own dataset) to have bounded lengths. There are 2 samples longer than 15 seconds (19.824 s and 24.864 s), and they may cause training instability.

Input size distribution (seconds): min: 0.744, max: 24.864, avg: 3.950, std: 1.534

Target size distribution: min: 2, max: 197, avg: 31.389, std: 17.039

But these samples have been in the dataset since the first epoch. It is strange that they would make the network unstable only after about 200 epochs!

jkkj1630 commented 4 years ago

I tried training on 18,000 hours of Chinese data from many scenarios, including phone recordings, news subtitles, TTS voice-conversion synthesis, standard read speech, wake words, multi-person conversations, and meeting recordings. The labeling accuracy of these data is > 95%. I am using https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/streaming_convnets/librispeech/am_500ms_future_context.arch. Do I need to make any changes to this original sample arch file?

mohamad-hasan-sohan-ajini commented 4 years ago

@jkkj1630 I am currently training a model with the same arch file and have not seen any instability yet. As the issue is not reproducible, I'll close it. But there is definitely some instability issue, since a 25-second-long audio clip should not make training unstable.

mohamad-hasan-sohan-ajini commented 4 years ago

I get the same issue when training with duration-filtered files:

(screenshots: training curves showing the instability)

The training process stops with a NaN loss:

F0424 23:20:16.014863 218 Train.cpp:564] Loss has NaN values. Samples - common_voice_fa_19219740.mp3_norm,common_voice_fa_19219740.mp3_lowgain

The duration of both files is 4.632 seconds, so the problem seems to be agnostic to waveform duration and to happen unpredictably.
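
When specific files trigger the NaN, a quick waveform sanity check can rule out corrupted, silent, or heavily clipped audio. A sketch, assuming librosa is available and can decode the flagged samples; the paths below are placeholders for wherever those files actually live:

# Minimal sketch: inspect the samples named in the NaN-loss error for
# obvious pathologies (non-finite values, all-zero/silent audio, clipping).
# Assumes librosa is installed; the paths are placeholders.
import librosa
import numpy as np

flagged = [
    "common_voice_fa_19219740.mp3_norm",     # placeholder path
    "common_voice_fa_19219740.mp3_lowgain",  # placeholder path
]

for path in flagged:
    audio, sr = librosa.load(path, sr=None)  # keep the native sample rate
    print(
        f"{path}: dur={len(audio) / sr:.3f}s "
        f"finite={np.isfinite(audio).all()} "
        f"peak={np.max(np.abs(audio)):.4f} "
        f"rms={np.sqrt(np.mean(audio ** 2)):.6f}"
    )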

junaedifahmi commented 4 years ago

Hey, sorry for the unrelated question: could you tell me how you use TensorBoard to monitor the loss? Thanks in advance. I also face the same problem of NaN loss values.

tlikhomanenko commented 4 years ago

@juunnn, some time ago we shared a script to convert the w2l logs into TensorBoard format here: https://github.com/facebookresearch/wav2letter/issues/528.
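
For reference, a minimal sketch of the same idea, assuming a whitespace-separated perf log with a header row; the file name and column handling are assumptions to adapt to your wav2letter version (the script in #528 is the authoritative one):

# Minimal sketch: push scalar columns of a wav2letter perf log into
# TensorBoard. Assumes the log is whitespace-separated with a header row;
# the path below is a placeholder.
import pandas as pd
from torch.utils.tensorboard import SummaryWriter

LOG_PATH = "rundir/001_perf"  # placeholder path to the training log
df = pd.read_csv(LOG_PATH, sep=r"\s+")

writer = SummaryWriter(log_dir="tb_logs")
for step, row in df.iterrows():
    for col, value in row.items():
        try:
            writer.add_scalar(col, float(value), step)  # skip non-numeric columns
        except (TypeError, ValueError):
            continue
writer.close()
# Then view with: tensorboard --logdir tb_logs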

junaedifahmi commented 4 years ago

Thank you, @tlikhomanenko!

ali-r commented 2 years ago

The optimizer is important for training stability. SGD with gradient clipping is a very stable option.
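
For reference, a minimal sketch of that pattern in a generic PyTorch training step (not wav2letter's own training loop; in the flags above, the equivalent clipping knob is --maxgradnorm):

# Minimal sketch of SGD + gradient clipping in a generic PyTorch training step.
import torch

model = torch.nn.Linear(80, 29)  # stand-in for the acoustic model
optimizer = torch.optim.SGD(model.parameters(), lr=0.3, momentum=0.5)

def train_step(features, targets, loss_fn, max_grad_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    # Clip the global gradient norm before the update to damp rare
    # exploding-gradient steps that can otherwise destabilize training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()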