mohamad-hasan-sohan-ajini opened this issue 4 years ago
I got the same issue. I tried adjusting the learning rate, which seems to improve things, but it is still not resolved.
Hi, we have not seen this issue lately. As a sanity check, could you run an experiment where you filter out audio samples shorter than 1 second and target lengths shorter than 5?
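The suggested sanity-check filter might look like the minimal sketch below. The `(sample_id, duration_sec, target_len)` tuple layout is a hypothetical stand-in for whatever metadata your dataset list actually provides, not wav2letter's list format:

```python
# Sanity-check filter: drop audio shorter than 1 s and targets shorter than 5 units.
# The (sample_id, duration_sec, target_len) layout here is illustrative only.

def filter_samples(samples, min_dur=1.0, min_target=5):
    """Keep only samples with duration >= min_dur seconds and target length >= min_target."""
    return [s for s in samples if s[1] >= min_dur and s[2] >= min_target]

samples = [
    ("utt1", 0.7, 12),   # too short: dropped
    ("utt2", 3.9, 31),   # kept
    ("utt3", 4.6, 3),    # target too short: dropped
]
print(filter_samples(samples))  # keeps only the "utt2" entry
```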
Also, could you let us know the input and target size distributions - min, max, avg, stddev?
Sample wavs are 5-30 sec.
> Also, could you let us know the input and target size distribution - min, max, avg, stddev
Ah, I forgot to filter the Common Voice data (as I did for our dataset) to a bounded length. So there are 2 samples longer than 15 seconds (19.824 s and 24.864 s), and they may cause training instability.
input size distribution is as follows: min: 0.744, max: 24.864, avg: 3.950, std: 1.534
target size distribution is as follows: min: 2, max: 197, avg: 31.389, std: 17.039
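Distribution numbers like the ones above can be computed with Python's `statistics` module; a minimal sketch (the `durations` list is illustrative, and whether to use population or sample stddev is a choice - `pstdev` is used here):

```python
import statistics

def describe(values):
    """Return (min, max, mean, population stddev) for a list of numbers."""
    return (min(values), max(values),
            statistics.mean(values), statistics.pstdev(values))

durations = [0.744, 3.2, 4.1, 2.8, 24.864]  # seconds; illustrative values only
mn, mx, avg, std = describe(durations)
print(f"min: {mn:.3f}, max: {mx:.3f}, avg: {avg:.3f}, std: {std:.3f}")
```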
But these samples have been in the dataset since the first epoch. It is weird that they make the network unstable only after about 200 epochs!
I tried training on 18,000 hours of Chinese data from many domains, including phone recordings, news subtitles, TTS voice-conversion synthesis, standard read speech, wake words, multi-person conversations, and meeting recordings. The labeling accuracy of these data is > 95%. I am using https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/streaming_convnets/librispeech/am_500ms_future_context.arch - do I need to make any changes to this original arch file?
@jkkj1630 Currently I'm training a model with the same arch file and don't see any instability yet. As the issue is not reproducible, I'll close it. But there is definitely some instability issue, because a 25-second audio clip should not make training unstable.
I get the same issue when training with duration filtered files:
Training process is stopped with NaN loss:
F0424 23:20:16.014863 218 Train.cpp:564] Loss has NaN values. Samples - common_voice_fa_19219740.mp3_norm,common_voice_fa_19219740.mp3_lowgain
while the duration of both files is 4.632 seconds. It seems the problem is agnostic to waveform duration and happens unpredictably.
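A common generic mitigation for this kind of intermittent failure (not a wav2letter feature, just a training-loop guard sketched here for illustration) is to skip batches whose loss is non-finite and log the offending sample IDs, rather than aborting the run:

```python
import math

def step_is_valid(loss, sample_ids):
    """Return True if loss is finite; otherwise log the batch and signal a skip."""
    if not math.isfinite(loss):
        print(f"Skipping batch with non-finite loss {loss}: {','.join(sample_ids)}")
        return False
    return True

# Example: a NaN loss is flagged instead of crashing the run.
ok = step_is_valid(float("nan"), ["common_voice_fa_19219740.mp3_norm"])
print(ok)  # → False
```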
Hey, sorry for the unrelated question: how do you use TensorBoard for monitoring the loss? Thank you for your answer. I also face the same problem with NaN loss values.
@juunnn, some time ago we shared the script to convert the w2l logs into tensorboard format here https://github.com/facebookresearch/wav2letter/issues/528.
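For reference, such a conversion mostly amounts to parsing (epoch, loss) pairs out of each log line and feeding them to a TensorBoard writer. A minimal parsing sketch, assuming a hypothetical `epoch: N ... loss: X` line format (the real wav2letter log layout may differ - see the linked issue for the actual script):

```python
import re

# Hypothetical log-line format; consult issue #528 for the real w2l log layout.
LINE_RE = re.compile(r"epoch:\s*(\d+).*?loss:\s*([\d.]+)")

def parse_log(lines):
    """Extract (epoch, loss) pairs, ready to pass to a TensorBoard scalar writer."""
    pairs = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            pairs.append((int(m.group(1)), float(m.group(2))))
    return pairs

log = ["epoch: 1 | lr: 0.4 | loss: 31.2", "epoch: 2 | lr: 0.4 | loss: 27.9"]
print(parse_log(log))  # → [(1, 31.2), (2, 27.9)]
```

The resulting pairs can then be written out with, e.g., `torch.utils.tensorboard.SummaryWriter.add_scalar`.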
thank you @tlikhomanenko
The optimizer is important for training stability. SGD + gradient clipping is a very stable option.
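For context, "grad clip" here means rescaling the gradient whenever its global L2 norm exceeds a threshold, then taking a plain SGD step. A framework-free sketch on scalar parameters (an illustration of the technique, not wav2letter's actual implementation):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return list(grads)

def sgd_step(params, grads, lr=0.1, max_norm=1.0):
    """One SGD update with gradient clipping (scalar params for illustration)."""
    clipped = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]

# A huge gradient (norm 500) is scaled down to norm 1 before the update,
# so the parameters move a bounded distance instead of blowing up.
print(sgd_step([0.5, -0.2], [300.0, 400.0]))
```

In frameworks this corresponds to e.g. PyTorch's `torch.nn.utils.clip_grad_norm_` applied between `backward()` and `optimizer.step()`.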
Hi
I used the Mozilla Common Voice dataset (the whole validated set for Persian, about 211 hours) to train the SOTA models. I mostly used the Librispeech SOTA config files (with small changes: avoiding word pieces and using the surround flag instead, plus disabling distributed training) to train ResNet and TDS models with CTC loss. I also reduced the ResNet channels and the TDS units' hidden sizes, since the dataset is about 1/5 the size of Librispeech, and I removed data augmentation.
The training process goes well for a while, and then the loss/TER/WER become unstable and shoot up:
The ResNet model (orange curve) becomes unstable after 5 days of training and the TDS model (blue curve) after about 1 day. Is there any known cause for this problem? Is the loss or the layer norm computationally unstable? And how can I avoid this kind of instability?
The TDS model arch:
and config file parameters:
Regards.