For Korean you would want to measure CER (character error rate, not word error rate) - add model.use_cer = True in the model config.
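For reference, a minimal sketch of making that change in Python before building the model (the config path below is a placeholder; use whichever Conformer-CTC config you are training with):

```python
from omegaconf import OmegaConf

# Load the Conformer-CTC config you are training with (path is a placeholder).
cfg = OmegaConf.load("conf/conformer_ctc_bpe.yaml")

# Report CER instead of WER - appropriate for Korean, where whitespace-based
# "words" are not a good unit for error measurement.
cfg.model.use_cer = True
```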
What is the average audio duration? If the average audio duration is less than 10 seconds, you want to reduce the SpecAugment time masks from 10 to 2, e.g. as in the sketch below.
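Continuing the `cfg` object from the sketch above, and assuming the default `spec_augment` section is present in the config:

```python
# Fewer time masks for short utterances: with the default 10 masks, short
# clips can have most of their frames masked out, which stalls learning.
cfg.model.spec_augment.time_masks = 2
```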
To speed up convergence you should load the encoder weights from a pretrained English model - an example of how to load the pretrained checkpoint and partially load its weights is in the tutorial for ASR finetuning on another language.
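As a rough sketch of the idea (the checkpoint name below is just an example, not exact tutorial code, and `cfg` is the config prepared in the sketches above):

```python
import nemo.collections.asr as nemo_asr

# A pretrained English Conformer-CTC checkpoint (name is an example; pick the
# English model you want to start from).
english_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_medium")

# The new model built from your own (Korean-tokenizer) config.
model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model)

# Copy only the encoder weights; the decoder is vocabulary-specific and
# has to be trained from scratch for the new language.
model.encoder.load_state_dict(english_model.encoder.state_dict(), strict=False)
```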
Thanks for the advice! I'll try the tutorial
The model seems to converge at a much faster rate than before. And it has started to babble!
But it still seems to take a lot of time :disappointed_relieved:
Hi @titu1994, I'm currently trying out CTC-Conformer on Japanese and I set model.use_cer=True. Despite this, the model summary still shows a WERBPE type in the summary table (under _wer). What could possibly be the issue? I read somewhere that the attribute is _wer, so should it be model._wer.use_cer = True instead? I tried both but the summary table doesn't change to CER. Thanks for your help!
The name of the class won't change, nor will the log name. Sadly, switching the name of the log would crash the exp manager unless you were careful to update the metric being monitored.
But rest assured, if your config has cfg.model.use_cer = True, then it is computing CER. You can also set it in code after the model has been built, as shown in the tutorial - however, there is a difference between CTC and RNNT models: CTC has model._wer, RNNT has model.wer.
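A minimal sketch of the in-code route (the checkpoint name is just an example so there is something instantiated; in practice this would be the model you built from your own config):

```python
import nemo.collections.asr as nemo_asr

# Example instance only; substitute the model you are actually training.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")

# CTC models keep the metric in the private attribute `_wer`,
# RNNT models expose it publicly as `wer`.
if hasattr(model, "_wer"):
    model._wer.use_cer = True   # CTC
else:
    model.wer.use_cer = True    # RNNT
```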
When you use model.summarize() you can see the modules that exist, and it will show how to access the WERBPE metric (via _wer or wer).
We'd recommend the config method, simply so that after training and inference the model config tells the restored model to use CER instead of WER again. If you make the change in code only, it will go back to WER after restoration and give different results.
I was worried that it wasn't computing CER when I saw the first-epoch predictions being blanks, but after leaving it for several hours it's now putting out reasonable outputs and the losses and "WER" are decreasing. Thank you!
Right, the initial "blank" token prediction is common in ASR - the loss encourages the model to first predict blank at every timestep and then replace blanks with actual subwords/chars.
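To make the blank behaviour concrete, here is a tiny self-contained illustration (not NeMo code) of how CTC greedy decoding merges repeats and drops blanks, which is why an "all blank" model decodes to empty strings early in training:

```python
BLANK = "_"

def ctc_greedy_collapse(frame_labels):
    """Collapse repeated frame labels, then remove blank symbols."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:
            collapsed.append(label)
        prev = label
    return "".join(l for l in collapsed if l != BLANK)

print(ctc_greedy_collapse(list("____________")))    # -> ""      (early training)
print(ctc_greedy_collapse(list("_hh_e_ll_ll_o_")))  # -> "hello" (later)
```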
@hslee4716 Did you solve the problem? I'm training on a Korean dataset with the same Conformer-CTC model, and my training loss and WER are exactly the same as yours: the loss doesn't decrease below 100 and the WER stays near 100%.
@wonwooo It's been a while so I don't remember much, but as @titu1994 said at the time, I changed WER to CER, followed the tutorial faithfully, and trained with an appropriate learning rate and a large enough batch size. I tested it on a temporary server back then, so I can't check the details now. Sorry..
I am trying to train the Conformer-CTC model with the KsponSpeech dataset, which is a Korean speech dataset.
KsponSpeech - 1000 hours / 123 GB / 630,000 PCM audio files (fs=16000 / sample_width=2 / channels=1)
On the first try, with the same settings as the existing Conformer-CTC configuration guidelines, I set a vocab of only about 5000 tokens for Korean and trained. It took about 4 hours per epoch, and after about 5 epochs the loss hardly decreased and the WER stayed fixed at almost 100%. On inference, only spaces or the "." character were output.
On the second try, as a test, the dataset was reduced to about 6300 files and the vocab size was reduced to 3200, but after training for about 20 hours (460 epochs) the model never seemed to converge.
Since then, I have kept testing by changing the hyperparameters little by little, but a similar phenomenon keeps occurring.
Is there anything you can guess might be the problem?
Is it simply a lack of training time?
Also, I would like to know how much the learning rate affects model training when using the Noam scheduler.
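For context on that last question, below is a small sketch of the standard Noam schedule from "Attention Is All You Need" (NeMo's NoamAnnealing is a variant of this, if I recall correctly). The configured base lr acts as a multiplicative scale on the whole curve, so halving it halves the learning rate at every step, while `warmup_steps` and `d_model` control the shape and location of the peak:

```python
def noam_lr(step, d_model=512, warmup_steps=10000, scale=1.0):
    """Standard Noam schedule: linear warmup, then inverse-sqrt decay.

    `scale` plays the role of the base lr set in the config; it multiplies
    the whole curve, shifting the peak and final LR proportionally.
    """
    step = max(step, 1)
    return scale * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak LR occurs at step == warmup_steps:
print(noam_lr(10000))  # ~0.000442 with d_model=512, warmup_steps=10000, scale=1.0
```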