Open phamvandan opened 3 years ago
Hi @phamvandan, that's a tough question to answer without running some experiments :). What optimizer are you using? Is the training loss not converging either?
If you're using SGD, I would run experiments lowering its learning rate by factors of 10: 0.1, 0.01, 0.001, 0.0001, and so on. Also, in my experience, the Adam optimizer with its default parameters is a good starting point for new experiments.
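To make that sweep concrete, here's a minimal sketch of trying learning rates that decrease by factors of 10 and keeping the one with the lowest dev WER. `train_and_eval` is a hypothetical placeholder for your own training run; swap in whatever script you use.

```python
def lr_grid(start=0.1, steps=4):
    """Learning rates decreasing by factors of 10: 0.1, 0.01, 0.001, 0.0001."""
    return [start / (10 ** i) for i in range(steps)]

def sweep(train_and_eval):
    """Run one training job per learning rate; train_and_eval(lr) is assumed
    to return the dev-set WER after training (placeholder, not a real API)."""
    results = {lr: train_and_eval(lr) for lr in lr_grid()}
    best_lr = min(results, key=results.get)  # lowest WER wins
    return best_lr, results
```

Even a few short runs per learning rate (a fraction of an epoch each) is usually enough to see which rates diverge immediately.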
One more thing: are you fine-tuning the wav2vec features together with the whole net, or not? Start with frozen wav2vec features first.
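If you're in PyTorch, freezing the feature extractor is just a matter of turning off gradients for its parameters. A minimal sketch, assuming your wav2vec model is an `nn.Module` (the tiny `nn.Sequential` here is only a stand-in for it):

```python
import torch.nn as nn

# Stand-in for the wav2vec feature extractor (assumption: your real
# model exposes .parameters() like any nn.Module).
wav2vec = nn.Sequential(nn.Conv1d(1, 16, kernel_size=5), nn.ReLU())

# Freeze: no gradients flow into the feature extractor during training.
for p in wav2vec.parameters():
    p.requires_grad = False

# Also put it in eval mode so dropout/batch-norm stats stay fixed.
wav2vec.eval()
```

Then build your optimizer only over the conv_glu parameters, so the frozen features are never updated.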
I followed Mr Mai Long, who reproduced wav2vec features as input: https://github.com/mailong25/vietnamese-speech-recognition
Then I think it's better to ask Mr Mai Long directly how he reproduced it. As far as I know, the original paper uses frozen wav2vec features.
I have trained conv_glu (wav2letter, 2016) with features extracted from a wav2vec model. I chose a learning rate of 1.0 and a batch size of 36, on a dataset of over 500 hours of voice audio, but the WER didn't converge. What are a good learning rate and batch size for training conv_glu (wav2letter, 2016) on features extracted from a wav2vec model?