SeanNaren / deepspeech.pytorch

Speech Recognition using DeepSpeech2.

Chinese Mandarin Speech Recognition is not working well #320

Closed: gentaiscool closed this issue 6 years ago

gentaiscool commented 6 years ago

I am trying to train a model on Chinese data (HKUST Mandarin, 150 hours). After several epochs, the CER has plateaued at around 52%. I generated my labels from the dataset (Chinese characters plus alphanumerics, with "_" as the first element) and used a 3e-4 learning rate. The original audio is stereo at 8 kHz; I converted it to mono 8 kHz. Do I need to change the data processing in the code to make it work? Thank you.

DataParallel(
  (module): DeepSpeech(
    (conv): Sequential(
      (0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(0, 10))
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): Hardtanh(min_val=0, max_val=20, inplace)
      (3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1))
      (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Hardtanh(min_val=0, max_val=20, inplace)
    )
    (rnns): Sequential(
      (0): BatchRNN(
        (rnn): GRU(32, 800, bias=False, bidirectional=True)
      )
      (1): BatchRNN(
        (batch_norm): SequenceWise(
          BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (rnn): GRU(800, 800, bias=False, bidirectional=True)
      )
      (2): BatchRNN(
        (batch_norm): SequenceWise(
          BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (rnn): GRU(800, 800, bias=False, bidirectional=True)
      )
      (3): BatchRNN(
        (batch_norm): SequenceWise(
          BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (rnn): GRU(800, 800, bias=False, bidirectional=True)
      )
      (4): BatchRNN(
        (batch_norm): SequenceWise(
          BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (rnn): GRU(800, 800, bias=False, bidirectional=True)
      )
    )
    (fc): Sequential(
      (0): SequenceWise(
        Sequential(
          (0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (1): Linear(in_features=800, out_features=3757, bias=False)
        )
      )
    )
    (inference_softmax): InferenceBatchSoftmax()
  )
)
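For anyone with the same preprocessing questions: below is a minimal sketch (not code from this repo) of the two steps described above, building a labels file with the CTC blank "_" as the first entry, and converting stereo audio to mono. The file paths `transcripts.txt`, `utt.wav`, and `labels_mandarin.json` are placeholders. Also worth checking: if I remember correctly, the data loader assumes 16 kHz by default, so an 8 kHz corpus may need the sample rate set explicitly.

```python
import json

import soundfile as sf

# 1) Build a labels file from the training transcripts, with the CTC
#    blank "_" as the first entry (as described in the issue).
#    'transcripts.txt' (one transcript per line) is a placeholder path.
charset = set()
with open('transcripts.txt', encoding='utf-8') as f:
    for line in f:
        charset.update(line.strip())

labels = ['_'] + sorted(charset)  # the blank symbol must come first
with open('labels_mandarin.json', 'w', encoding='utf-8') as f:
    json.dump(labels, f, ensure_ascii=False)

# 2) Convert a stereo 8 kHz recording to mono by averaging the channels.
#    'utt.wav' is a placeholder; sox would work equally well.
data, sr = sf.read('utt.wav')       # data shape: (num_samples, 2) for stereo
mono = data.mean(axis=1)
sf.write('utt_mono.wav', mono, sr)  # keeps the original 8 kHz rate
```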

ryanleary commented 6 years ago

You might consider sweeping the learning rate or trying different architectures. 150 hours isn't very much, so you may need to make some concessions in terms of model size and number of parameters.
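A sweep can be as simple as looping over candidate rates. A rough sketch follows; the train.py flags used here are assumptions based on the README of that era, so verify the exact names with `python train.py --help`:

```python
import subprocess

# Hypothetical learning-rate sweep over a few candidate values.
# The train.py flag names below are assumptions -- check `--help`.
for lr in ['1e-4', '3e-4', '1e-3']:
    subprocess.run(['python', 'train.py',
                    '--train-manifest', 'data/train_manifest.csv',
                    '--val-manifest', 'data/val_manifest.csv',
                    '--labels-path', 'labels_mandarin.json',
                    '--lr', lr,
                    '--epochs', '10'],
                   check=True)
```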

gentaiscool commented 6 years ago

Thank you. After I changed the architecture, it started to work. Now I can reach 41% CER after 10 epochs.

minushuang commented 6 years ago

@gentaiscool
Hi, I got a high loss (600+) training my Mandarin model with the default architecture. The dataset is THCHS-30, only about 30 hours, which is very little. Can you share your changed architecture? Thank you.

miguelvr commented 6 years ago

@minushuang The DeepSpeech architectures are very data-hungry. With so little data, I think you'll get better results with more traditional approaches such as GMM-HMM.

gentaiscool commented 6 years ago

@minushuang I used 4 layers with a hidden size of 400 each. Let me know if anything works for you.
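For a rough sense of why this helps: the RNN stack shrinks to about a fifth of the default's parameter count. A quick self-contained estimate, mirroring the summed-bidirectional GRU stack in the printout above (conv and FC layers ignored):

```python
import torch.nn as nn

def gru_stack_params(hidden_size, num_layers, input_size=32):
    """Parameter count for a stack of bidirectional GRUs (bias=False),
    where the two directions are summed so each layer feeds hidden_size
    features to the next -- matching the model printout above."""
    total, in_size = 0, input_size
    for _ in range(num_layers):
        rnn = nn.GRU(in_size, hidden_size, bias=False, bidirectional=True)
        total += sum(p.numel() for p in rnn.parameters())
        in_size = hidden_size
    return total

print(gru_stack_params(800, 5))  # default config: ~34.7M parameters
print(gru_stack_params(400, 4))  # reduced config: ~6.8M, about 5x smaller
```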

minushuang commented 6 years ago

@gentaiscool OK, I will give it a try.