NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

ASR for German (or any other language that is not English) #497

Open xaiguy opened 4 years ago

xaiguy commented 4 years ago

Hello community,

I'm currently trying to transfer the success of end-to-end CTC ASR models for English to public German speech datasets and am not seeing the same kind of results.

The dataset that I use is Tuda-de (127 hours, 147 speakers). I also tried a version with additional training data (~300 hours total). I wanted to use DS2 as a baseline and had >70% WER after 200,000 training steps. Now I have switched to Jasper10x5 with augmentation and am currently hovering around 70% WER at 50 epochs. I'll continue training until 400 epochs, but it doesn't feel right. The public SOTA for this dataset is ~28% WER without additional data and ~13% with additional data.

I'm curious whether anyone has tried Jasper on non-English data yet and can share whether there were any specific problems and/or how long they trained until decent results kicked in?

okuchaiev commented 4 years ago

Can you please share the link to the Tuda-de dataset you are using? Also, 127 hours seems too small for Jasper10x5 - perhaps you could try a smaller version first?

xaiguy commented 4 years ago

Thanks for the quick reply!

Sure, the link to Tuda-de is the following: ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v2.tar.gz

I also thought it might be too small. I then combined it with the M-AILABS corpus for training (http://www.caito.de/data/Training/stt_tts/de_DE.tgz), which is ~230 hours, adding up to ~350. This still might be too small, I guess. I'll continue to train the 10x5 version for a while and try a smaller Jasper model next week.
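Combining corpora like this mostly comes down to concatenating the training manifests. As a sketch, assuming the CSV manifest layout with `wav_filename,wav_filesize,transcript` columns that OpenSeq2Seq's speech-to-text data layer reads (check your config's `dataset_files` for the exact format your setup uses):

```python
import csv

MANIFEST_FIELDS = ["wav_filename", "wav_filesize", "transcript"]

def merge_manifests(paths, out_path):
    """Concatenate several CSV manifests (e.g. Tuda-de + M-AILABS) into one
    training manifest. Assumes each input file has the header above."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=MANIFEST_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

The merged file can then be pointed to from a single `dataset_files` entry instead of listing the corpora separately.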

borisgin commented 4 years ago

Have you looked at Spoken Wikipedia?

xaiguy commented 4 years ago

@borisgin Yes, I have.

From what I've read and heard in recent weeks, it seems that other people are facing similar problems with end-to-end ASR for German. It may indeed be that the quantity (and perhaps also the quality) of the publicly available labeled datasets is too low for completely data-driven models.

Cerebrock commented 4 years ago

It would be great if NVIDIA made some pre-trained models available for other languages... or maybe helped out with some free compute :B. I'm planning on training Jasper for Spanish, will stay in touch!

xaiguy commented 4 years ago

Short update: I trained Jasper for a little more than 200 epochs with ~600 hrs of German audio data. At around 960k steps I reached a 38% WER without and 23% with LM rescoring. While this is not anywhere near the SOTA posted above, it would still be satisfying enough to give it a try in production. Unfortunately, I need the model to work on (noisy) telephone speech and the methods that one would use for classical approaches (injecting noise, downsampling) weren't enough to make the model adapt. Since I won't be able to collect enough (labeled) telephone speech, I stopped experimentation with Jasper for now and switched back to Kaldi. I think if I had trained a week or two more, Jasper would have worked pretty well for a clean(!) speech use case.
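For reference, the "classical" adaptations mentioned above (band-limiting and noise injection) can be sketched as a simple augmentation pass over the training audio. This is a crude approximation, not a real telephone channel model; the moving-average anti-alias filter and the white-noise SNR target are both simplifying assumptions:

```python
import numpy as np

def telephone_augment(wave, snr_db=15.0, seed=0):
    """Crudely simulate narrowband telephone audio from a 16 kHz signal:
    low-pass with a short moving-average filter, decimate to 8 kHz,
    then add white noise at a target signal-to-noise ratio."""
    # Crude anti-alias low-pass before decimation.
    smoothed = np.convolve(wave, np.ones(4) / 4, mode="same")
    narrow = smoothed[::2]  # 2x decimation: 16 kHz -> 8 kHz
    rng = np.random.default_rng(seed)
    sig_pow = np.mean(narrow ** 2) + 1e-12
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    noisy = narrow + rng.normal(0.0, np.sqrt(noise_pow), size=narrow.shape)
    return noisy.astype(np.float32)
```

As the comment above notes, this kind of simulation was not enough to make the model adapt; real (labeled) telephone speech seems hard to substitute.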

Cerebrock commented 4 years ago

What about pytorch-kaldi or ESPnet? Or some more sophisticated preprocessing, such as denoising autoencoders? Great to know about your experience, thanks!

tayciryahmed commented 4 years ago

> Short update: I trained Jasper for a little more than 200 epochs with ~600 hrs of German audio data. At around 960k steps I reached a 38% WER without and 23% with LM rescoring. While this is not anywhere near the SOTA posted above, it would still be satisfying enough to give it a try in production. Unfortunately, I need the model to work on (noisy) telephone speech and the methods that one would use for classical approaches (injecting noise, downsampling) weren't enough to make the model adapt. Since I won't be able to collect enough (labeled) telephone speech, I stopped experimentation with Jasper for now and switched back to Kaldi. I think if I had trained a week or two more, Jasper would have worked pretty well for a clean(!) speech use case.

Had the same experience for Spanish (600+ hours, Jasper fine-tuning, 150 epochs & augmentation): 14% WER (using 3-gram LM rescoring) on the Common Voice Spanish test set, but around 30% on noisy telephone data. @xaiguy Did you get any better results using Kaldi? Also, for the fine-tuning stage, did you normalize the accents and German-specific characters like ß? Thanks.
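On the normalization question: one common scheme (an assumption here, not what either poster necessarily did) is to fold German-specific characters into ASCII digraphs and strip remaining accents, so the CTC alphabet stays plain a-z. Whether to fold umlauts or keep them as extra vocabulary entries is a design choice:

```python
import unicodedata

# One possible folding table; keeping ä/ö/ü as their own CTC symbols
# is an equally valid alternative.
GERMAN_FOLD = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def normalize_transcript(text):
    """Lowercase, fold German characters, and strip remaining combining
    accents (e.g. from French loanwords) for a plain a-z alphabet."""
    text = text.lower()
    text = "".join(GERMAN_FOLD.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Whichever scheme is chosen, the same normalization must be applied to the LM training corpus, or the rescoring vocabulary will not match the acoustic model's output.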

soheiltehranipour commented 3 years ago

Thanks. Would you please share a notebook or Colab of your project?

aayushkubb commented 3 years ago

Nothing will beat actual training on noisy samples. However, SpecAugment can help you out. Also, the claimed SOTA is on a very clean, curated dataset.
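For context, SpecAugment masks random frequency bands and time spans of the input spectrogram during training. A minimal sketch of the masking step only (no time warping), with hypothetical mask-size parameters:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20, seed=None):
    """Zero out random frequency bands and time spans of a (freq, time)
    spectrogram, SpecAugment-style. Returns a masked copy."""
    out = spec.copy()
    rng = np.random.default_rng(seed)
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)            # mask width in bins
        f0 = rng.integers(0, max(1, n_freq - f))  # mask start
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)            # mask width in frames
        t0 = rng.integers(0, max(1, n_time - t))  # mask start
        out[:, t0:t0 + t] = 0.0
    return out
```

OpenSeq2Seq's data layers expose augmentation options of their own, so in practice this would be configured there rather than hand-rolled; the sketch is just to show what the technique does.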

I have been able to replicate the published results on the LibriSpeech dataset without much effort, but for any other language, or any dataset that isn't as clean as LibriSpeech, pushing WER below 20% is difficult.

Also, the way you build the LM will be key in moving from 20% down to 15%. Even if you only reach 15%, that's a very good number considering the quality of the data.

How is your audio? Is the transcription quality good? From what I have seen, manual annotations are 90-95% accurate (LibriSpeech is very clean and 99%+ accurate). The number of steps doesn't matter as long as the loss keeps decreasing. Can you share the loss curve or a TensorBoard trace?