Open PhilipMay opened 3 years ago
Hi Philip, nice to see you here. You will need to fine-tune the model with good German material before you can use it for anything useful. This repo has a good overview and scripts showing how to do that.
I imagine the recommended amounts of data mentioned in that repo won't apply for XLSR-53, as they used wav2vec_small? With XLSR-53, we could bypass all the language-specific pretraining, no? Also not sure why they use a language model. I think in the ASR implementations I've seen using wav2vec, they just use a simple linear classifier on top of the wav2vec representation, trained on the small supervised dataset. I could be wrong, but it seems like it should be much simpler than what that repo does, given XLSR-53.
Hi @olafthiele - nice to meet you here. :-)
Do you have any experience with these models for the German language?
Yep, just trained a couple of models with different params. Have written you an email. For all the others and reference, use the repo mentioned above (thanks @mailong25) to begin with and adapt params to your needs. And as always: gold in, gold out or s...t in, s...t out :-)
@olafthiele, could you please share which model was used as a base? What parameters were set for fine-tuning? How many hours of audio? Was augmentation used? I want to train a model for the Russian language. I have 50 hours of labeled, noisy audio.
@Gorodecki, we used the published XLSR-53 model together with the 100h base config values for first tests with fewer steps. But that depends on your data. We are testing anything between 10 mins and 500 hours with different configs. No augmentation. Our data is not very noisy, so maybe clean yours in advance? And if you do Russian, have you tried @snakers4's models?
@olafthiele thank you! :+1: That repo is a dataset, though, not models.
Sorry, not writing a book here :-) Check him, there are models somewhere. Not wav2vec, but Russian and easy to start with last time I checked. And maybe the data helps you too.
Hey @olafthiele are you aware of this: https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german
They claim to have a WER (word error rate) of 18.5 % on the German Common Voice corpus (test set).
But I do not know how they trained it...
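For reference, WER is usually computed as the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the model output, divided by the number of reference words. Below is a minimal sketch of that calculation; the function name `word_error_rate` is just illustrative, and published numbers such as the 18.5 % above may involve text normalization (casing, punctuation) that this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Dropping one word out of a four-word reference, for example, gives one deletion over four words, i.e. a WER of 25 %.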
Common Voice is a good start, but if possible, find better material for fine-tuning or maybe use a more suitable language model.
trying out the Hugging Face implementation would be a great start. their demo is super nice. in terms of "how they trained it" - they actually just took our published model.
one downside is i don't think they use a language model for decoding, but that can be added on top after you get the argmax decoding working
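For anyone wondering what "argmax decoding" means here: with a CTC-trained model you take the highest-scoring token per audio frame, merge consecutive repeats, and drop the blank token. A minimal sketch in plain Python follows; the function name, the toy vocabulary, and `blank_id=0` are assumptions for illustration, not the actual fairseq or Hugging Face vocabulary layout.

```python
from itertools import groupby
from typing import Sequence


def greedy_ctc_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank_id: int = 0,  # assumption: index of the CTC blank token
) -> str:
    """Greedy (argmax) CTC decoding over per-frame token scores."""
    # 1. Pick the best-scoring token id for every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Merge runs of the same id (one token often spans several frames).
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks, which only separate repeated tokens.
    return "".join(vocab[token_id] for token_id in collapsed if token_id != blank_id)
```

A language model would replace step 1 with a beam search that rescores candidate sequences, which is why it can be added later without changing the acoustic model.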
Hi, I want to transcribe German speech to text using the XLSR-53 model. That model is mentioned here: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md
But the usage example from here: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md#example-usage is not really helpful.
Can you please help me with how to load and use the XLSR-53 model to convert a German .wav file to text?
Thanks Philip
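A rough sketch of what this could look like with the Hugging Face checkpoint linked in this thread (`facebook/wav2vec2-large-xlsr-53-german`), rather than the raw XLSR-53 model, which needs fine-tuning first. It assumes `torch`, `transformers`, and `soundfile` are installed and that the input is a mono 16 kHz .wav file; the function name `transcribe_wav` is made up for illustration, and this is not a fairseq-native recipe.

```python
MODEL_ID = "facebook/wav2vec2-large-xlsr-53-german"  # checkpoint linked in this thread


def transcribe_wav(path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe a mono 16 kHz .wav file with greedy (argmax) decoding."""
    # Imported here so the heavy dependencies are only needed when called.
    import soundfile as sf
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    speech, sample_rate = sf.read(path, dtype="float32")
    # Assumption: audio is already mono at 16 kHz; resample beforehand otherwise.
    if speech.ndim != 1:
        raise ValueError("expected mono audio")
    if sample_rate != 16000:
        raise ValueError(f"expected 16 kHz audio, got {sample_rate} Hz")

    inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Argmax per frame; the processor collapses repeats and removes blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe_wav("some_file.wav")` downloads the checkpoint on first use. No language model is involved in this sketch.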