How can I use XLSR-53 with fairseq?

facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

MIT License


How can I use XLSR-53 with fairseq? #3275

Open PhilipMay opened 3 years ago

PhilipMay commented 3 years ago

Hi, I want to convert German speech to text using the XLSR-53 model. That model is mentioned here: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md

But the usage example here: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md#example-usage is not really helpful.

Can you please help me load and use the XLSR-53 model to convert a German .wav file to text?

Thanks Philip

olafthiele commented 3 years ago

Hi Philip, nice to see you here. You will need to fine-tune the model with good German material before you can use it for something useful. This repo has a good overview and scripts showing how to do that.

guillefix commented 3 years ago

I imagine the recommended amounts of data mentioned in that repo won't apply to XLSR-53, as they used wav2vec_small? With XLSR-53, we could bypass all the language-specific pretraining, no? I'm also not sure what they use a language model for. I think the ASR implementations I've seen using wav2vec just put a simple linear classifier on top of the wav2vec representation, trained on the small supervised dataset. I could be wrong, but it seems like it should be much simpler than what that repo does, given XLSR-53.

PhilipMay commented 3 years ago

> Hi Philip, nice to see you here. You will need to fine-tune the model with good German material before you can use it for something useful. This repo has a good overview and scripts showing how to do that.

Hi @olafthiele - nice to meet you here. :-)

Do you have any experience with these models for German?

olafthiele commented 3 years ago

Yep, just trained a couple of models with different params. I have written you an email. For everyone else and for reference: use the repo mentioned above (thanks @mailong25) to begin with and adapt the params to your needs. And as always: gold in, gold out or s...t in, s...t out :-)

Gorodecki commented 3 years ago

@olafthiele, please share which model was used as a base. What parameters were set for fine-tuning? How many hours of audio? Was augmentation used? I want to train a model for Russian. I have 50 hours of labeled noisy audio.

olafthiele commented 3 years ago

@Gorodecki, we used the published XLSR-53 model together with the 100h base config values, with fewer steps for first tests. But that depends on your data. We are testing anything between 10 minutes and 500 hours with different configs. No augmentation. Our data is not very noisy; maybe clean yours in advance? And if you do Russian, have you tried @snakers4's models?

Gorodecki commented 3 years ago

@olafthiele thank you! :+1: That repo is a dataset, though, not models.

olafthiele commented 3 years ago

Sorry, not writing a book here :-) Check his repos, there are models somewhere. Not wav2vec, but Russian and easy to start with last time I checked. And maybe the data helps you too.

PhilipMay commented 3 years ago

Hey @olafthiele are you aware of this: https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german

They claim a WER (word error rate) of 18.5 % on the German Common Voice corpus (test set).

But I do not know how they trained it...
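For reference, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch of that definition follows; the function name and the example sentences are made up for illustration, and this is not the evaluation script behind the 18.5 % figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("das ist ein test", "das ist test"))  # 0.25
```

Corpus-level numbers are usually obtained by summing the errors and the reference word counts over all utterances before dividing, rather than averaging per-utterance rates.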

olafthiele commented 3 years ago

Common Voice is a good start, but if possible, find better material for fine-tuning or maybe use a more suitable language model.

alexeib commented 3 years ago

trying out the Hugging Face implementation would be a great start. their demo is super nice. in terms of "how they trained it" - they actually just took our published model.
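A rough sketch of that starting point, assuming the `transformers`, `torch` and `torchaudio` packages are installed and using the checkpoint linked above. The function name `transcribe` and the file name in the usage line are hypothetical; the model card asks for 16 kHz input, so the audio is resampled first. Treat this as an untested outline rather than a reference implementation.

```python
MODEL_ID = "facebook/wav2vec2-large-xlsr-53-german"  # checkpoint linked in this thread
TARGET_SAMPLE_RATE = 16_000  # the model card asks for 16 kHz input


def transcribe(wav_path: str) -> str:
    """Transcribe one audio file with greedy (argmax) decoding, no language model."""
    # Heavy dependencies are imported here so that importing this module stays cheap.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    waveform, sample_rate = torchaudio.load(wav_path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0)                     # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, TARGET_SAMPLE_RATE
        )

    inputs = processor(
        waveform.numpy(),
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(
            inputs.input_values, attention_mask=inputs.attention_mask
        ).logits

    # Greedy decoding: best token per frame; the processor collapses
    # repeats and removes the padding/blank token when decoding.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Usage would look like `print(transcribe("sample.wav"))`, where `sample.wav` stands in for your own recording. The first call downloads the checkpoint, which is large.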

one downside is i don't think they use a language model for decoding, but that can be added on top after you get the argmax decoding working
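For anyone unsure what argmax decoding means for a CTC model: pick the highest-scoring token in every frame, collapse consecutive repeats, then drop the blank token. The sketch below uses plain Python lists; the toy vocabulary, the scores, and the choice of index 0 as the blank are assumptions for illustration and do not match the real model's vocabulary.

```python
from itertools import groupby
from typing import Sequence


def greedy_ctc_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank_id: int = 0,  # assumption: index 0 is the CTC blank in this toy vocab
) -> str:
    """Greedy CTC decoding over per-frame token scores."""
    # 1. Argmax: the best token index for every frame.
    best_ids = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    # 2. Collapse runs of the same index into a single occurrence.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Remove blanks; a blank between two equal tokens keeps a double letter.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


toy_vocab = ["_", "h", "a", "l", "o"]  # "_" plays the role of the blank
frame_ids = [1, 1, 0, 2, 3, 0, 3, 4]   # h h _ a l _ l o
toy_scores = [
    [1.0 if k == i else 0.0 for k in range(len(toy_vocab))] for i in frame_ids
]
print(greedy_ctc_decode(toy_scores, toy_vocab))  # hallo
```

A language model would replace step 1: instead of committing to the single best token per frame, a beam search keeps several candidate transcripts and rescores them with the language model.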
