facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

10 minutes of Spanish with transcriptions #2889

Closed Adportas closed 2 years ago

Adportas commented 4 years ago

Hi, I have 10 minutes of Spanish audio with transcriptions and want to replicate results similar to the paper "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations". Can I use the English models listed here: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md? Could anyone provide some guidance on the required commands? I'm confused about whether only fine-tuning is required, and about the right parameters and file structure. Thanks in advance! Daniel

ahsanmemon commented 4 years ago

In my view, the model should train faster with the same language. So the model should give you some results with 10 minutes of English rather than 10 minutes of Spanish.

Adportas commented 4 years ago

Hi ahsanmemon, thanks for your comment. My eventual goal is to do everything in Spanish, but at the moment I don't have the hardware resources for that, so first I want to learn the mechanics; the information between the paper and the README is confusing to me. If I did it in English to learn: can I fine-tune the model "Wav2Vec 2.0 Large (LV-60), no fine-tuning, Libri-Light" (https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt) with 10 minutes of labeled English? Do you know which configuration options and files are needed for it to work as expected? I am very interested in experimenting but need a little help getting it to work. Thanks in advance! Daniel

ahsanmemon commented 4 years ago

Thanks for the message. Unfortunately, I am also just starting out with Wav2Vec 2.0, so I might not be able to help much with the configuration options, as I am still trying to get the hang of them myself. Best of luck anyway :)

R4ZZ3 commented 4 years ago

Hi,

I am working on figuring this out as well. I guess the pre-training should be done in Spanish, and the fine-tuning too. I have started looking into acquiring Finnish speech and will then try to fine-tune on it. The pre-training audio should be in roughly 10 s samples, if I am not mistaken? But do you know how the wav → text annotation should be done for the fine-tuning?

Rasmus
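[For reference, a sketch of the annotation layout asked about above, based on fairseq's examples/wav2vec README: fine-tuning expects a `train.tsv` manifest (first line is the audio root directory, then one `relative_path<TAB>num_frames` line per file) plus parallel `train.wrd` (word-level) and `train.ltr` (letter-level) transcript files in the same order. The paths below are placeholders; treat this as an illustration, not the official tooling.]

```python
# Sketch of the label files fairseq wav2vec 2.0 fine-tuning expects,
# assuming a list of (relative_wav_path, num_frames, transcript) samples.

def words_to_letters(transcript: str) -> str:
    """Convert a word-level transcript to fairseq's letter format:
    characters separated by spaces, each word terminated by '|'."""
    words = transcript.strip().upper().split()
    return " ".join(" ".join(word) + " |" for word in words)

def write_labels(samples, out_prefix):
    """Write the .tsv manifest and parallel .wrd / .ltr transcript files."""
    with open(out_prefix + ".tsv", "w") as tsv, \
         open(out_prefix + ".wrd", "w") as wrd, \
         open(out_prefix + ".ltr", "w") as ltr:
        tsv.write("/path/to/audio/root\n")  # first manifest line: audio root dir
        for path, frames, text in samples:
            tsv.write(f"{path}\t{frames}\n")
            wrd.write(text.strip().upper() + "\n")
            ltr.write(words_to_letters(text) + "\n")

print(words_to_letters("hola mundo"))  # → H O L A | M U N D O |
```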

AlanYen commented 3 years ago

I have tried fine-tuning the pre-trained model (wav2vec_vox.pt) with our in-house data (80 hours of Spanish / 300 hours of Russian).

After several failures, I finally got good enough results. The first key is that you must tune --mask-prob and --mask-channel-prob for your data; maybe disable both of them first. The second key is that you must set --freeze-finetune-updates to a larger value, e.g. 1/3 to 1/2 of --max-update.

I think 10 minutes is not enough if the fine-tuning data is not English; you may need up to 100 hours. The more data you use, the better the results.
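[For reference, the flags above fit into a fine-tuning invocation roughly like the following, based on the fairseq examples/wav2vec README of that era. This is a hedged sketch: the exact set of required flags varies by fairseq version, the paths are placeholders, and --freeze-finetune-updates is set to half of --max-update per the tip above.]

```shell
# Sketch of a wav2vec 2.0 CTC fine-tuning command (placeholder paths).
# --mask-prob / --mask-channel-prob may need tuning (or disabling) per dataset.
fairseq-train /path/to/manifest_dir \
    --w2v-path /path/to/wav2vec_vox.pt \
    --arch wav2vec_ctc --task audio_pretraining --criterion ctc \
    --labels ltr \
    --max-update 20000 --freeze-finetune-updates 10000 \
    --mask-prob 0.65 --mask-channel-prob 0.5
```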

Adportas commented 3 years ago

Hi @AlanYen, it is very interesting to read your comment; it gives me hope that I can use this. Could you share your settings files with us? Best regards! Daniel

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!