facebookresearch / muavic

MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

Noise parameters for decoding and training #15

Closed roudimit closed 8 months ago

roudimit commented 10 months ago

I am trying to figure out the noise parameters for the decode and train script to reproduce the results in the paper. For decoding, I originally tried adding babble noise from musan:

override.noise_wav=/path-to-musan/musan/tsv/babble \
override.noise_prob=1 \
override.noise_snr=0

I found that the average performance of the monolingual and multilingual models in the noisy condition was noticeably better than reported in the paper (while the clean-condition results were similar to the paper's). I also tried using the babble noise from lrs3 (override.noise_wav=/path-to-lrs3/noise/babble), and the average performance was closer to what was reported in the paper. Which noise should be used?

For training, are these the right parameters to add?

override.noise_wav=/path-to-musan/musan/tsv/all \
override.noise_prob=0.25 \
override.noise_snr=0

Also, for the pre-trained model ("All models FT from strongest large_vox_iter5.pt") is this the noisy pre-trained checkpoint or clean pre-trained checkpoint? I assume it's the noisy one, but just double checking.

Thanks for the help!

Anwarvic commented 8 months ago

Hi @roudimit,

So sorry for the late reply. The following are the answers to your questions:

Which noise should be used?

The babble noise used in lrs3 is created from English-only data. Our setup is multilingual, so we created more challenging noise samples using multilingual data. Hence, the worse performance.

For training, are these the right parameters to add?

Yes! Just change the override prefix to task and add the parameters to train.sh:

task.noise_wav=[path-to-multilingual-babble]
task.noise_prob=0.25 # this means 25% of audio samples will be noisy
task.noise_snr=0 # the signal-to-noise ratio is 0 dB (signal is as loud as noise)

When decoding, keep the override prefix and set noise_prob=1 instead of 0.25.
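
To make the two parameters concrete, here is a minimal Python sketch of what noise_prob and noise_snr are described as doing in this reply. It is not the repository's implementation: mix_at_snr and maybe_add_noise are hypothetical helper names, audio is represented as plain lists of floats, and the SNR is assumed to be in dB and measured on RMS amplitude.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech RMS / noise RMS matches `snr_db`, then add it.

    Assumption: SNR is in dB on RMS amplitude, so 0 dB means equal loudness.
    """
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise clip: nothing to add
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    # Loop the noise clip if it is shorter than the speech clip.
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]


def maybe_add_noise(speech, noise, noise_prob, snr_db, rng=random):
    """Add noise to a sample with probability `noise_prob`, else return it clean."""
    if rng.random() < noise_prob:
        return mix_at_snr(speech, noise, snr_db)
    return list(speech)
```

Under these assumptions, noise_prob=0.25 leaves roughly three quarters of the training samples clean, while noise_prob=1 makes every decoded sample noisy, which matches the train and decode settings given here.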

Is large_vox_iter5.pt the noisy pre-trained checkpoint?

Yes, as mentioned here it's noise-augmented and pre-trained on LRS3 + VoxCeleb2 (En).

Hope this is helpful! Feel free to close the issue if this resolves your issue.

roudimit commented 8 months ago

Hi @Anwarvic, thanks for the clarifications! Can you explain the process to create the multilingual noise? I want to be sure the noise is the same for a fair comparison. Is each language tested with babble noise created from that language only, or babble noise created from all the languages? Thanks!

Anwarvic commented 8 months ago

Hi @roudimit ,

Can you explain the process of creating the multilingual noise?

Sure, I created the multilingual-babble noise similar to how it's shown here. Regarding the values of num_samples and min_len, I used 5 and 30 respectively (as I remember).

is each language tested with babble noise created from that language only?

The noise was created by mixing different utterances from the same language. For example, to create babble_ar.wav, I mixed different utterances from that language (i.e. ar) only. However, when decoding, I choose one noise utterance at random, independent of the language.
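
The procedure described in this reply can be sketched in Python as follows. This is an illustration under stated assumptions, not the script that produced the released noise: make_babble, make_babble_per_language, and pick_decoding_noise are hypothetical names, audio is a plain list of floats, num_samples is assumed to be the number of utterances mixed per clip, and min_len is assumed to be a minimum utterance length in samples. The thread leaves the exact values and units of these two parameters uncertain.

```python
import random


def make_babble(utterances, num_samples, min_len, rng=random):
    """Average `num_samples` randomly chosen utterances into one babble clip.

    Assumption: only utterances with at least `min_len` samples are eligible,
    and the mix is truncated to the shortest chosen utterance.
    """
    eligible = [u for u in utterances if len(u) >= min_len]
    if len(eligible) < num_samples:
        raise ValueError("not enough utterances of the required length")
    chosen = rng.sample(eligible, num_samples)
    length = min(len(u) for u in chosen)
    return [sum(u[i] for u in chosen) / num_samples for i in range(length)]


def make_babble_per_language(utterances_by_lang, num_samples, min_len, rng=random):
    """Build one babble clip per language, mixing only that language's utterances."""
    return {
        lang: make_babble(utts, num_samples, min_len, rng)
        for lang, utts in utterances_by_lang.items()
    }


def pick_decoding_noise(babble_by_lang, rng=random):
    """Pick a babble clip at random, regardless of the language being decoded."""
    return rng.choice(list(babble_by_lang.values()))
```

With this layout, a clip such as the one for ar contains only ar speech, but a test utterance in any language may be paired with any language's clip at decoding time.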

Hope this helps!

roudimit commented 8 months ago

Hi @Anwarvic, I appreciate the follow-up reply and all the help! Can I double check my understanding with you?

Is that right? Thank you for clearing it up!

Also, it seems that usually 30 is used for num_samples, do you mean num_samples was 30 and min_len was 5?

Anwarvic commented 8 months ago

Hi @roudimit,

roudimit commented 8 months ago

Hi @Anwarvic thanks for the reply! That clears it up.