Noise parameters for decoding and training

roudimit commented 10 months ago

I am trying to figure out the noise parameters for the decode and train script to reproduce the results in the paper. For decoding, I originally tried adding babble noise from musan:

override.noise_wav=/path-to-musan/musan/tsv/babble \
override.noise_prob=1 \
override.noise_snr=0

I found that the average performance of the monolingual and multilingual models in the noisy condition was noticeably better than reported in the paper (while my clean-condition results were similar to the paper's). I also tried the babble noise from lrs3 (override.noise_wav=/path-to-lrs3/noise/babble), and the average performance was closer to what was reported in the paper. Which noise should be used?

For training, are these the right parameters to add?

override.noise_wav=/path-to-musan/musan/tsv/all \
override.noise_prob=0.25 \
override.noise_snr=0

Also, for the pre-trained model ("All models FT from strongest large_vox_iter5.pt"), is this the noisy pre-trained checkpoint or the clean one? I assume it's the noisy one, but just double-checking.

Thanks for the help!

Anwarvic commented 8 months ago

Hi @roudimit,

So sorry for the late reply. The following are the answers to your questions:

Which noise should be used?

The babble noise used in lrs3 is created from English-only data. Our setup is multilingual, so we created more challenging noise samples using multilingual data. Hence, the worse performance.

For training, are these the right parameters to add?

Yes! Just change the override prefix to task and add the parameters to train.sh:

task.noise_wav=[path-to-multilingual-babble]
task.noise_prob=0.25 # this means 25% of audio samples will be noisy
task.noise_snr=0 # the signal-to-noise ratio is 0 (signal is as loud as noise)

When decoding, yes, keep the override prefix and set noise_prob=1 instead of 0.25.
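To make the meaning of noise_prob and noise_snr concrete, here is a minimal sketch of this kind of augmentation. It is not the code from the repository: the function names (rms, add_noise, maybe_add_noise) are hypothetical, the SNR is assumed to be in dB, and the noise clip is assumed to be already cropped to the length of the utterance.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise(clean, noise, snr_db=0.0):
    """Scale `noise` so the speech-to-noise RMS ratio matches `snr_db`, then add it."""
    # Assumption: SNR is in dB, so snr_db=0 gives noise as loud as the speech.
    target_noise_rms = rms(clean) / (10 ** (snr_db / 20))
    scale = target_noise_rms / max(rms(noise), 1e-12)
    return [c + scale * n for c, n in zip(clean, noise)]


def maybe_add_noise(clean, noise, noise_prob=0.25, snr_db=0.0, rng=random):
    """Corrupt the utterance with probability `noise_prob`, else return a copy."""
    if rng.random() < noise_prob:
        return add_noise(clean, noise, snr_db)
    return list(clean)
```

With noise_prob=0.25 roughly one utterance in four is corrupted, which matches the training setting; with noise_prob=1 every utterance is corrupted, which matches the decoding setting.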

Is large_vox_iter5.pt the noisy pre-trained checkpoint?

Yes, as mentioned here it's noise-augmented and pre-trained on LRS3 + VoxCeleb2 (En).

Hope this is helpful! Feel free to close the issue if this resolves your issue.

roudimit commented 8 months ago

Hi @Anwarvic, thanks for the clarifications! Can you explain the process to create the multilingual noise? I want to be sure the noise is the same for a fair comparison. Is each language tested with babble noise created from that language only, or babble noise created from all the languages? Thanks!

Anwarvic commented 8 months ago

Hi @roudimit ,

Can you explain the process of creating the multilingual noise?

Sure, I created the multilingual-babble noise similar to how it's shown here. Regarding the values of num_samples and min_len, I used 5 and 30 respectively (as far as I remember).

Is each language tested with babble noise created from that language only?

The noise was created by mixing different utterances from the same language. For example, to create babble_ar.wav, I mixed different utterances from that language (i.e. ar) only. However, when decoding, I choose one noise file at random, independently of the language.
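The procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script that was actually used: make_babble and pick_noise are hypothetical names, min_len is assumed to be a duration in seconds, utterances shorter than that are assumed to be skipped, and the mix is assumed to be a plain sample-wise average.

```python
import random


def make_babble(utterances, num_samples=5, min_len=30, sample_rate=16000, rng=random):
    """Mix `num_samples` utterances from one language into a single babble track."""
    target = min_len * sample_rate  # required length in samples
    # Assumption: only utterances at least `min_len` seconds long are eligible.
    eligible = [u for u in utterances if len(u) >= target]
    if len(eligible) < num_samples:
        raise ValueError("not enough long utterances to build babble noise")
    chosen = rng.sample(eligible, num_samples)
    # Crop each chosen utterance to the target length and average sample-wise.
    return [sum(u[i] for u in chosen) / num_samples for i in range(target)]


def pick_noise(babble_files, rng=random):
    """At decoding time, pick one babble file at random, whatever the test language."""
    return rng.choice(babble_files)
```

Calling make_babble once per language on that language's utterances would give one track per language (babble_ar.wav and so on), while pick_noise draws from the pooled list of all tracks.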

Hope this helps!

roudimit commented 8 months ago

Hi @Anwarvic, I appreciate the follow-up reply and all the help! Can I double check my understanding with you?

(1) For training, you use the MUSAN / LRS3 noise, which is the same as AV-HuBERT ("Also, we randomly augment 25% of the input samples with multiple types of additive noises with a SNR (signal-to-noise ratio) of 0. The noise audio clips in the categories of “natural”, “music” and “babble” are sampled from MUSAN dataset [28], while the overlapping “speech” noise samples are drawn from LRS3TED. In creating “speech” and “babble” noise sets, we ensure there are no speaker overlap among different partitions.").

(2) For testing, you generated one babble noise file per language, using speakers only from that language. However, when you decode on a language, you select the babble noise file randomly from all of the languages (i.e. when you decode on Arabic, the babble noise file could be from any language, including Arabic or English). So the noise tsv file should contain this:

muavic/noise/babble/babble_en.wav
muavic/noise/babble/babble_ar.wav
muavic/noise/babble/babble_de.wav
muavic/noise/babble/babble_el.wav
muavic/noise/babble/babble_es.wav
muavic/noise/babble/babble_fr.wav
muavic/noise/babble/babble_it.wav
muavic/noise/babble/babble_pt.wav
muavic/noise/babble/babble_ru.wav

Is that right? Thank you for clearing it up!

Also, it seems that 30 is usually used for num_samples. Did you mean that num_samples was 30 and min_len was 5?

Anwarvic commented 8 months ago

Hi @roudimit,

Regarding (1), yes, we used the same noise files as AV-HuBERT.
Regarding (2), yes, the tsv file should look similar to what you've pointed out.
Regarding num_samples and min_len: no, I used num_samples=5 and min_len=30. It sounded pretty realistic with just 5 samples.
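For anyone reproducing the decoding setup, a small sketch like the following could generate the noise list confirmed above, with one path per line. The helper names and the output file name in the usage note are hypothetical. The directory layout and language codes are taken from the list earlier in this thread.

```python
from pathlib import Path

# Language codes as they appear in the file list earlier in this thread.
LANGS = ["en", "ar", "de", "el", "es", "fr", "it", "pt", "ru"]


def noise_manifest_lines(noise_dir="muavic/noise/babble", langs=LANGS):
    """Return one babble wav path per language, in the order given."""
    return [f"{noise_dir}/babble_{lang}.wav" for lang in langs]


def write_noise_manifest(tsv_path, noise_dir="muavic/noise/babble", langs=LANGS):
    """Write the paths to `tsv_path`, one per line, and return them."""
    lines = noise_manifest_lines(noise_dir, langs)
    Path(tsv_path).write_text("\n".join(lines) + "\n")
    return lines
```

For example, write_noise_manifest("babble_all.tsv") would produce a file that noise_wav can point to, assuming the loader reads one wav path per line.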

roudimit commented 8 months ago

Hi @Anwarvic, thanks for the reply! That clears it up.

facebookresearch / muavic

Noise parameters for decoding and training #15