facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Evaluating Wav2vec 2.0 with Transformer LM #2977

Closed: BaPannier closed this issue 2 years ago

BaPannier commented 3 years ago

What is your question? How can I reproduce the WER improvement obtained by using the proposed Transformer LM instead of Viterbi decoding?

What have you tried? Files used:

head -3 dict.txt
THE 49059384
AND 26362574
OF 24795903

Command used:

python examples/speech_recognition/infer.py /path/to/librispeech --task audio_pretraining --nbest 1 --path /path/to/wav2vec2_vox_960h.pt --gen-subset dev_clean --results-path outputdir --w2l-decoder fairseqlm --lm-model /path/to/lm_librispeech_word_transformer.pt --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000

This produces a WER above 50, while Viterbi decoding gives a WER of about 2. When I add the lexicon file from https://github.com/pytorch/fairseq/issues/2734 via the argument --lexicon /path/to/librispeech_lexicon.lst, I get a WER of about 6.
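For reference, the decoder's lexicon maps each word to its letter spelling followed by the word-boundary token |. Below is a minimal sketch of how such a file could be built from dict.txt; the file names and exact format are assumptions based on the linked librispeech_lexicon.lst, not an official recipe.

# Hedged sketch: build a letter-spelling lexicon from dict.txt.
# Assumes dict.txt has one "WORD COUNT" pair per line and that the decoder
# expects "WORD W O R D |" per line (as in librispeech_lexicon.lst).
with open("dict.txt") as fin, open("librispeech_lexicon.lst", "w") as fout:
    for line in fin:
        word = line.split()[0]
        fout.write(word + " " + " ".join(word) + " |\n")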

What’s your environment?

  • fairseq 0.10.0 (latest stable release)
  • wav2letter branch v0.2 for Python bindings + the patch from facebookresearch/wav2letter#775 (otherwise imports from w2l_decoder.py fail due to the missing LexiconFreeDecoder)

I don’t know what I did wrong. Thank you for your answer!

HJJ256 commented 3 years ago


Did you get any solution for this? @BaPannier

yuseungwoo commented 3 years ago

Please replace lines 416-417 with the code below; it spells each word out from its letters followed by the word-boundary token |. If you do that, it may work.

examples/speech_recognition/w2l_decoder.py

414 word_idx = self.worddict.index(word)
415 _, score = self.lm.score(start_state, word_idx, no_cache=True)
416 for spelling in [list(word + "|")]:
417     if word != "":
418         # print("%s\t%s\t%s" % (word, spelling, spellings))

Guruprasad68 commented 3 years ago

Hi, to recreate the results, I noticed that the LM weight, word insertion penalty, and beam size also play an important role. The authors used a variety of values depending on the fine-tuning data, the LM (Transformer vs. KenLM), and the set being decoded. Please refer to the paper; the ablation section lists the values used for the different experiments. I followed this for the 1-hour BASE fine-tuned model with both the 4-gram KenLM and the Transformer LM and got their results. With an LM weight of 2 and the word insertion penalty from the command above, I was getting around 10-15% higher WER.
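One way to search over those decoding hyperparameters is a small sweep around the values reported in the paper. The sketch below simply re-runs the infer.py command from above over a grid of --lm-weight and --word-score values; the paths and the grid itself are placeholders, not the paper's settings.

# Hypothetical sweep over --lm-weight and --word-score for infer.py.
# Paths and candidate values are placeholders; pick a grid around the
# numbers reported in the wav2vec 2.0 paper for your fine-tuning split.
import itertools
import subprocess

DATA = "/path/to/librispeech"
MODEL = "/path/to/wav2vec2_vox_960h.pt"
LM = "/path/to/lm_librispeech_word_transformer.pt"
LEXICON = "/path/to/librispeech_lexicon.lst"

for lm_weight, word_score in itertools.product([1.0, 1.5, 2.0], [-2.0, -1.0, 0.0]):
    results = "outputdir/lmw{}_ws{}".format(lm_weight, word_score)
    subprocess.run(
        ["python", "examples/speech_recognition/infer.py", DATA,
         "--task", "audio_pretraining", "--nbest", "1",
         "--path", MODEL, "--gen-subset", "dev_clean",
         "--results-path", results, "--w2l-decoder", "fairseqlm",
         "--lm-model", LM, "--lexicon", LEXICON,
         "--lm-weight", str(lm_weight), "--word-score", str(word_score),
         "--sil-weight", "0", "--criterion", "ctc",
         "--labels", "ltr", "--max-tokens", "4000000"],
        check=True,
    )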

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!