[Open] patrickvonplaten opened this issue 2 years ago
Gently pinging @sravyapopuri388 and @alexeib
Thanks for the ping @patrickvonplaten. I will look into this and get back to you.
Hey @sravyapopuri388, any updates on this by any chance? :-)
Hi, I tried decoding the model using the following command from the wiki and the results are good. Could you please recheck your setup? Thanks!
SUBSET=dev_other
python3 examples/speech_recognition/infer.py $DATA_DIR --task audio_finetuning \
--nbest 1 --path $CKPT --gen-subset $SUBSET \
--results-path $result_path --w2l-decoder viterbi \
--criterion ctc --labels $LABELS --max-tokens 0 \
--post-process letter --word-score -1 --sil-weight 0 --batch-size 1
Hey @sravyapopuri388,
Sorry, I don't have access to /checkpoint/abaevski/data/speech/libri/10h/wav2vec/raw, dev_other, or kenlm.bin, so it's not possible for me to run this command.
If possible, it would be great if you could post a command that shows how the model gives good results on a single audio file without a language model - this would be super helpful for the community to use these models.
Could you maybe check the above commands to see if CTC without a language model works correctly?
Do you know which dictionary was used for the model?
Hi @patrickvonplaten, I updated the above command to not use a language model and it still works correctly. I used the dictionary open-sourced in the wav2vec README here.
To run with a single audio file, you can format it in the wav2vec data format and run the above command.
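As a rough sketch of that data format (hypothetical paths; assumes a 16 kHz mono WAV and the soundfile package): the manifest is a .tsv with the root directory on the first line and one relative-path / num-samples pair per line, plus a matching .ltr file with the letter-level transcript.

import soundfile as sf

wav_path = "/data/audio/sample.wav"                 # hypothetical audio file
frames = sf.info(wav_path).frames

with open("/data/manifests/test.tsv", "w") as f:    # hypothetical manifest path
    f.write("/data/audio\n")                        # root directory
    f.write(f"sample.wav\t{frames}\n")              # relative path <TAB> number of samples

# /data/manifests/test.ltr would then hold the letter-level transcript, e.g.
# "A | S A M P L E | T R A N S C R I P T |"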
Hey @sravyapopuri388, thanks for the pointers - I used the wrong dictionaries :sweat_smile: . Decoding now works as expected!
Hey @patrickvonplaten, can you post the command which worked for you? Thanks.
The very first command actually worked correctly @rahulshivajipawar
There is also a HF implementation now: https://github.com/huggingface/transformers/pull/16812
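As a rough sketch of how that would be used (the checkpoint name facebook/wav2vec2-conformer-rel-pos-large-960h-ft is an assumption; check the Hub for the exact identifier):

import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC

model_id = "facebook/wav2vec2-conformer-rel-pos-large-960h-ft"  # assumed Hub id
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ConformerForCTC.from_pretrained(model_id)

# Dummy LibriSpeech clean sample from the Hugging Face Hub (16 kHz mono)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(processor.batch_decode(logits.argmax(dim=-1)))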
@patrickvonplaten Hello, is there any pretraining and fine-tuning notebook for Wav2Vec2-Conformer using transformers?
🐛 Bug
Wav2Vec2's newly released fine-tuned conformer checkpoints (see here) don't produce reasonable results on an example of Librispeech.
I'm not sure if the model requires a different dictionary.
To Reproduce
Download 960h fine-tuned checkpoint:
wget https://dl.fbaipublicfiles.com/fairseq/conformer/wav2vec2/librilight/LL_relpos_PT_960h_FT.pt
Download Librispeech Dict:
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt
Load a sample of the Librispeech clean dataset for inference. You can load a dummy sample via the Hugging Face Hub:
pip install datasets
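For completeness, here is a minimal sketch of loading the checkpoint and running the sample through it with fairseq (not the exact script from the issue; it assumes fairseq and datasets are installed, that dict.ltr.txt sits in the current directory, and that helper names such as get_logits match your fairseq version):

import torch
from datasets import load_dataset
from fairseq import checkpoint_utils

# Load the fine-tuned conformer checkpoint; "data" must point at the directory
# containing dict.ltr.txt so the fine-tuning task can build its dictionary.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["LL_relpos_PT_960h_FT.pt"], arg_overrides={"data": "."}
)
model = models[0].eval()

# Dummy LibriSpeech clean sample from the Hugging Face Hub (16 kHz mono)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = torch.tensor(ds[0]["audio"]["array"], dtype=torch.float32).unsqueeze(0)  # [1, num_samples]

with torch.no_grad():
    net_output = model(source=audio, padding_mask=torch.zeros_like(audio, dtype=torch.bool))
    logits = model.get_logits(net_output)  # [seq_len, 1, vocab_size]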
The output is a tensor of shape [seq_len, 1, vocab_size]. We are interested in the most likely token for each time step. So we can take the argmax:
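A sketch of that step, continuing from the logits above:

predicted_ids = logits.argmax(dim=-1).squeeze(1)  # most likely token id per time step, shape [seq_len]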
and create a decoder
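For example, a naive greedy CTC decode using the task's target dictionary (a sketch, assuming index 0 is the CTC blank as in fairseq's default setup):

import itertools

tgt_dict = task.target_dictionary
collapsed = [k for k, _ in itertools.groupby(predicted_ids.tolist())]  # collapse repeated ids
tokens = [tgt_dict[i] for i in collapsed if i != 0]                    # drop the blank token
transcription = "".join(tokens).replace("|", " ").strip()              # "|" marks word boundaries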
Now we can decode the output and compare it to the correct output:
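Something along these lines (the reference text ships with the dummy sample):

print("prediction:", transcription)
print("reference :", ds[0]["text"])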
As we can see the prediction is wrong:
The correct transcription is:
Also, from looking at the predicted ids of the model (the argmax of the logits):
It does seem like there is something wrong with the model and not just the dictionary. There is no overwhelmingly present id which could represent silence.
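A quick way to check this (a sketch, reusing predicted_ids from above): in a working CTC model, one id (the blank) should dominate the per-frame argmax.

from collections import Counter

print(Counter(predicted_ids.tolist()).most_common(5))  # the blank id should dominate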
Expected behavior
The model should work correctly here.
Environment
How you installed fairseq (pip, source): pip (as shown in Readme)
Additional context