facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Getting ValueError while running wav2vec2.0 ASR inference with Wav2Vec 2.0 Base (no finetuning split) model #3202

Closed: tushar-rishav closed this issue 3 years ago

tushar-rishav commented 3 years ago

🐛 Bug

I am running ASR inference with the latest commit of master (da83e2f3) and the Wav2Vec 2.0 Base (no finetuning split) model available in the list of pre-trained models. I am seeing the following error:

To Reproduce

  1. Run cmd

    $ export MODEL=/workspace/fairseq-model/wav2vec2.0/wav2vec_small.pt
    $ export RESULT_DIR=/workspace/fairseq-results/sclite
    $ python3 examples/speech_recognition/infer.py /workspace/fairseq/data/librispeech \
    --task audio_pretraining \
    --nbest 1 \
    --path $MODEL \
    --gen-subset all \
    --results-path $RESULT_DIR \
    --w2l-decoder viterbi \
    --criterion ctc \
    --labels ltr \
    --max-tokens 3200000 \
    --user-dir examples/speech_recognition \
    --post-process letter

    The manifest files

    $ cat /workspace/fairseq/data/librispeech/all*
    T O | F A D E | A W A Y | L I K E | M O R N I N G | B E A U T Y | F R O M | H E R | M O R T A L | D A Y | D O W N | B Y | T H E | R I V E R | O F | A D O N A | H E R | S O F T | V O I C E | I S | H E A R D | A N D | T H U S | H E R | G E N T L E | L A M E N T A T I O N | F A L L S | L I K E | M O R N I N G | D E W |
    /workspace/data/librispeech/908-157963-0000.wav  201920
    TO FADE AWAY LIKE MORNING BEAUTY FROM HER MORTAL DAY DOWN BY THE RIVER OF ADONA HER SOFT VOICE IS HEARD AND THUS HER GENTLE LAMENTATION FALLS LIKE MORNING DEW
  2. See error

INFO:__main__:Namespace(all_gather_list_size=16384, autoregressive=False, azureml_logging=False, batch_size=None, batch_size_valid=None, beam=5, beam_size_token=100, beam_threshold=25.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='ctc', curriculum=0, data='/workspace/fairseq/data/librispeech', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, dump_emissions=None, dump_features=None, empty_cache_freq=0, enable_padding=False, eos=2, eval_wer=False, eval_wer_post_process='letter', eval_wer_tokenizer=None, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='all', heartbeat_timeout=-1, iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, iter_decode_with_beam=1, iter_decode_with_external_reranker=False, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, kenlm_model=None, kspmodel=None, labels='ltr', lenpen=1, lexicon=None, lm_path=None, lm_weight=0.0, load_checkpoint_on_all_dp_ranks=False, load_emissions=None, localsgd_frequency=3, log_format=None, log_interval=100, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sample_size=None, max_tokens=3200000, max_tokens_valid=3200000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, min_sample_size=None, model_overrides='{}', model_parallel_size=1, nbest=1, no_beamable_mm=False, no_early_stop=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_repeat_ngram_size=0, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, normalize=False, nprocs_per_node=1, num_shards=1, num_workers=1, optimizer=None, optimizer_overrides='{}', pad=1, path='/workspace/fairseq-model/wav2vec2.0/wav2vec_small.pt', patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, post_process='letter', prefix_size=0, print_alignment=None, print_step=False, profile=False, quantization_config_path=None, quiet=False, replace_unk=None, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', results_path='/workspace/fairseq-results/sclite', retain_dropout=False, retain_dropout_modules=None, retain_iter_history=False, rnnt_decoding_type='greedy', rnnt_len_penalty=-0.5, sacrebleu=False, sample_rate=16000, sampling=False, sampling_topk=-1, sampling_topp=-1.0, save_dir='checkpoints', save_interval=1, save_interval_updates=0, score_reference=False, scoring='bleu', seed=1, shard_id=0, sil_weight=0.0, skip_invalid_size_inputs_valid_test=False, 
slowmo_algorithm='LocalSGD', slowmo_momentum=None, suppress_crashes=False, task='audio_pretraining', temperature=1.0, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', unit_lm=False, unk=3, unk_weight=-inf, unkpen=0, unnormalized=False, user_dir='examples/speech_recognition', valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, w2l_decoder='viterbi', wandb_project=None, warmup_updates=0, wer_args=None, wer_kenlm_model=None, wer_lexicon=None, wer_lm_weight=2.0, wer_word_score=-1.0, wfstlm=None, word_score=1.0, zero_infinity=False, zero_sharding='none')
INFO:__main__:| decoding with criterion ctc
INFO:__main__:| loading model(s) from /root/host/zxpan/fairseq-model/wav2vec2.0/wav2vec_small.pt
INFO:fairseq.data.audio.raw_audio_dataset:loaded 1, skipped 0 samples
INFO:__main__:| /workspace/fairseq/data/librispeech all 1 examples
Traceback (most recent call last):                                                                                                                                                
  File "examples/speech_recognition/infer.py", line 428, in <module>
    cli_main()
  File "examples/speech_recognition/infer.py", line 424, in cli_main
    main(args)
  File "examples/speech_recognition/infer.py", line 349, in main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/workspace/fairseq/fairseq/tasks/fairseq_task.py", line 454, in inference_step
    models, sample, prefix_tokens=prefix_tokens, constraints=constraints
  File "/workspace/fairseq/examples/speech_recognition/w2l_decoder.py", line 87, in generate
    return self.decode(emissions)
  File "/workspace/fairseq/examples/speech_recognition/w2l_decoder.py", line 118, in decode
    B, T, N = emissions.size()
ValueError: not enough values to unpack (expected 3, got 2)
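
For context, the line that fails simply unpacks the emission tensor as (batch, time, vocab). Below is a standalone illustration of the same ValueError; this is not fairseq code, and the tensor shapes are made up for the example:

    import torch

    # Viterbi decoding expects per-frame emissions of shape (batch, time, vocab),
    # e.g. CTC logits from a fine-tuned model. A 2-D tensor (e.g. raw features
    # from a pretraining-only checkpoint) cannot be unpacked into three values.
    emissions_3d = torch.zeros(1, 499, 32)   # hypothetical (B, T, N) logits
    emissions_2d = torch.zeros(499, 768)     # hypothetical 2-D output

    B, T, N = emissions_3d.size()            # works: B=1, T=499, N=32
    try:
        B, T, N = emissions_2d.size()        # same failure as in the traceback
    except ValueError as e:
        print(e)                             # not enough values to unpack (expected 3, got 2)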

Code sample

NA

Expected behavior

I am expecting to see the inference results.

Environment

Additional context

Please note that the all.tsv file format above is slightly different from the wav2vec2_manifest version. I had to make minor modifications to fairseq/data/audio/raw_audio_dataset.py to support fine-tuning over multiple audio directories: unlike valid subsets, I couldn't find a documented way to provide a comma-separated list of training subsets (please let me know if there is a clean approach :) ). I am reasonably sure the change below did not introduce the bug, because I still see the error after reverting it.

diff --git a/fairseq/data/audio/raw_audio_dataset.py b/fairseq/data/audio/raw_audio_dataset.py
index ac5acd03..d168ca57 100644
--- a/fairseq/data/audio/raw_audio_dataset.py
+++ b/fairseq/data/audio/raw_audio_dataset.py
@@ -157,7 +157,8 @@ class FileAudioDataset(RawAudioDataset):

         skipped = 0
         with open(manifest_path, "r") as f:
-            self.root_dir = f.readline().strip()
+            ## commented by Tushar to support merged IITM + Alphonso dataset
+            #self.root_dir = f.readline().strip()
             for i, line in enumerate(f):
                 items = line.strip().split("\t")
                 assert len(items) == 2, line
@@ -173,7 +174,7 @@ class FileAudioDataset(RawAudioDataset):
     def __getitem__(self, index):
         import soundfile as sf

-        fname = os.path.join(self.root_dir, self.fnames[index])
+        fname = self.fnames[index] #os.path.join(self.root_dir, self.fnames[index])
         wav, curr_sample_rate = sf.read(fname)
         feats = torch.from_numpy(wav).float()
         feats = self.postprocess(feats, curr_sample_rate)
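
In case it helps anyone with the multi-directory question above: an alternative that avoids patching FileAudioDataset is to keep the root-directory header line in the tsv but write absolute paths in the rows, since os.path.join drops its first argument when the second one is absolute. A sketch (the helper name and file names are hypothetical; this is not fairseq code):

    import os

    # Hypothetical helper: write one merged manifest from several audio directories.
    # The first line stays a (dummy) root dir so the unmodified FileAudioDataset can
    # still read it; because each row path is absolute, os.path.join(root_dir, path)
    # returns the row path unchanged.
    def write_merged_manifest(rows, out_path="all.tsv"):
        with open(out_path, "w") as f:
            f.write("/\n")  # dummy root dir, effectively ignored
            for path, n_frames in rows:
                assert os.path.isabs(path), path
                f.write(f"{path}\t{n_frames}\n")

    write_merged_manifest([
        ("/workspace/data/librispeech/908-157963-0000.wav", 201920),
        # ...rows from other audio directories go here
    ])

    # os.path.join("/", "/workspace/data/librispeech/908-157963-0000.wav")
    # == "/workspace/data/librispeech/908-157963-0000.wav"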
ValeryNikiforov commented 3 years ago

I ran into the same error, but I was trying to use my own validation data and loaded Wav2Vec 2.0 Large (no finetuning).

tushar-rishav commented 3 years ago

@ValeryNikiforov Yes, it didn't work with wav2vec2 large (no fine-tuning) either. I think none of the non-finetuned models work with the latest codebase.

alexeib commented 3 years ago

You cannot run inference on "no-finetuning" models. They have no concept of decoding into any sort of vocabulary, since they were trained purely on audio data. You need to either fine-tune them yourself (on some dataset) or use one of the fine-tuned models for inference.
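
If you're unsure which kind of checkpoint you have before pointing infer.py at it, a quick sanity check is to look at the architecture recorded in the checkpoint. A rough sketch; the key names ("cfg"/"args") and model names ("wav2vec2" vs. "wav2vec_ctc") follow the usual fairseq checkpoint layout and model registry, so treat them as assumptions and adapt to what your file actually contains:

    import torch

    # Load the checkpoint on CPU and read the recorded model/architecture name.
    # Pretraining-only wav2vec 2.0 checkpoints are typically "wav2vec2", while
    # fine-tuned ASR checkpoints use the CTC model ("wav2vec_ctc").
    state = torch.load("wav2vec_small.pt", map_location="cpu")

    if state.get("cfg") is not None:
        model_name = state["cfg"]["model"]["_name"]              # newer checkpoints
    elif state.get("args") is not None:
        model_name = getattr(state["args"], "arch", "unknown")   # older checkpoints
    else:
        model_name = "unknown"

    print("model type:", model_name)
    # Only a fine-tuned checkpoint (one that has an output vocabulary) can be
    # decoded by examples/speech_recognition/infer.py.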

Toan-it-mta commented 3 years ago

> You cannot run inference on "no-finetuning" models. They have no concept of decoding into any sort of vocabulary, since they were trained purely on audio data. You need to either fine-tune them yourself (on some dataset) or use one of the fine-tuned models for inference.

Does it still not work if I have fine-tuned the model but am not using a language model?

fayez94 commented 5 months ago

Whenever I run the wav2vec2 model on my corpus, which is code-mixed Bangla and English, I get a KeyError: 0. How can I solve that error?