facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Generating translation using the 2021 IWSLT multilingual speech translation model #3868

Open atebbifakhr opened 2 years ago

atebbifakhr commented 2 years ago

šŸ› Bug

I am trying to use the model that you shared here to generate translations for my own speech data, but I am getting this error:

Traceback (most recent call last):
  File "./fairseq/fairseq_cli/generate.py", line 414, in <module>
    cli_main()
  File "./fairseq/fairseq_cli/generate.py", line 402, in cli_main
    parser = options.get_generation_parser()
  File "./fairseq/fairseq/options.py", line 49, in get_generation_parser
    parser = get_parser("Generation", default_task)
  File "./fairseq/fairseq/options.py", line 219, in get_parser
    utils.import_user_module(usr_args)
  File "./fairseq/fairseq/utils.py", line 489, in import_user_module
    import_tasks(tasks_path, f"{module_name}.tasks")
  File "./fairseq/fairseq/tasks/__init__.py", line 117, in import_tasks
    importlib.import_module(namespace + "." + task_name)
  File "/home/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "./fairseq/examples/speech_text_joint_to_text/tasks/speech_text_joint.py", line 38, in <module>
    class SpeechTextJointToTextTask(SpeechToTextTask):
  File "./fairseq/fairseq/tasks/__init__.py", line 71, in register_task_cls
    raise ValueError("Cannot register duplicate task ({})".format(name))
ValueError: Cannot register duplicate task (speech_text_joint_to_text)

To Reproduce

I am using the same script that is shared here for evaluation.

python ./fairseq/fairseq_cli/generate.py \
   ${MANIFEST_ROOT} \
   --task speech_text_joint_to_text \
   --user-dir ./fairseq/examples/speech_text_joint_to_text \
   --load-speech-only  --gen-subset  test_es_en_tedx \
   --path  ${model}  \
   --max-source-positions 800000 \
   --skip-invalid-size-inputs-valid-test \
   --config-yaml config.yaml \
   --infer-target-lang en  \
   --max-tokens 800000 \
   --beam 5 \
   --results-path ${RESULTS_DIR}  \
   --scoring sacrebleu
atebbifakhr commented 2 years ago

There is an import loop in the code that causes this issue:

Here, importing the task:

importlib.import_module('examples.speech_text_joint_to_text.tasks.speech_text_joint')

first runs the package's __init__.py (here), which then executes the same import a second time:

importlib.import_module('examples.speech_text_joint_to_text.tasks.speech_text_joint')

This causes the duplicate entry in the task registry in my case.

I could work around it by commenting out this line.

The same thing happens for the model registry, so you need to comment out this line too.
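
For context, here is a minimal, self-contained sketch of the mechanism (not fairseq's actual code): a registry decorator that rejects duplicate names raises exactly this error when the same module body is executed twice, which is what the import loop above does.

TASK_REGISTRY = {}

def register_task(name):
    # Mirror the registry pattern: refuse to register the same name twice.
    def wrapper(cls):
        if name in TASK_REGISTRY:
            raise ValueError("Cannot register duplicate task ({})".format(name))
        TASK_REGISTRY[name] = cls
        return cls
    return wrapper

@register_task("speech_text_joint_to_text")
class SpeechTextJointToTextTask:
    pass

# Simulate the module body running a second time, as the import loop does:
try:
    register_task("speech_text_joint_to_text")(SpeechTextJointToTextTask)
except ValueError as e:
    print(e)  # Cannot register duplicate task (speech_text_joint_to_text)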

atebbifakhr commented 2 years ago

After solving the previous issue, this is the next one I am encountering:

| src dictionary: 64007 types
| tgt dictionary: 64007 types
2021-09-16 17:44:17 | INFO | fairseq_cli.generate | loading model(s) from ${MANIFEST_ROOT}/checkpoint17.pt
Traceback (most recent call last):
  File "./fairseq/fairseq_cli/generate.py", line 414, in <module>
    cli_main()
  File "./fairseq/fairseq_cli/generate.py", line 410, in cli_main
    main(args)
  File "./fairseq/fairseq_cli/generate.py", line 47, in main
    return _main(cfg, h)
  File "./fairseq/fairseq_cli/generate.py", line 102, in _main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "./fairseq/fairseq/checkpoint_utils.py", line 370, in load_model_ensemble
    state,
  File "./fairseq/fairseq/checkpoint_utils.py", line 457, in load_model_ensemble_and_task
    model = task.build_model(cfg.model)
  File "./fairseq/fairseq/tasks/speech_to_text.py", line 122, in build_model
    return super(SpeechToTextTask, self).build_model(args)
  File "./fairseq/fairseq/tasks/fairseq_task.py", line 651, in build_model
    model = models.build_model(args, self)
  File "./fairseq/fairseq/models/__init__.py", line 107, in build_model
    return model.build_model(cfg, task)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 528, in build_model
    encoder = cls.build_encoder(args, task)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 414, in build_encoder
    component=text_encoder, checkpoint=args.load_pretrained_mbart_from
  File "./fairseq/fairseq/checkpoint_utils.py", line 778, in load_pretrained_component_from_model
    raise IOError("Model file not found: {}".format(checkpoint))
OSError: Model file not found: /checkpoint/juancarabina/exps/speech_translation/iwslt_2021/checkpoints/mbart_iwslt_12l_finetune_many2many_large_yun_fix.400000updates.SMPL_temperature.mbart_large.TM_VEPOCH.ls0.1.1workers.adam.inv.lr0.001.warmup8000.initlr1e-07.dr0.3.atdr0.1.actdr0.0.wd0.0.maxtok1024.uf4.seed222.entsrc.det.temp1.5.lnemb.ngpu32/checkpoint_best.pt

It seems it is looking for an mBART checkpoint, which is not available. I should mention that I just want to load the shared checkpoint to generate translations, so there is no need to initialize the model with mBART.

atebbifakhr commented 2 years ago

This problem happens because a path stored inside the checkpoint needs to be overridden. To do so, you can add this option to the script:

--model-overrides "{'load_pretrained_mbart_from':'/path/to/mbart.pt'}"

You can download the pre-trained mBART model from here.
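
For reference, the CLI flag is parsed into a dict and forwarded as arg_overrides when the checkpoint is loaded, so the stale mBART path stored in the checkpoint args gets replaced before the model is built. A minimal sketch of the programmatic equivalent (both paths below are placeholders):

from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/checkpoint17.pt"],  # placeholder checkpoint path
    arg_overrides={"load_pretrained_mbart_from": "/path/to/mbart.pt"},
)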

atebbifakhr commented 2 years ago

After solving those issues, this is the next error I am receiving:

| src dictionary: 64007 types
| tgt dictionary: 64007 types
2021-09-17 20:54:50 | INFO | fairseq_cli.generate | loading model(s) from ${MANIFEST_ROOT}/checkpoint17.pt
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | src-pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | tokenizer: {'bpe': None}
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | 'train' has 0.00% OOV
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | SpeechToTextJointDataset(split="train", n_samples=498, prepend_tgt_lang_tag=False, shuffle=False, transforms=CompositeAudioFeatureTransform(
    UtteranceCMVN(norm_means=True, norm_vars=True)
    SpecAugmentTransform(time_warp_w=0, freq_mask_n=2, freq_mask_f=27, time_mask_n=2, time_mask_t=100, time_mask_p=1.0)
), n_frames_per_step=1)
  0%|          | 0/3 [00:00<?, ?it/s]
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
Traceback (most recent call last):
  File "./fairseq/fairseq_cli/generate.py", line 413, in <module>
    cli_main()
  File "./fairseq/fairseq_cli/generate.py", line 409, in cli_main
    main(args)
  File "./fairseq/fairseq_cli/generate.py", line 47, in main
    return _main(cfg, h)
  File "./fairseq/fairseq_cli/generate.py", line 205, in _main
    constraints=constraints,
  File "./fairseq/examples/speech_text_joint_to_text/tasks/speech_text_joint_to_text.py", line 220, in inference_step
    bos_token=self._infer_tgt_lang_id,
  File "./venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 187, in generate
    return self._generate(sample, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 254, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "./fairseq/fairseq/sequence_generator.py", line 760, in forward_encoder
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/sequence_generator.py", line 760, in <listcomp>
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputtransformer.py", line 320, in forward
    src_tokens, src_lengths, return_all_hiddens=return_all_hiddens
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 149, in forward
    out = self.w2v_encoder.forward(src_tokens, padding_mask, tbc=True)
  File "./fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 400, in forward
    res = self.w2v_model.extract_features(**w2v_args)
  File "./fairseq/fairseq/models/wav2vec/wav2vec2.py", line 701, in extract_features
    source, padding_mask, mask=mask, features_only=True, layer=layer
  File "./fairseq/fairseq/models/wav2vec/wav2vec2.py", line 528, in forward
    features = self.feature_extractor(source)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./fairseq/fairseq/models/wav2vec/wav2vec2.py", line 812, in forward
    x = conv(x)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 298, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 295, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected 3-dimensional input for 3-dimensional weight [512, 1, 10], but got 4-dimensional input of size [456, 1, 1732, 80] instead

atebbifakhr commented 2 years ago

This problem is caused by an issue in the preprocessing step: the wav2vec encoder receives filterbank features instead of the raw audio it expects. With the new preprocessing script in the latest commit, this issue has been resolved.
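
A minimal sketch of the shape mismatch behind the RuntimeError above: wav2vec 2.0's feature extractor starts with a Conv1d over the raw waveform, so it expects a (batch, 1, samples) tensor, while 80-dim log-mel filterbanks carry an extra feature axis and arrive as a 4-D tensor (the exact error message varies with the torch version).

import torch

# First conv of a wav2vec2-style feature extractor: weight shape (512, 1, 10).
conv = torch.nn.Conv1d(in_channels=1, out_channels=512, kernel_size=10, stride=5)

waveform = torch.randn(4, 1, 16000)   # raw audio: (batch, 1, samples) -> OK
print(conv(waveform).shape)

fbank = torch.randn(4, 1, 1732, 80)   # filterbanks with a channel dim: 4-D
try:
    conv(fbank)                       # 4-D input into a 3-D weight
except RuntimeError as e:
    print(e)  # e.g. "Expected 3-dimensional input for 3-dimensional weight [512, 1, 10], ..."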

atebbifakhr commented 2 years ago

The next problem I am encountering is this one:

| src dictionary: 64007 types
| tgt dictionary: 64007 types
2021-09-17 20:54:50 | INFO | fairseq_cli.generate | loading model(s) from ${MANIFEST_ROOT}/checkpoint17.pt
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | src-pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | tokenizer: {'bpe': None}
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | 'train' has 0.00% OOV
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | SpeechToTextJointDataset(split="train", n_samples=498, prepend_tgt_lang_tag=False, shuffle=False, transforms=CompositeAudioFeatureTransform(
    UtteranceCMVN(norm_means=True, norm_vars=True)
    SpecAugmentTransform(time_warp_w=0, freq_mask_n=2, freq_mask_f=27, time_mask_n=2, time_mask_t=100, time_mask_p=1.0)
), n_frames_per_step=1)
  0%|          | 0/3 [00:00<?, ?it/s]
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
Traceback (most recent call last):
  File "./fairseq/fairseq_cli/generate.py", line 414, in <module>
    cli_main()
  File "./fairseq/fairseq_cli/generate.py", line 410, in cli_main
    main(args)
  File "./fairseq/fairseq_cli/generate.py", line 47, in main
    return _main(cfg, h)
  File "./fairseq/fairseq_cli/generate.py", line 206, in _main
    constraints=constraints,
  File "./fairseq/examples/speech_text_joint_to_text/tasks/speech_text_joint.py", line 220, in inference_step
    bos_token=self._infer_tgt_lang_id,
  File "./venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 187, in generate
    return self._generate(sample, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 254, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "./fairseq/fairseq/sequence_generator.py", line 760, in forward_encoder
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/sequence_generator.py", line 760, in <listcomp>
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputtransformer.py", line 320, in forward
    src_tokens, src_lengths, return_all_hiddens=return_all_hiddens
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 152, in forward
    if out["encoder_padding_mask"] is not None:
KeyError: 'encoder_padding_mask'