atebbifakhr opened this issue 2 years ago
There is an import loop in the code that causes this issue. Importing the task here:

`importlib.import_module('examples.speech_text_joint_to_text.tasks.speech_text_joint')`

first runs the package's `__init__` file, here, which triggers the same import a second time:

`importlib.import_module('examples.speech_text_joint_to_text.tasks.speech_text_joint')`

In my case, this causes a duplicate entry in the task registry. I was able to work around it by commenting out this line. The same thing happens for the model registry, so you need to comment out this line as well.
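The failure mode can be sketched with a minimal decorator-based registry, modeled on (but not identical to) fairseq's `register_task` pattern; all names here are illustrative. If the defining module ends up being executed twice (for example, imported once by the package's `__init__` and once more explicitly), the decorator runs twice and the second registration fails:

```python
# Minimal sketch of a decorator-based registry (hypothetical names).
TASK_REGISTRY = {}

def register_task(name):
    def wrapper(cls):
        if name in TASK_REGISTRY:
            # This is the duplicate-registration error the double import triggers.
            raise ValueError(f"Cannot register duplicate task ({name})")
        TASK_REGISTRY[name] = cls
        return cls
    return wrapper

@register_task("speech_text_joint")
class SpeechTextJointTask:
    pass

# A second execution of the defining module re-runs the decorator:
try:
    register_task("speech_text_joint")(SpeechTextJointTask)
except ValueError as e:
    print(e)  # Cannot register duplicate task (speech_text_joint)
```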
After solving the previous issue, this is the next one I am encountering:
```
| src dictionary: 64007 types | tgt dictionary: 64007 types
2021-09-16 17:44:17 | INFO | fairseq_cli.generate | loading model(s) from ${MANIFEST_ROOT}/checkpoint17.pt
Traceback (most recent call last):
  File "./fairseq/fairseq_cli/generate.py", line 414, in <module>
    cli_main()
  File "./fairseq/fairseq_cli/generate.py", line 410, in cli_main
    main(args)
  File "./fairseq/fairseq_cli/generate.py", line 47, in main
    return _main(cfg, h)
  File "./fairseq/fairseq_cli/generate.py", line 102, in _main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "./fairseq/fairseq/checkpoint_utils.py", line 370, in load_model_ensemble
    state,
  File "./fairseq/fairseq/checkpoint_utils.py", line 457, in load_model_ensemble_and_task
    model = task.build_model(cfg.model)
  File "./fairseq/fairseq/tasks/speech_to_text.py", line 122, in build_model
    return super(SpeechToTextTask, self).build_model(args)
  File "./fairseq/fairseq/tasks/fairseq_task.py", line 651, in build_model
    model = models.build_model(args, self)
  File "./fairseq/fairseq/models/__init__.py", line 107, in build_model
    return model.build_model(cfg, task)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 528, in build_model
    encoder = cls.build_encoder(args, task)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 414, in build_encoder
    component=text_encoder, checkpoint=args.load_pretrained_mbart_from
  File "./fairseq/fairseq/checkpoint_utils.py", line 778, in load_pretrained_component_from_model
    raise IOError("Model file not found: {}".format(checkpoint))
OSError: Model file not found: /checkpoint/juancarabina/exps/speech_translation/iwslt_2021/checkpoints/mbart_iwslt_12l_finetune_many2many_large_yun_fix.400000updates.SMPL_temperature.mbart_large.TM_VEPOCH.ls0.1.1workers.adam.inv.lr0.001.warmup8000.initlr1e-07.dr0.3.atdr0.1.actdr0.0.wd0.0.maxtok1024.uf4.seed222.entsrc.det.temp1.5.lnemb.ngpu32/checkpoint_best.pt
```
It seems it is looking for an mBART checkpoint that is not available. I should mention that I just want to load the shared checkpoint to generate the translations; there is no need to initialize the model with mBART.
This problem happens because the checkpoint stores a path that needs to be overridden. To do so, you can add this option to the script:
```
--model-overrides "{'load_pretrained_mbart_from':'/path/to/mbart.pt'}"
```
You can download the pre-trained mBART model from here.
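The value passed to `--model-overrides` is parsed as a Python dict literal and merged into the config stored in the checkpoint. The effect can be sketched as follows (the `saved_cfg` contents and both paths are placeholders, not the actual checkpoint config):

```python
import ast

# The --model-overrides argument is a Python dict literal:
override_str = "{'load_pretrained_mbart_from':'/path/to/mbart.pt'}"
overrides = ast.literal_eval(override_str)

# Sketch of a checkpoint config before the override is applied
# (illustrative keys and values only):
saved_cfg = {
    "load_pretrained_mbart_from": "/checkpoint/unavailable/checkpoint_best.pt",
    "dropout": 0.3,
}

# Override values win; everything else in the saved config is kept.
saved_cfg.update(overrides)
print(saved_cfg["load_pretrained_mbart_from"])  # /path/to/mbart.pt
```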
After solving those issues, this is the next error I am receiving:
```
| src dictionary: 64007 types | tgt dictionary: 64007 types
2021-09-17 20:54:50 | INFO | fairseq_cli.generate | loading model(s) from ${MANIFEST_ROOT}/checkpoint17.pt
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | src-pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | tokenizer: {'bpe': None}
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | 'train' has 0.00% OOV
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | SpeechToTextJointDataset(split="train", n_samples=498, prepend_tgt_lang_tag=False, shuffle=False, transforms=CompositeAudioFeatureTransform( UtteranceCMVN(norm_means=True, norm_vars=True) SpecAugmentTransform(time_warp_w=0, freq_mask_n=2, freq_mask_f=27, time_mask_n=2, time_mask_t=100, time_mask_p=1.0) ), n_frames_per_step=1
0%| | 0/3 [00:00<?, ?it/s]
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
Traceback (most recent call last):
  File "./fairseq/fairseq_cli/generate.py", line 413, in <module>
    cli_main()
  File "./fairseq/fairseq_cli/generate.py", line 409, in cli_main
    main(args)
  File "./fairseq/fairseq_cli/generate.py", line 47, in main
    return _main(cfg, h)
  File "./fairseq/fairseq_cli/generate.py", line 205, in _main
    constraints=constraints,
  File "./fairseq/examples/speech_text_joint_to_text/tasks/speech_text_joint_to_text.py", line 220, in inference_step
    bos_token=self._infer_tgt_lang_id,
  File "./venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 187, in generate
    return self._generate(sample, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 254, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "./fairseq/fairseq/sequence_generator.py", line 760, in forward_encoder
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/sequence_generator.py", line 760, in <listcomp>
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputtransformer.py", line 320, in forward
    src_tokens, src_lengths, return_all_hiddens=return_all_hiddens
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 149, in forward
    out = self.w2v_encoder.forward(src_tokens, padding_mask, tbc=True)
  File "./fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 400, in forward
    res = self.w2v_model.extract_features(**w2v_args)
  File "./fairseq/fairseq/models/wav2vec/wav2vec2.py", line 701, in extract_features
    source, padding_mask, mask=mask, features_only=True, layer=layer
  File "./fairseq/fairseq/models/wav2vec/wav2vec2.py", line 528, in forward
    features = self.feature_extractor(source)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./fairseq/fairseq/models/wav2vec/wav2vec2.py", line 812, in forward
    x = conv(x)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 298, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 295, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected 3-dimensional input for 3-dimensional weight [512, 1, 10], but got 4-dimensional input of size [456, 1, 1732, 80] instead
```
This problem happens because of an issue in the preprocessing step: the wav2vec encoder receives filterbank features instead of the raw audio it expects. The new preprocessing script in the latest commit resolves this issue.
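The mismatch can be caught early with a simple input-shape check before the encoder is called. A sketch using NumPy arrays as stand-ins (the helper and all sizes are illustrative, not fairseq code): wav2vec's feature extractor is a stack of 1-D convolutions, so it expects raw audio with at most 3 dimensions, while the 4-D `[456, 1, 1732, 80]` tensor in the error above is filterbank-shaped:

```python
import numpy as np

def check_encoder_input(x: np.ndarray) -> None:
    # Hypothetical guard: wav2vec's conv feature extractor wants raw audio
    # shaped (batch, time) or (batch, 1, time), never a 4-D filterbank
    # tensor shaped (batch, 1, frames, mel_bins).
    if x.ndim == 4:
        raise ValueError(
            f"Got 4-D input {x.shape}; the wav2vec encoder expects raw audio, "
            "not filterbank features -- check the preprocessing step."
        )

raw_audio = np.zeros((456, 16000))         # (batch, samples): accepted
fbank_like = np.zeros((456, 1, 1732, 80))  # the shape from the error above

check_encoder_input(raw_audio)             # passes silently
# check_encoder_input(fbank_like)          # would raise ValueError
```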
The next problem that I am encountering now is this one:
```
| src dictionary: 64007 types | tgt dictionary: 64007 types
2021-09-17 20:54:50 | INFO | fairseq_cli.generate | loading model(s) from ${MANIFEST_ROOT}/checkpoint17.pt
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | src-pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:19 | INFO | speech_text_joint_to_text.tasks.speech_text_joint_to_text | tokenizer: {'bpe': None}
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | 'train' has 0.00% OOV
2021-09-17 20:55:20 | INFO | fairseq.data.audio.speech_to_text_dataset | SpeechToTextJointDataset(split="train", n_samples=498, prepend_tgt_lang_tag=False, shuffle=False, transforms=CompositeAudioFeatureTransform( UtteranceCMVN(norm_means=True, norm_vars=True) SpecAugmentTransform(time_warp_w=0, freq_mask_n=2, freq_mask_f=27, time_mask_n=2, time_mask_t=100, time_mask_p=1.0) ), n_frames_per_step=1
0%| | 0/3 [00:00<?, ?it/s]
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | pre-tokenizer: {'tokenizer': None}
2021-09-17 20:55:22 | INFO | fairseq.tasks.speech_to_text | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': '${MANIFEST_ROOT}/spm.model'}
Traceback (most recent call last):
  File "./fairseq/fairseq_cli/generate.py", line 414, in <module>
    cli_main()
  File "./fairseq/fairseq_cli/generate.py", line 410, in cli_main
    main(args)
  File "./fairseq/fairseq_cli/generate.py", line 47, in main
    return _main(cfg, h)
  File "./fairseq/fairseq_cli/generate.py", line 206, in _main
    constraints=constraints,
  File "./fairseq/examples/speech_text_joint_to_text/tasks/speech_text_joint.py", line 220, in inference_step
    bos_token=self._infer_tgt_lang_id,
  File "./venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 187, in generate
    return self._generate(sample, **kwargs)
  File "./fairseq/fairseq/sequence_generator.py", line 254, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "./fairseq/fairseq/sequence_generator.py", line 760, in forward_encoder
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/sequence_generator.py", line 760, in <listcomp>
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "./fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputtransformer.py", line 320, in forward
    src_tokens, src_lengths, return_all_hiddens=return_all_hiddens
  File "./venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "./fairseq/examples/speech_text_joint_to_text/models/s2t_dualinputxmtransformer.py", line 152, in forward
    if out["encoder_padding_mask"] is not None:
KeyError: 'encoder_padding_mask'
```
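The KeyError comes from indexing the encoder output dict directly at line 152 of s2t_dualinputxmtransformer.py. A defensive pattern (a sketch of the general Python idiom, not the upstream fix) is to read the key with `dict.get`, which treats a missing key the same as an explicit `None`:

```python
# Sketch: depending on the code path, the encoder output dict may or may
# not contain "encoder_padding_mask". The dict below is illustrative.
out = {"encoder_out": "..."}  # no "encoder_padding_mask" key

# Direct indexing raises KeyError:
#     out["encoder_padding_mask"]
# dict.get() returns None for a missing key instead:
padding_mask = out.get("encoder_padding_mask")
if padding_mask is not None:
    print("apply padding mask")
else:
    print("no padding mask")  # this branch runs here
```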
🐛 Bug
I am trying to use the model that you shared here to generate translations for my speech data, but I am getting this error:
To Reproduce
I am using the same script that is shared here for evaluation.