Open Evening2k opened 1 year ago
"es" isn't the right language code for Spanish; SeamlessM4T expects three-letter codes such as "spa". See the README: https://github.com/facebookresearch/seamless_communication/blob/5807362d1414099cbf0a3303f720d8734a052ca6/docs/m4t/README.md
and the LANGUAGE_CODE_TO_NAME dict: https://github.com/facebookresearch/seamless_communication/blob/5807362d1414099cbf0a3303f720d8734a052ca6/demo/expressive/utils.py#L1
@Evening2k, can you share the notebook or code you use to fine-tune the speech-to-text task?
When downloading a Spanish dataset from the HuggingFace Datasets hub and then passing the .json manifests of the train and eval datasets as parameters, there is an error saying that "es" is not in the list of supported languages.
!python /content/seamless_communication/scripts/m4t/finetune/finetune.py --train_dataset /content/drive/MyDrive/GPI/Finetuning/train_manifest.json --eval_dataset /content/drive/MyDrive/GPI/hola/validation_manifest.json --save_model_to /content/drive/MyDrive/GPI/Modelo_finetuneado.pth --mode SPEECH_TO_TEXT --model_name seamlessM4T_medium --batch_size 1 --max_epoch 8
ERROR:
Traceback (most recent call last):
File "/content/seamless_communication/scripts/m4t/finetune/finetune.py", line 183, in <module>
main()
File "/content/seamless_communication/scripts/m4t/finetune/finetune.py", line 179, in main
finetune.run()
File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/trainer.py", line 350, in run
self._eval_model()
File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/trainer.py", line 297, in _eval_model
for batch in self.eval_data_loader.get_dataloader():
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/dataloader.py", line 176, in _prepare_batch
text_tokens_list = [
File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/dataloader.py", line 177, in <listcomp>
self._get_tokenized_target_text(sample) for sample in samples
File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/dataloader.py", line 133, in _get_tokenized_target_text
] = self.text_tokenizer.create_encoder(lang=target_lang, mode="target")
File "/usr/local/lib/python3.10/dist-packages/fairseq2/models/nllb/tokenizer.py", line 94, in create_encoder
raise ValueError(
ValueError: lang must be a supported language, but is 'es' instead.
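The final ValueError is raised when the tokenizer receives the two-letter code "es" from the manifest. One possible workaround, sketched here under assumptions, is to rewrite the manifests before fine-tuning so that every language field carries the three-letter code. The sketch assumes the manifest is a JSON-lines file (one JSON object per line) and that the language is stored under a key named "lang"; neither is confirmed in this thread, so check your own files first. The helper names fix_lang_codes and rewrite_manifest are hypothetical.

```python
import json

# Assumed mapping from two-letter codes to the three-letter codes the
# tokenizer accepts. Only the "es" case comes from this thread; extend it
# for other languages after checking the README's language list.
LANG_FIXES = {"es": "spa"}


def fix_lang_codes(node, fixes=LANG_FIXES):
    """Return a copy of `node` in which every "lang" value found in `fixes` is replaced."""
    if isinstance(node, dict):
        fixed = {}
        for key, value in node.items():
            if key == "lang" and isinstance(value, str):
                # Unknown codes are passed through unchanged.
                fixed[key] = fixes.get(value, value)
            else:
                fixed[key] = fix_lang_codes(value, fixes)
        return fixed
    if isinstance(node, list):
        return [fix_lang_codes(item, fixes) for item in node]
    return node


def rewrite_manifest(src, dst, fixes=LANG_FIXES):
    """Read a JSON-lines manifest from `src` and write a corrected copy to `dst`."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = fix_lang_codes(json.loads(line), fixes)
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling rewrite_manifest("train_manifest.json", "train_manifest_spa.json") (placeholder paths) and doing the same for the validation manifest would produce copies to pass to --train_dataset and --eval_dataset. Because the walk is recursive, it does not depend on where in each record the "lang" key sits.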