facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Finetuning in Spanish #203

Open Evening2k opened 1 year ago

Evening2k commented 1 year ago

After downloading a Spanish dataset from the HuggingFace Datasets hub and passing the train and eval manifest .json files as parameters, fine-tuning fails with an error saying that "es" is not a supported language.

!python /content/seamless_communication/scripts/m4t/finetune/finetune.py --train_dataset /content/drive/MyDrive/GPI/Finetuning/train_manifest.json --eval_dataset /content/drive/MyDrive/GPI/hola/validation_manifest.json --save_model_to /content/drive/MyDrive/GPI/Modelo_finetuneado.pth --mode SPEECH_TO_TEXT --model_name seamlessM4T_medium --batch_size 1 --max_epoch 8

ERROR:

Traceback (most recent call last):
  File "/content/seamless_communication/scripts/m4t/finetune/finetune.py", line 183, in <module>
    main()
  File "/content/seamless_communication/scripts/m4t/finetune/finetune.py", line 179, in main
    finetune.run()
  File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/trainer.py", line 350, in run
    self._eval_model()
  File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/trainer.py", line 297, in _eval_model
    for batch in self.eval_data_loader.get_dataloader():
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/dataloader.py", line 176, in _prepare_batch
    text_tokens_list = [
  File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/dataloader.py", line 177, in <listcomp>
    self._get_tokenized_target_text(sample) for sample in samples
  File "/usr/local/lib/python3.10/dist-packages/m4t_scripts/finetune/dataloader.py", line 133, in _get_tokenized_target_text
    ] = self.text_tokenizer.create_encoder(lang=target_lang, mode="target")
  File "/usr/local/lib/python3.10/dist-packages/fairseq2/models/nllb/tokenizer.py", line 94, in create_encoder
    raise ValueError(
ValueError: lang must be a supported language, but is 'es' instead.

JRunner97 commented 11 months ago

es isn't the right language code for Spanish. SeamlessM4T uses three-letter codes, so Spanish is spa. See the README: https://github.com/facebookresearch/seamless_communication/blob/5807362d1414099cbf0a3303f720d8734a052ca6/docs/m4t/README.md

and LANGUAGE_CODE_TO_NAME dict: https://github.com/facebookresearch/seamless_communication/blob/5807362d1414099cbf0a3303f720d8734a052ca6/demo/expressive/utils.py#L1
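
One way to apply this without regenerating the dataset is to rewrite the language codes in the existing manifests. The sketch below is not part of the repository. It assumes the manifest holds one JSON object per line and stores language codes under keys named "lang" (check your own files, since the layout is an assumption here). The helper names fix_lang_codes and fix_manifest and the CODE_FIXES mapping are hypothetical.

```python
import json
from typing import Any

# Assumption: list only the codes you need to correct. "spa" is the
# three-letter code SeamlessM4T uses for Spanish.
CODE_FIXES = {"es": "spa"}


def fix_lang_codes(node: Any, fixes: dict = CODE_FIXES) -> Any:
    """Return a copy of `node` with every string value under a "lang" key remapped."""
    if isinstance(node, dict):
        return {
            key: fixes.get(value, value)
            if key == "lang" and isinstance(value, str)
            else fix_lang_codes(value, fixes)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [fix_lang_codes(item, fixes) for item in node]
    return node


def fix_manifest(src_path: str, dst_path: str) -> int:
    """Rewrite a one-JSON-object-per-line manifest; return the number of samples written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            sample = fix_lang_codes(json.loads(line))
            dst.write(json.dumps(sample, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling fix_manifest on the train and validation manifests and passing the rewritten files to --train_dataset and --eval_dataset should get past the tokenizer check, provided the manifests really follow the assumed layout.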

MuhammadWaqarSahi commented 9 months ago

@Evening2k, can you share the notebook or code you used to fine-tune the speech-to-text task?