OpenPecha / tts-model

MIT License

TTS0005: Create a pipeline for MMS TTS model, and train on tts dataset. #2

Open gangagyatso4364 opened 1 month ago

gangagyatso4364 commented 1 month ago

Description

We are going to fine-tune Meta's MMS (Massively Multilingual Speech) TTS model for a Tibetan speaker named Sherab, using Sherab's dataset. The process includes preparing Sherab's data, uploading it to Hugging Face, fine-tuning the MMS model, and creating a Hugging Face Space to check the performance of the fine-tuned model.

Selected speakers:

  1. Sherab
  2. Dolkar la and Yangchen

Experiment:

  1. A multispeaker experiment using Sherab, Dolkar la, and Yangchen audio, with a different speaker ID for each speaker.

You can test the models in the Hugging Face Spaces listed here:

  1. Sherab MMS TTS:
  2. Dolkar la and Yangchen:

Completion Criteria


Implementation

  1. Data Preparation for Sherab:

  2. Upload Sherab’s Data to Hugging Face:

  3. Fine-Tune the MMS TTS Model:

  4. Create a Hugging Face Space for Model Performance Testing:


Subtasks

1. Data Preparation for Sherab:

2. Upload Sherab’s Data to Hugging Face:

3. Fine-Tune the MMS TTS Model:

4. Create Hugging Face Space for Performance Testing:


gangagyatso4364 commented 1 month ago

The MMS model by Facebook is documented here: https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#tts-1

gangagyatso4364 commented 1 month ago

Script for fine-tuning MMS: https://github.com/ylacombe/finetune-hf-vits/blob/main/README.md

gangagyatso4364 commented 1 month ago

Currently facing an issue with the speaker ID in the pipeline when using the Gujarati dataset; the Dalai Lama dataset shows a similar problem.

gangagyatso4364 commented 1 month ago

Need to update the TTS data so that the dataset contains the actual audio instead of audio URLs, and add a speaker ID for each speaker.
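One way to attach speaker IDs is sketched below. It is only an illustration under assumptions: the row layout (`audio`, `text`, `speaker` keys), the file names, and the helper `add_speaker_ids` are hypothetical and do not come from the actual dataset or the fine-tuning script.

```python
def add_speaker_ids(rows, speaker_key="speaker"):
    """Return (rows with an integer speaker_id, speaker name -> id mapping).

    Hypothetical helper: the row layout is an assumption, not the real dataset schema.
    """
    # Sort the names so each speaker gets the same id on every run.
    names = sorted({row[speaker_key] for row in rows})
    speaker_to_id = {name: idx for idx, name in enumerate(names)}
    # Copy each row and add the integer id next to the speaker name.
    labelled = [{**row, "speaker_id": speaker_to_id[row[speaker_key]]} for row in rows]
    return labelled, speaker_to_id


if __name__ == "__main__":
    # Placeholder rows; the file names are made up for illustration.
    rows = [
        {"audio": "sherab_0001.wav", "text": "...", "speaker": "sherab"},
        {"audio": "dolkar_0001.wav", "text": "...", "speaker": "dolkar"},
        {"audio": "yangchen_0001.wav", "text": "...", "speaker": "yangchen"},
    ]
    labelled, mapping = add_speaker_ids(rows)
    print(mapping)
```

Keeping the name-to-ID mapping alongside the dataset makes it possible to pass the same ID at inference time that was used during training.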

gangagyatso4364 commented 1 month ago

ERROR:


Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/finetune-hf-vits/run_vits_finetuning.py", line 1495, in <module>
    main()
  File "/home/ec2-user/SageMaker/finetune-hf-vits/run_vits_finetuning.py", line 1100, in main
    speaker_id=batch["speaker_id"],
  File "/home/ec2-user/SageMaker/finetune-hf-vits/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 270, in __getitem__
    return self.data[item]
KeyError: 'speaker_id'

gangagyatso4364 commented 1 month ago

Currently the pipeline works on single-speaker data when I remove the `speaker_id=batch["speaker_id"]` line from the model call, but it does not work for multiple speakers. I am now setting up the model config for multiple speakers.
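Instead of deleting the line, the speaker ID could be passed only when the model is configured for more than one speaker. The sketch below shows that idea; `build_model_inputs` is a hypothetical helper and the batch keys other than `speaker_id` are assumptions, so this is not the code in `run_vits_finetuning.py`.

```python
def build_model_inputs(batch, num_speakers):
    """Collect keyword arguments for the model call.

    Hypothetical helper: passes speaker_id only for multi-speaker models.
    """
    inputs = {
        "input_ids": batch["input_ids"],
        "attention_mask": batch["attention_mask"],
    }
    if num_speakers > 1:
        # A multi-speaker model needs an id for every example, so fail early
        # with a clear message instead of a bare KeyError deep in the loop.
        if "speaker_id" not in batch:
            raise ValueError(
                "num_speakers > 1 but the batch has no 'speaker_id' column; "
                "check that the dataset and the collator keep it"
            )
        inputs["speaker_id"] = batch["speaker_id"]
    return inputs


if __name__ == "__main__":
    single = build_model_inputs({"input_ids": [1, 2], "attention_mask": [1, 1]}, num_speakers=1)
    print(sorted(single))
```

With this guard, a single-speaker run never touches `speaker_id`, while a multi-speaker run reports a missing column up front.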

gangagyatso4364 commented 1 month ago

Need to adjust the speaking speed of the audio generated from text.

gangagyatso4364 commented 1 month ago

Summary: beri gyalse from audio book

Detailed Estimation:

Let me know if you need further clarification or adjustments!

gangagyatso4364 commented 1 month ago

For 912,122 training samples, here is the updated detailed estimation:

Summary:

Detailed Estimation:

gangagyatso4364 commented 1 month ago

After experimenting in the Space, I found an issue: the model is not able to generate audio for long text. I need to solve that.
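A possible workaround, sketched here under assumptions, is to split long input at the Tibetan shad (།, U+0F0D), synthesize each chunk separately, and join the resulting audio. Only the splitting step is shown; `split_into_chunks` and the `max_chars` limit are hypothetical and have not been tuned against the model.

```python
import re

SHAD = "\u0f0d"  # Tibetan shad, used here as the clause boundary


def split_into_chunks(text, max_chars=200):
    """Group shad-terminated clauses into chunks of at most max_chars characters.

    A single clause longer than max_chars is kept whole as its own chunk.
    """
    # Split after a shad, but not between two consecutive shads.
    pattern = f"(?<={SHAD})(?!{SHAD})"
    clauses = [c.strip() for c in re.split(pattern, text) if c.strip()]

    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}" if current else clause
        if current and len(candidate) > max_chars:
            # Adding this clause would exceed the limit: close the chunk.
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = f"a{SHAD} b{SHAD} c{SHAD}"  # placeholder text, not real Tibetan
    for chunk in split_into_chunks(sample, max_chars=5):
        print(chunk)
```

Each chunk can then be sent to the model on its own, and the generated waveforms concatenated in order.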

gangagyatso4364 commented 1 month ago

Train the mms-tts-bod model on the Dolkar la and Yangchen data under the same speaker ID.

gangagyatso4364 commented 1 month ago

The multispeaker experiment with different speaker IDs has failed: the model learns from all the data, but it is not able to differentiate between the speakers because of a speaker ID issue in model inference.