TTS0005: Create a pipeline for MMS TTS model, and train on tts dataset.

gangagyatso4364 commented 1 month ago

Description

We are going to fine-tune Meta's MMS (Massively Multilingual Speech) TTS model for a Tibetan speaker named Sherab using Sherab’s dataset. The process includes preparing Sherab’s data, uploading it to Hugging Face, fine-tuning the MMS model, and creating a Hugging Face Space to check the performance of the fine-tuned model.

Selected speakers:

Sherab
Dolkar la and Yangchen

Experiment:
A multi-speaker experiment using Sherab, Dolkar la, and Yangchen audio, each with a different speaker ID.

You can test the model on the Hugging Face Space given here:

Completion Criteria

Successfully upload Sherab’s dataset to Hugging Face.
A fine-tuned MMS TTS model for Sherab that converts Tibetan text into high-quality, expressive speech.
Creation of a Hugging Face Space to test and showcase the model’s performance.

Implementation

Data Preparation for Sherab:
Upload Sherab’s Data to Hugging Face:
Fine-Tune the MMS TTS Model:
Create a Hugging Face Space for Model Performance Testing:

Subtasks

1. Data Preparation for Sherab:

[x] Prepare Sherab’s dataset (Tibetan text and audio) in the required format for MMS model input.

2. Upload Sherab’s Data to Hugging Face:

[x] Create a dataset repository for Sherab’s data on Hugging Face.
[x] Upload the processed Sherab dataset to Hugging Face.

3. Fine-Tune the MMS TTS Model:

[x] Set up the fine-tuning environment and configuration for the MMS model.
[x] Fine-tune the MMS TTS model using Sherab’s Tibetan dataset.
[x] Conduct subjective evaluations with native Tibetan speakers for quality assessment.

4. Create Hugging Face Space for Performance Testing:

[x] Create a Hugging Face Space for real-time or near-real-time TTS generation.
[x] Ensure users can test the model by inputting Tibetan text and listening to the generated speech.

gangagyatso4364 commented 1 month ago

The MMS model by Facebook (Meta) can currently be studied here: https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#tts-1

gangagyatso4364 commented 1 month ago

Script for fine-tuning MMS: https://github.com/ylacombe/finetune-hf-vits/blob/main/README.md

gangagyatso4364 commented 1 month ago

Currently facing an issue with the speaker ID in the pipeline when using the Gujarati dataset; the Dalai Lama dataset has a similar problem.

gangagyatso4364 commented 1 month ago

Need to update the TTS data so the dataset contains the actual audio instead of audio URLs, and add a speaker ID for each distinct speaker.
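The two changes above might look like the following sketch. It only builds the metadata rows (local audio path, transcript, integer speaker ID) and leaves out downloading the audio. The input keys `url`, `text`, and `speaker`, the output field names, and the helpers `assign_speaker_ids` and `build_rows` are all hypothetical, not taken from the actual dataset or training script.

```python
from pathlib import Path


def assign_speaker_ids(speakers):
    """Map each distinct speaker name to an integer ID (alphabetical order)."""
    return {name: idx for idx, name in enumerate(sorted(set(speakers)))}


def build_rows(records, audio_dir):
    """Swap audio URLs for local file paths and attach an integer speaker_id.

    `records` is an iterable of dicts with the hypothetical keys
    'url', 'text' and 'speaker'. The audio files are assumed to have
    been downloaded into `audio_dir` already, keeping their file names.
    """
    records = list(records)  # allow generators; we iterate twice
    speaker_to_id = assign_speaker_ids(r["speaker"] for r in records)
    rows = []
    for r in records:
        filename = r["url"].rsplit("/", 1)[-1]
        rows.append({
            "audio": (Path(audio_dir) / filename).as_posix(),
            "text": r["text"],
            "speaker_id": speaker_to_id[r["speaker"]],
        })
    return rows
```

Rows in this shape could then be loaded into a Hugging Face dataset with the `audio` column cast to an audio feature, so that the uploaded dataset carries the waveform itself rather than a link.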

gangagyatso4364 commented 1 month ago

ERROR:


Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/finetune-hf-vits/run_vits_finetuning.py", line 1495, in <module>
    main()
  File "/home/ec2-user/SageMaker/finetune-hf-vits/run_vits_finetuning.py", line 1100, in main
    speaker_id=batch["speaker_id"],
  File "/home/ec2-user/SageMaker/finetune-hf-vits/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 270, in __getitem__
    return self.data[item]
KeyError: 'speaker_id'

gangagyatso4364 commented 1 month ago

Currently the pipeline works on single-speaker data when I remove the speaker_id=batch["speaker_id"] line from the training script, but it does not work for multiple speakers. Next step: setting up the model config for multiple speakers.
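Rather than deleting the line, one option is to pass the speaker ID only when the model is configured for more than one speaker, and to fail with a clearer message when a multi-speaker batch lacks the column. The sketch below is a hypothetical helper, not code from run_vits_finetuning.py; `build_model_kwargs` and its parameters are assumptions, and it only relies on the batch behaving like a mapping.

```python
from collections.abc import Mapping


def build_model_kwargs(batch: Mapping, num_speakers: int) -> dict:
    """Collect forward-pass arguments, adding speaker_id only when it applies."""
    kwargs = {"input_ids": batch["input_ids"]}
    if num_speakers > 1:
        if "speaker_id" not in batch:
            # Same failure as the traceback above, but with a hint about the cause.
            raise KeyError(
                "multi-speaker model but the batch has no 'speaker_id'; "
                "check the dataset column and the data collator"
            )
        kwargs["speaker_id"] = batch["speaker_id"]
    return kwargs
```

With a single-speaker config the speaker ID is never read, which matches the behaviour observed after removing the line. With several speakers, the error points at the data side, where the ID would need to be added to each batch.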

gangagyatso4364 commented 1 month ago

Need to adjust the speaking speed of the audio generated from text.

gangagyatso4364 commented 1 month ago

Summary: beri gyalse from audio book

Detailed Estimation:

Total Training Samples: 74,000
Total Training Audio Duration: 78 hours
Number of Steps per Epoch: 4,625 steps
Total Number of Steps for 50 Epochs: 231,250 steps
Estimated Training Time: ~38.54 hours
Estimated Cost: ~$57.81


gangagyatso4364 commented 1 month ago

For 912,122 training samples, here is the updated detailed estimation:

Detailed Estimation:

Total Training Samples: 912,122
Number of Steps per Epoch: ~57,007.63 steps
Total Number of Steps for 50 Epochs: ~2,850,381 steps
Estimated Training Time: ~475.06 hours
Estimated Cost: ~$712.60
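The figures in both estimates can be reproduced with a few lines of arithmetic. The batch size (16), time per step (0.6 s), and hourly price ($1.50) below are not stated in the thread; they are back-derived from the reported numbers (74,000 / 4,625 = 16, and so on) and should be treated as assumptions.

```python
def estimate_training(num_samples, epochs=50, batch_size=16,
                      seconds_per_step=0.6, dollars_per_hour=1.50):
    """Rough training time and cost estimate.

    Default values are assumptions inferred from the reported figures,
    not measured settings. Steps per epoch is left fractional to match
    the thread; a real run would round a partial last batch up.
    """
    steps_per_epoch = num_samples / batch_size
    total_steps = steps_per_epoch * epochs
    hours = total_steps * seconds_per_step / 3600
    cost = hours * dollars_per_hour
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "hours": hours,
        "cost_usd": cost,
    }


if __name__ == "__main__":
    for n in (74_000, 912_122):
        est = estimate_training(n)
        print(f"{n:>9,} samples: {est['total_steps']:,.0f} steps, "
              f"{est['hours']:.2f} h, ${est['cost_usd']:.2f}")
```

Because time and cost scale linearly with the number of samples, the larger dataset costs roughly 12 times as much as the smaller one under these assumptions.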

gangagyatso4364 commented 1 month ago

I found an issue after experimenting in the Space: my model is not able to generate audio for long text. I need to solve that issue.
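A common workaround for long inputs is to split the text into shorter chunks, synthesize each chunk separately, and concatenate the resulting audio. The sketch below covers only the splitting step and breaks after the Tibetan shad (།, U+0F0D). The function name `split_tibetan_text` and the 200-character default are assumptions, and nothing here confirms that input length is the actual cause of the failure.

```python
SHAD = "\u0f0d"  # Tibetan mark shad, used here as a clause/sentence boundary


def split_tibetan_text(text, max_chars=200):
    """Greedily group shad-terminated segments into chunks of about max_chars.

    A single segment longer than max_chars is kept whole rather than cut
    in the middle of a clause.
    """
    # Cut the text into segments that each end with a shad
    # (the final segment may have no shad).
    segments = []
    start = 0
    for i, ch in enumerate(text):
        if ch == SHAD:
            segments.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        segments.append(text[start:])

    # Pack segments into chunks without exceeding max_chars where possible.
    chunks = []
    current = ""
    for seg in segments:
        if current and len(current) + len(seg) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += seg
    chunks.append(current.strip())
    return [c for c in chunks if c]
```

In the Space, each returned chunk would be passed through the model in turn and the waveforms joined in order, optionally with a short silence between them.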

gangagyatso4364 commented 1 month ago

Train the mms-tts-bod model on Dolkar la and Yangchen under the same speaker ID.

gangagyatso4364 commented 1 month ago

The multi-speaker experiment with different speaker IDs has failed: the model learns from all the data, but it cannot differentiate between the speakers because of a speaker ID issue in model inference.

OpenPecha / tts-model