OpenPecha / tts-model

MIT License

TTS lighter and faster model #1

Open gangagyatso4364 opened 13 hours ago

gangagyatso4364 commented 13 hours ago

Description

The goal is to develop a Tibetan text-to-speech (TTS) model that converts Tibetan text into Tibetan speech. The project involves training a TTS model on clips with good audio quality filtered from existing speech-to-text (STT) data, so that it generates high-quality Tibetan audio efficiently. The model should be lighter and faster than previous models and use fewer resources. The work covers data preprocessing, model selection, fine-tuning, and performance evaluation to ensure that the TTS model meets quality and speed requirements.

Completion Criteria

  1. A trained TTS model capable of generating Tibetan audio from Tibetan text with clear pronunciation and natural prosody.
  2. The model should demonstrate improved performance (speed and resource efficiency) compared to the previous heavy models used for TTS.
  3. Detailed documentation of the model training process, including data preprocessing steps, model configurations, training parameters, and evaluation metrics.
  4. A comparison report showing the model's performance against previous models in terms of speed, size, and audio quality.
  5. A deployed version of the TTS model for real-time testing and demonstrations.
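
One way to make the speed comparison in criteria 2 and 4 concrete is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below is a minimal illustration, not part of the project code. `synthesize` is a hypothetical callable that takes a text string and returns a sequence of audio samples, and the `timer` parameter is an assumption added so the clock can be swapped out for testing.

```python
import time
from statistics import mean


def real_time_factor(synthesize, texts, sample_rate, timer=time.perf_counter):
    """Return the mean real-time factor (RTF) over `texts`.

    RTF = seconds spent synthesizing / seconds of audio produced.
    `synthesize` is a hypothetical callable: text -> sequence of samples.
    """
    factors = []
    for text in texts:
        start = timer()
        samples = synthesize(text)
        elapsed = timer() - start
        duration = len(samples) / sample_rate  # audio length in seconds
        if duration > 0:
            factors.append(elapsed / duration)
    if not factors:
        raise ValueError("no audio was produced for any input text")
    return mean(factors)
```

An RTF below 1 means the model synthesizes faster than real time. Running the same texts through the previous model and the new one on the same hardware gives directly comparable numbers for the comparison report.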

Implementation

  1. Data Preparation:

    • Convert the existing STT data into a format suitable for TTS training, including segmentation, cleaning, and alignment of text and audio pairs.
    • Generate phoneme transcriptions if needed to improve pronunciation quality during synthesis.
  2. Model Selection and Training:

    • Choose a TTS architecture such as FastSpeech 2 or Tacotron 2, or a toolkit such as Mozilla TTS, that supports customization and lightweight models.
    • Fine-tune a pretrained TTS model on the Tibetan dataset, optimizing for both quality and inference speed.
  3. Evaluation and Optimization:

    • Evaluate the model using metrics like Mean Opinion Score (MOS), speed benchmarks, and comparison to existing models.
    • Optimize the model for faster inference, potentially converting it to a format like ONNX or TensorFlow Lite.
  4. Testing and Deployment:

    • Test the model’s performance on various Tibetan texts to ensure it can handle different scenarios.
    • Deploy the model for real-time testing and gather feedback from native Tibetan speakers to fine-tune further.
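
The data preparation step above (filtering STT clips and pairing text with audio) could be sketched as follows. This is only an illustration under stated assumptions: the `Clip` fields, the duration and signal-to-noise thresholds, and the pipe-delimited `id|text` output are all hypothetical choices, not the format of the existing STT data or of any particular TTS toolkit.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class Clip:
    """One STT example (hypothetical schema)."""
    audio_path: str
    text: str
    duration_s: float
    snr_db: float  # assumed to be precomputed by some audio-quality tool


def select_tts_clips(clips, min_s=1.0, max_s=12.0, min_snr_db=20.0):
    """Keep clips that look usable for TTS training.

    Thresholds are illustrative defaults, not tuned values.
    """
    kept = []
    for clip in clips:
        text = " ".join(clip.text.split())  # collapse stray whitespace
        if not text or "|" in text:         # empty, or would break the manifest
            continue
        if not (min_s <= clip.duration_s <= max_s):
            continue
        if clip.snr_db < min_snr_db:
            continue
        kept.append(Clip(clip.audio_path, text, clip.duration_s, clip.snr_db))
    return kept


def to_metadata_lines(clips):
    """Format clips as `id|text` lines, using the file stem as the id."""
    return [f"{PurePath(c.audio_path).stem}|{c.text}" for c in clips]
```

Very short or very long clips and noisy recordings are dropped because they tend to hurt alignment and audio quality. The right cutoffs depend on the dataset and should be checked by listening to samples near each threshold.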

Subtasks

gangagyatso4364 commented 13 hours ago

Here’s a list of TTS models and related components that are candidates for fine-tuning on Tibetan data. They are known for their quality, speed, and ability to handle customization:

1. FastSpeech 2

2. Tacotron 2

3. Mozilla TTS

4. Glow-TTS

5. VITS (Variational Inference TTS)

6. HiFi-GAN (Used as a Vocoder)

Recommendations

gangagyatso4364 commented 13 hours ago

Comparing T5, MMS (Massively Multilingual Speech), and FastSpeech 2 for text-to-speech (TTS) requires understanding their distinct purposes, architectures, and suitability for TTS, especially for a language like Tibetan. Here’s a breakdown of the three models, with T5 and MMS compared against FastSpeech 2:

1. FastSpeech 2

2. T5 (Text-to-Text Transfer Transformer)

3. MMS (Massively Multilingual Speech)

Comparison Summary

Recommendation

For Tibetan TTS, FastSpeech 2 remains the best option if your goal is to achieve high-quality, fast, and lightweight text-to-speech conversion. MMS is a good alternative if you're exploring broader speech capabilities or multilingual support, but it might come with additional computational costs.