Lighter and faster TTS model

Description

The goal is to develop a Tibetan text-to-speech (TTS) model that converts Tibetan text into Tibetan speech. The model will be trained on recordings filtered for good audio quality from the existing speech-to-text (STT) data and adapted to generate high-quality Tibetan audio efficiently. It should be lighter, faster, and more resource-efficient than the previous models. The work covers data preprocessing, model selection, fine-tuning, and performance evaluation to confirm that the model meets the quality and speed requirements.

Completion Criteria

A trained TTS model capable of generating Tibetan audio from Tibetan text with clear pronunciation and natural prosody.
The model should be measurably faster and more resource-efficient (for example in inference time, model size, and memory use) than the previous, heavier TTS models.
Detailed documentation of the model training process, including data preprocessing steps, model configurations, training parameters, and evaluation metrics.
A comparison report showing the model's performance against previous models in terms of speed, size, and audio quality.
A deployed version of the TTS model for real-time testing and demonstrations.
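
For the audio-quality part of the comparison report, listener ratings are usually summarized as a Mean Opinion Score (MOS) with a confidence interval. The following is a minimal sketch under stated assumptions: ratings are on a 1 to 5 scale, the interval uses a normal approximation, and the function name and sample ratings are hypothetical placeholders, not measured results.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 listener ratings.

    Uses a normal approximation for a roughly 95% interval; with only a
    few raters a t-distribution would be the more careful choice.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * stderr


# Hypothetical ratings for illustration only (not real evaluation data).
new_model = [4, 4, 5, 3, 4, 4, 5, 4]
old_model = [4, 5, 4, 4, 3, 4, 4, 4]
for name, scores in (("new", new_model), ("old", old_model)):
    mean, half = mos_with_ci(scores)
    print(f"{name}: MOS {mean:.2f} +/- {half:.2f} (n={len(scores)})")
```

If the two intervals overlap heavily, the report should say that the quality difference is not established rather than declare a winner.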

Implementation

Data Preparation:
- Convert the existing STT data into a format suitable for TTS training, including segmentation, cleaning, and alignment of text and audio pairs.
- Generate phoneme transcriptions if needed to improve pronunciation quality during synthesis.
Model Selection and Training:
- Choose a TTS architecture such as FastSpeech 2, Tacotron 2, or Glow-TTS that supports customization and lightweight models; a toolkit such as Mozilla TTS can supply the implementation.
- Fine-tune a pretrained TTS model on the Tibetan dataset, optimizing for both quality and inference speed.
Evaluation and Optimization:
- Evaluate the model using metrics like Mean Opinion Score (MOS), speed benchmarks, and comparison to existing models.
- Optimize the model for faster inference, potentially converting it to a format like ONNX or TensorFlow Lite.
Testing and Deployment:
- Test the model’s performance on various Tibetan texts to ensure it can handle different scenarios.
- Deploy the model for real-time testing and gather feedback from native Tibetan speakers to fine-tune further.
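
The data preparation step above can be sketched as a simple filter over the STT pairs. This is only an illustration: the `Clip` fields, the thresholds, and the pipe-delimited `path|text` manifest layout are assumptions, and measuring audio quality itself (for example noise level) is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    audio_path: str
    text: str
    duration_s: float  # clip length in seconds, e.g. from the STT metadata


def keep_for_tts(clip, min_s=1.0, max_s=12.0, min_cps=3.0, max_cps=25.0):
    """Heuristic filter for TTS training pairs.

    Drops clips that are empty, too short, too long, or whose
    characters-per-second rate hints at a text/audio mismatch.
    All thresholds are illustrative assumptions, not tuned values.
    """
    text = clip.text.strip()
    if not text or not (min_s <= clip.duration_s <= max_s):
        return False
    chars_per_second = len(text) / clip.duration_s
    return min_cps <= chars_per_second <= max_cps


def to_manifest_line(clip):
    """Format one pair as 'path|text' (an assumed manifest layout)."""
    return f"{clip.audio_path}|{clip.text.strip()}"


clips = [
    Clip("a.wav", "བཀྲ་ཤིས་བདེ་ལེགས།", 2.0),
    Clip("b.wav", "", 3.0),              # empty transcript: dropped
    Clip("c.wav", "ཐུགས་རྗེ་ཆེ།", 30.0),     # too long: dropped
]
manifest = [to_manifest_line(c) for c in clips if keep_for_tts(c)]
print("\n".join(manifest))
```

The speaking-rate check is a cheap proxy for alignment errors; suitable bounds depend on the speakers and should be set after inspecting the actual data.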

Subtasks

[ ] Data Preparation:
- [ ] Clean and preprocess the Tibetan text and audio data.
- [ ] Split the dataset into training, validation, and test sets.
- [ ] Optionally convert text to phonemes to enhance pronunciation quality.
[ ] Model Training:
- [ ] Select a suitable TTS architecture (e.g., FastSpeech 2, Tacotron 2).
- [ ] Set up the training environment with required libraries and dependencies.
- [ ] Fine-tune the model using the prepared Tibetan dataset.
- [ ] Monitor training metrics and adjust hyperparameters as needed.
[ ] Model Evaluation:
- [ ] Evaluate the model using test data to assess audio quality and synthesis speed.
- [ ] Compare performance metrics with previous TTS models used.
[ ] Model Optimization:
- [ ] Optimize the model for faster inference (e.g., by converting to ONNX).
- [ ] Reduce the model size if necessary without compromising quality.
[ ] Testing and Deployment:
- [ ] Deploy the model for real-time use and collect feedback.
- [ ] Perform extensive testing with various Tibetan text inputs to ensure stability and quality.
[ ] Documentation:
- [ ] Document the entire training process, including data preparation, training configurations, and evaluation results.
- [ ] Prepare a report comparing the new model with previous versions.
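
For the dataset split subtask, one option is a deterministic split keyed on a hash of the utterance ID, so that the same clip always lands in the same set across preprocessing runs. The sketch below assumes hypothetical ID strings and a 90/5/5 split; neither comes from the project.

```python
import hashlib


def assign_split(utterance_id, val_pct=5, test_pct=5):
    """Map an utterance ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(utterance_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Hypothetical utterance IDs, for illustration only.
ids = [f"utt_{i:05d}" for i in range(1000)]
counts = {"train": 0, "val": 0, "test": 0}
for uid in ids:
    counts[assign_split(uid)] += 1
print(counts)
```

If the data has several speakers, hashing the speaker ID instead of the utterance ID keeps each speaker in a single set and avoids leakage between training and test data.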

The following TTS models and tools are candidates for fine-tuning on Tibetan data. They are known for their quality, speed, and ability to handle customization. Note that Mozilla TTS is a toolkit and HiFi-GAN is a vocoder rather than a complete TTS model:

1. FastSpeech 2

Description: A fast and robust non-autoregressive TTS model that predicts mel-spectrograms from text with low latency; a separate vocoder converts the spectrograms to audio.
Pros:
- High synthesis speed, making it suitable for real-time applications.
- Handles complex prosody and natural intonation well.
- Lighter and faster than many traditional models.
Cons:
- May require phoneme-level inputs for best performance, and training needs per-phoneme durations, typically obtained from a forced aligner.
Use Case: Fine-tuning for specific languages, including Tibetan, where speed and quality are important.
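
Before any phoneme conversion, Tibetan text has to be broken into units. A minimal sketch of that first step splits on the tsheg (U+0F0B), which separates syllables, and the shad (U+0F0D), which ends a clause. The function name is hypothetical, syllables are not phonemes, and other marks such as the non-breaking tsheg (U+0F0C) or the double shad (U+0F0E) are not handled here.

```python
import re

TSHEG = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"   # clause / sentence delimiter


def split_syllables(text):
    """Split Tibetan text into syllables on tsheg, shad, and whitespace."""
    parts = re.split(f"[{TSHEG}{SHAD}\\s]+", text)
    return [p for p in parts if p]  # drop empty strings at the edges


print(split_syllables("བཀྲ་ཤིས་བདེ་ལེགས།"))
```

A real front end would map each syllable to phonemes with a pronunciation lexicon or grapheme-to-phoneme rules for the target dialect; that mapping is outside the scope of this sketch.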

2. Tacotron 2

Description: A widely used TTS model that produces natural-sounding speech by converting text into mel spectrograms, which are then converted to audio.
Pros:
- High-quality, natural prosody, and voice expression.
- Well-documented and supported by the TTS community.
Cons:
- Slower inference compared to models like FastSpeech 2.
- Requires careful tuning to handle low-resource languages effectively.
Use Case: Suitable for languages where expressive and natural speech synthesis is prioritized.

3. Mozilla TTS

Description: An open-source TTS toolkit, rather than a single model, offering customizable implementations of architectures such as Tacotron 2 and Glow-TTS.
Pros:
- Flexible with various architecture choices and easy to fine-tune.
- Good community support and documentation.
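Cons:
- Setup may require some experience with deep learning frameworks.
- The original Mozilla repository is no longer actively developed; work continued in the Coqui TTS fork, so check the maintenance status before adopting it.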
Use Case: A great option for fine-tuning with custom datasets, offering a variety of architectures.

4. Glow-TTS

Description: A non-autoregressive TTS model that generates high-quality speech with efficient training and inference speed.
Pros:
- Fast inference with good audio quality.
- Non-autoregressive nature reduces computational overhead.
Cons:
- Slightly less expressive than Tacotron 2 in some cases.
Use Case: Best for applications requiring efficient, quick speech synthesis.

5. VITS (Variational Inference TTS)

Description: An end-to-end, non-autoregressive model that combines a conditional variational autoencoder, normalizing flows, and adversarial training to generate waveforms directly from text, without a separate vocoder.
Pros:
- Produces highly natural and expressive speech.
- Faster inference compared to traditional autoregressive models.
Cons:
- More complex architecture, requiring careful tuning.
Use Case: Suitable for scenarios where both quality and speed are needed, with a bit more setup complexity.

6. HiFi-GAN (Used as a Vocoder)

Description: A neural vocoder often used alongside TTS models like Tacotron 2 and FastSpeech 2 to convert spectrograms to audio.
Pros:
- High fidelity and fast audio generation.
- Can be used to improve audio quality when paired with other TTS models.
Cons:
- Only acts as a vocoder, not a complete TTS model.
Use Case: Integrate with models like FastSpeech 2 for enhanced audio quality.

Recommendations

For Speed and Efficiency: FastSpeech 2 and Glow-TTS are excellent choices due to their non-autoregressive nature, which ensures faster inference.
For Quality and Expressiveness: Tacotron 2, especially when paired with HiFi-GAN, offers strong expressiveness, but its autoregressive decoding makes inference slower.
For Flexibility and Community Support: Mozilla TTS is a great all-around choice due to its open-source nature and the availability of multiple architectures.
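
To compare candidates on speed, a common measure is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The harness below is a sketch under assumptions: `synthesize` is a hypothetical callable that takes a string and returns a sequence of audio samples, and `dummy_synthesize` is a stand-in so the example runs without any TTS library.

```python
import time


def real_time_factor(synthesize, texts, sample_rate):
    """Return total synthesis time / total audio duration.

    Values below 1.0 mean the model runs faster than real time.
    """
    elapsed = 0.0
    audio_seconds = 0.0
    for text in texts:
        start = time.perf_counter()
        samples = synthesize(text)
        elapsed += time.perf_counter() - start
        audio_seconds += len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() produced no audio")
    return elapsed / audio_seconds


def dummy_synthesize(text, sample_rate=22050):
    """Stand-in model: sleeps briefly, returns 0.1 s of silence per character."""
    time.sleep(0.01)
    return [0.0] * (sample_rate // 10 * max(len(text), 1))


rtf = real_time_factor(dummy_synthesize, ["abc", "defgh"], sample_rate=22050)
print(f"RTF: {rtf:.3f}")
```

For a fair comparison, each model should be run on the same sentences and hardware, with a few warm-up calls excluded from the timing.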

T5, MMS (Massively Multilingual Speech), and FastSpeech 2 differ in purpose, architecture, and suitability for TTS, especially for a language like Tibetan. The breakdown below describes each model and compares T5 and MMS to FastSpeech 2:

1. FastSpeech 2

Purpose: Specifically designed for TTS.
Architecture: Non-autoregressive Transformer-based model that generates mel-spectrograms from input text, which are then converted into audio using a vocoder like HiFi-GAN.
Strengths:
- High-quality, natural-sounding speech with complex prosody.
- Very fast inference due to its non-autoregressive nature, making it suitable for real-time TTS.
- Optimized for TTS tasks and can be fine-tuned easily on specific languages, including low-resource ones like Tibetan.
Weaknesses:
- Requires text-audio pairs and sometimes phoneme-level input for best performance.
- Not designed for general language understanding or multitask capabilities.
Suitability for Tibetan TTS: Excellent, as it directly addresses TTS needs with customizable training, and it’s relatively light and fast.
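
One reason FastSpeech-style models can generate all frames in parallel is the length regulator: each input position is repeated according to a predicted duration, so the expanded sequence already has one entry per mel-spectrogram frame. The sketch below uses plain Python lists and made-up numbers to show the idea; it is not taken from any particular implementation.

```python
def length_regulate(hidden, durations):
    """Repeat each phoneme-level vector durations[i] times.

    The result has one vector per output frame, which lets the decoder
    process every frame at once instead of step by step.
    """
    if len(hidden) != len(durations):
        raise ValueError("one duration per input position is required")
    frames = []
    for vec, d in zip(hidden, durations):
        frames.extend([vec] * d)
    return frames


# Illustrative values: three phoneme vectors and their frame counts.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]
frames = length_regulate(hidden, durations)
print(len(frames))  # 6 frames = 2 + 1 + 3
```

During training the durations come from aligned data; at inference a duration predictor supplies them, and scaling the durations is a simple way to change speaking rate.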

2. T5 (Text-to-Text Transfer Transformer)

Purpose: A general-purpose sequence-to-sequence model for various NLP tasks, including text generation, translation, summarization, and more.
Architecture: Transformer encoder-decoder trained on a vast amount of text data (mainly English for the original T5; mT5 is the multilingual variant), handling many tasks in a unified "text-to-text" framework.
Strengths:
- Highly versatile and can be adapted for a wide range of text-based tasks.
- Strong language understanding capabilities, with multilingual text processing available through the mT5 variant.
Weaknesses:
- Not designed for TTS; it doesn't generate audio or handle the direct conversion of text to speech.
- Requires additional models (an acoustic model and a vocoder) to handle speech synthesis, which is not its core functionality.
Suitability for Tibetan TTS: Not directly suitable. Even if a T5 variant is used to process Tibetan text, it is not designed for speech generation. Using it for TTS would require a separate acoustic model and vocoder, which complicates the pipeline and reduces efficiency compared to FastSpeech 2.

3. MMS (Massively Multilingual Speech)

Purpose: A family of speech models developed by Meta AI that supports a wide range of languages, including low-resource and endangered ones, for speech recognition, language identification, and speech synthesis.
Architecture: Multilingual wav2vec 2.0 Transformer models trained on large-scale data for speech recognition and language identification; the TTS side consists of separate per-language VITS models.
Strengths:
- Highly adaptable to various languages, including Tibetan, thanks to its multilingual training approach.
- Covers both speech-to-text and text-to-speech through separate models, making it a versatile toolset for speech tasks.
- Fine-tuning capabilities allow adaptation to specific accents, dialects, or speaker-specific data.
Weaknesses:
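- The wav2vec 2.0-based recognition models are large and may require significant computational power for training and inference.
- The TTS checkpoints are potentially slower than models specifically optimized for speed like FastSpeech 2, since VITS generates waveforms end to end; this should be benchmarked rather than assumed.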
Suitability for Tibetan TTS: Very promising, especially due to its focus on multilingual speech. It can handle Tibetan if adapted correctly and offers a broad range of features for speech applications. However, it might not be as optimized for fast, lightweight TTS as FastSpeech 2.

Comparison Summary

FastSpeech 2: Best choice for dedicated TTS applications where speed, efficiency, and audio quality are critical. Ideal for real-time speech generation and fine-tuning on Tibetan data.
T5: Not suitable for TTS tasks as it is primarily a text-to-text model without native audio generation capabilities. Better suited for language processing, translation, and text-based tasks.
MMS: A strong alternative for TTS, especially in multilingual and low-resource language contexts like Tibetan. Offers comprehensive speech capabilities but may not match the speed and lightweight nature of FastSpeech 2.

Recommendation

For Tibetan TTS, FastSpeech 2 remains the best option if your goal is to achieve high-quality, fast, and lightweight text-to-speech conversion. MMS is a good alternative if you're exploring broader speech capabilities or multilingual support, but it might come with additional computational costs.

OpenPecha / tts-model