Persian Tacotron2 is a customized implementation of Tacotron2, adapted for Persian text-to-speech (TTS) synthesis. Tacotron2 is a model that converts text into mel-spectrograms, which a vocoder then synthesizes into audio. This implementation builds on NVIDIA's Tacotron2, with adjustments for training on Persian phoneme-based data.
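Because the model is trained on phoneme sequences rather than raw Persian script, the text front end mainly has to normalize the input and map each phoneme symbol to an integer ID. The following is a minimal sketch of that idea, not the project's code: the symbol set, `persian_phoneme_cleaner`, and `phonemes_to_sequence` are hypothetical, and Latin letters merely stand in for whatever phoneme alphabet the transcriptions actually use.

```python
import re
from typing import List

# Assumed symbol set for illustration only: a padding symbol, a few
# punctuation marks, and Latin letters standing in for Persian phonemes.
PAD = "_"
PUNCTUATION = "!,.? "
PHONEMES = "abcdefghijklmnopqrstuvwxyzACGSZ"
SYMBOLS = [PAD] + list(PUNCTUATION) + list(PHONEMES)
SYMBOL_TO_ID = {symbol: i for i, symbol in enumerate(SYMBOLS)}

_WHITESPACE_RE = re.compile(r"\s+")


def persian_phoneme_cleaner(text: str) -> str:
    """Hypothetical cleaner: collapse runs of whitespace and strip the ends.

    It deliberately does not lowercase, because in a phoneme alphabet the
    case of a letter can distinguish two symbols (e.g. 'a' vs 'A').
    """
    return _WHITESPACE_RE.sub(" ", text).strip()


def phonemes_to_sequence(text: str) -> List[int]:
    """Clean the text, then map every known symbol to its integer ID.

    Characters missing from SYMBOL_TO_ID are skipped, much as NVIDIA's
    text_to_sequence skips symbols it does not know.
    """
    cleaned = persian_phoneme_cleaner(text)
    return [SYMBOL_TO_ID[ch] for ch in cleaned if ch in SYMBOL_TO_ID]
```

Keeping the symbol table in one place means the embedding size of the model and the cleaner can never disagree about how many symbols exist.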
To adapt Tacotron2 for Persian, the following changes were made:
- cleaner.py in tacotron2/text/ was modified to handle Persian phonemes.
- hparams.py in tacotron2/ was adjusted for Persian language data.
Clone the Repository
git clone https://github.com/your_username/persian_tacotron.git
cd persian_tacotron
Install Requirements
pip install -r tacotron2/requirements.txt
Prepare Your Data
Place the phoneme transcriptions of your audio files in files/phoneme_transcriptions.txt.
Create Data Files
Run the data preparation script:
python create_data_file.py
This generates text files in files/text_files/. Move these files to tacotron2/filelists/ for training.
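What create_data_file.py does internally is not described here, so the following is only a sketch of the kind of conversion involved. It assumes each line of files/phoneme_transcriptions.txt has the form `name|phonemes`, that the audio sits in a hypothetical `wavs/` directory, and it emits lines in the `wav_path|text` format used by NVIDIA's Tacotron2 filelists. `build_filelists`, `write_filelist`, and the validation split are illustrative assumptions.

```python
import random
from typing import Iterable, List, Tuple


def build_filelists(
    transcription_lines: Iterable[str],
    wav_dir: str,
    val_fraction: float = 0.05,
    seed: int = 1234,
) -> Tuple[List[str], List[str]]:
    """Convert assumed 'name|phonemes' lines into (train, val) filelist lines.

    Each output line has the form 'wav_dir/name.wav|phonemes'.
    """
    entries = []
    for raw in transcription_lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        name, phonemes = line.split("|", 1)
        wav_path = f"{wav_dir.rstrip('/')}/{name.strip()}.wav"
        entries.append(f"{wav_path}|{phonemes.strip()}")

    # Deterministic shuffle so the train/validation split is reproducible.
    random.Random(seed).shuffle(entries)
    n_val = max(1, int(len(entries) * val_fraction)) if entries else 0
    return entries[n_val:], entries[:n_val]


def write_filelist(lines: Iterable[str], path: str) -> None:
    """Write one entry per line as UTF-8 so non-ASCII symbols survive on any OS."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)
```

Whatever names you give the written files (for example, hypothetical persian_train.txt and persian_val.txt in tacotron2/filelists/), the training_files and validation_files entries in hparams.py must point at them.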
Configure Hyperparameters
Edit hparams.py in tacotron2/ to set parameters such as epochs and iters_per_checkpoint, as well as the training_files and validation_files paths.
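Instead of editing hparams.py for every run, NVIDIA's train.py also accepts a `--hparams` string of comma-separated `name=value` pairs; whether this fork keeps that flag is an assumption worth checking. The sketch below shows the same override idea in plain Python. `TrainingConfig` and `apply_overrides` are hypothetical, and the default values are placeholders, not this project's settings.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical subset of the hparams.py fields named in this README.

    Default values are placeholders, not the project's real settings.
    """
    epochs: int = 500
    iters_per_checkpoint: int = 1000
    batch_size: int = 16
    training_files: str = "filelists/train.txt"
    validation_files: str = "filelists/val.txt"


def apply_overrides(config: TrainingConfig, overrides: str) -> TrainingConfig:
    """Apply 'name=value,name=value' overrides, casting to each field's type."""
    field_types = {f.name: f.type for f in fields(config)}
    changes = {}
    for pair in filter(None, (p.strip() for p in overrides.split(","))):
        name, _, value = pair.partition("=")
        name = name.strip()
        if name not in field_types:
            raise KeyError(f"unknown hyperparameter: {name}")
        changes[name] = field_types[name](value.strip())  # e.g. int("400")
    return replace(config, **changes)
```

Rejecting unknown names catches typos such as `epoch=400`, which would otherwise be silently ignored.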
Start Training
Begin training with:
python tacotron2/train.py --output_directory=outdir --log_directory=logdir
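Checkpoints are saved in tacotron2/outdir/ every iters_per_checkpoint iterations. One iteration processes one batch, so an epoch takes roughly (number of training files) / batch_size iterations: with 1000 audio files and a batch size of 16, each epoch includes about 1000/16 ≈ 62 iterations. If you run out of GPU memory, reduce batch_size in hparams.py.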
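The iteration arithmetic above can be checked with a few lines of Python. This is a sketch, not project code: the helper names are hypothetical, and the `drop_last=True` default assumes the training loader discards the final incomplete batch, as NVIDIA's train.py does. Pass `drop_last=False` for a loader that keeps it.

```python
import math


def iterations_per_epoch(n_train_files: int, batch_size: int, drop_last: bool = True) -> int:
    """Number of optimizer steps in one pass over the training filelist."""
    if drop_last:
        return n_train_files // batch_size  # the incomplete last batch is skipped
    return math.ceil(n_train_files / batch_size)


def total_iterations(n_train_files: int, batch_size: int, epochs: int, drop_last: bool = True) -> int:
    """Steps for a full run of `epochs` epochs."""
    return iterations_per_epoch(n_train_files, batch_size, drop_last) * epochs


print(iterations_per_epoch(1000, 16))   # 62
print(iterations_per_epoch(1000, 8))    # 125
print(total_iterations(1000, 16, 400))  # 24800
```

Halving batch_size to save memory roughly doubles the iterations per epoch, so a fixed iters_per_checkpoint is reached after about half as many epochs.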
Test the Model
Update get_results.py with the phoneme sequence you’d like to test (text = "YOUR_TEST_PHONEME").
Run inference with the latest checkpoint. For example:
python get_results.py 32000
Outputs (mel-spectrograms and audio files) will be saved in results/.
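To pick the latest checkpoint, you can read the iteration number from each checkpoint's file name. The sketch below assumes checkpoints keep NVIDIA's `checkpoint_<iteration>` naming and that the number passed to get_results.py (32000 in the example above) is that iteration. Both are assumptions, and `latest_iteration` / `latest_checkpoint_in` are hypothetical helpers.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Matches names such as "checkpoint_32000" (assumed naming scheme).
_CHECKPOINT_RE = re.compile(r"^checkpoint_(\d+)$")


def latest_iteration(names: Iterable[str]) -> Optional[int]:
    """Return the largest iteration among checkpoint file names, or None."""
    iterations = [int(m.group(1)) for m in map(_CHECKPOINT_RE.match, names) if m]
    return max(iterations, default=None)


def latest_checkpoint_in(output_dir: str) -> Optional[int]:
    """Scan an output directory (e.g. tacotron2/outdir) for checkpoint files."""
    return latest_iteration(p.name for p in Path(output_dir).iterdir() if p.is_file())
```

The iterations are compared as integers on purpose: sorting the file names as strings would rank checkpoint_9000 above checkpoint_32000.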
Training the model on 2500 audio files for 400 epochs produced the following results:
Click here for sample audio results.