IIEleven11 / StyleTTS2FineTune

StyleTTS2 Fine-Tuning Guide

This repository is a guide to preparing a dataset and fine-tuning a model with StyleTTS2 (https://github.com/yl4579/StyleTTS2).

If you still need to curate your dataset, you might want to check out https://github.com/IIEleven11/Automatic-Audio-Dataset-Maker. At the end you'll need to convert its output from .csv to StyleTTS2's .txt format (train_list.txt and val_list.txt), but that should be easy.
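
As a rough illustration of that conversion, here is a minimal Python sketch. It assumes the CSV has a header row with columns named audio and text (hypothetical names; check your own file) and writes the filename.wav|transcription|speaker layout that StyleTTS2 describes for its data lists. The helper names, the single speaker ID "0", and the 10% validation split are illustrative choices, not something this repo prescribes.

```python
import csv
import io
import random


def csv_to_list_lines(csv_text, audio_col="audio", text_col="text", speaker="0"):
    """Turn CSV text into 'filename|transcription|speaker' lines.

    audio_col and text_col are hypothetical column names; rename them to
    match the header of your CSV.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row[text_col].strip()
        if text:  # skip rows with an empty transcription
            lines.append(f"{row[audio_col].strip()}|{text}|{speaker}")
    return lines


def split_train_val(lines, val_fraction=0.1, seed=0):
    """Shuffle reproducibly and return (train_lines, val_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one line for validation.
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Write each returned list to train_list.txt and val_list.txt, one line per entry. If the transcriptions are plain text rather than phonemes, they would still need to go through the phonemization step described later in this guide.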

Compatibility

The scripts are compatible with WSL2 and Linux. Windows requires additional dependencies and might not be worth the effort.

Setup

Environment Setup

  1. Install conda and activate environment with Python 3.10:
    • conda create --name dataset python==3.10
    • conda activate dataset

Install PyTorch

- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U

Install WhisperX, phonemizer, and the segmentation packages

- pip install git+https://github.com/m-bain/whisperx.git
- pip install phonemizer pydub pysrt tqdm
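
To confirm the environment is complete before moving on, a small check like the one below can help. It is only a sketch: the function name is made up for illustration, and it only tests whether each package can be found, not whether it runs correctly on your GPU.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["torch", "torchvision", "torchaudio", "whisperx",
              "phonemizer", "pydub", "pysrt", "tqdm"]
    missing = missing_packages(wanted)
    print("Missing:", ", ".join(missing) if missing else "none")
```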

Data Preparation

  1. Change directory to where you unpacked StyleTTS2FineTune (you should see the makeDataset folder).
  2. To create the base directories, run the segmenter script once. It will create the folders.

    1. run python srtsegmenter.py
  3. Add your WAV audio file(s) to the audio directory (remove special characters, brackets, and parentheses from the file names to prevent issues).
  4. (Optional) Add silence. This step isn't mandatory for training; you can run WhisperX and segmentation without it. If you do want it, silencebuffer.py in the tools folder scans your audio file, finds the silent portions between sentences or breaks in speech, and adds a fixed length of silence to each. In theory this gives the segmentation step cleaner cut points. You MUST adjust the parameters in the script to fit your data. The values left in the code worked for my dataset, so you can try them as a starting point.
  5. Run the following command to generate srt files for all files in the audio folder:

    • Linux -
      for i in ../audio/*.wav; do whisperx "$i" --model large-v3 --output_format srt --condition_on_previous_text True --max_line_width 250  --max_line_count 1  --segment_resolution sentence  --align_model WAV2VEC2_ASR_LARGE_LV60K_960H; done
    • Windows - in a PowerShell terminal, copy and paste the following after verifying the path to your audio folder:

      Get-ChildItem -Path 'C:\path\to\wav\folder' -Filter *.wav | ForEach-Object { whisperx $_.FullName --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H }

    This will generate a WhisperX .srt transcription of your audio. Place the .srt file(s) into the srt folder.
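
To make the optional silence step (step 4) more concrete, here is a minimal sketch of the idea on a plain list of audio samples. It is not the code in silencebuffer.py: the function names, the amplitude threshold, and the lengths are all hypothetical, and a real script would work on decoded audio rather than Python lists.

```python
def find_silences(samples, threshold, min_len):
    """Return (start, end) index pairs of quiet runs.

    A run counts as silence when every sample satisfies
    abs(sample) <= threshold and the run is at least min_len samples long.
    """
    runs, start = [], None
    for i, sample in enumerate(samples):
        if abs(sample) <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples)))
    return runs


def add_silence(samples, threshold, min_len, extra):
    """Append `extra` zero samples to the end of every detected silence."""
    out, prev = [], 0
    for _start, end in find_silences(samples, threshold, min_len):
        out.extend(samples[prev:end])
        out.extend([0] * extra)
        prev = end
    out.extend(samples[prev:])
    return out
```

Short quiet spots (shorter than min_len) are left alone, which is why the parameters need tuning per dataset: too low a min_len lengthens pauses inside words, too high a value misses real sentence breaks.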

Segmentation and Transcription

  1. Navigate to the main directory (You should see the folder makeDataset)
  2. srtsegmenter.py has a few variables to adjust: buffer_time, max_allowed_gap, and the duration range in the final if statement. You can try the defaults I have set, which worked for me, but there's a chance they will not suit your dataset. My process was to adjust buffer_time, run srtsegmenter.py, then listen to the segments in order. If they overlap, cut mid-sentence, or have artifacts, go back and adjust buffer_time. Repeat until you get the desired results.
  3. Run the segmentation script (python makeDataset/tools/srtsegmenter.py)
  4. Run the add_padding.py script to add a duration of silence to the end of each audio clip.
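
For reference, appending silence to the end of a clip can be done with Python's standard wave module, roughly as sketched below. This is not the code in add_padding.py; the function name and the pad length are hypothetical, and the sketch assumes 16-bit PCM WAV data (8-bit WAV uses a different value for silence).

```python
import io
import wave


def pad_wav_end(wav_bytes, pad_seconds):
    """Return new WAV bytes with pad_seconds of silence appended.

    Assumes 16-bit (or wider) PCM, where silence is all-zero bytes.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    pad_frames = int(round(params.framerate * pad_seconds))
    silence = b"\x00" * (pad_frames * params.nchannels * params.sampwidth)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed up on close
        dst.writeframes(frames + silence)
    return out.getvalue()
```

To pad a file on disk, read it with open(path, "rb").read(), pass the bytes through pad_wav_end, and write the result back out.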

The above steps will generate a set of segmented audio files, a folder of rejected audio, and an output.txt file. The script is set to discard segments shorter than 1 second or longer than 11.6 seconds; you can adjust this range.
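
The sketch below shows one way the buffer and duration range could interact. It is an illustration, not the logic of srtsegmenter.py: it parses SRT cues, widens each cue by a buffer on both sides, and keeps only those inside the 1 to 11.6 second range mentioned above. The function names and the 0.2 second default buffer are assumptions, and max_allowed_gap is not modeled here.

```python
import re

_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp such as '00:01:02,500' to seconds."""
    h, m, s, ms = map(int, _TIME.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Return a list of (start, end, text) tuples, one per SRT cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:  # the timing line of this cue
                start, end = (to_seconds(part) for part in line.split("-->"))
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues


def select_segments(cues, buffer_time=0.2, min_len=1.0, max_len=11.6):
    """Pad each cue by buffer_time and split into (kept, rejected) by duration."""
    kept, rejected = [], []
    for start, end, text in cues:
        start = max(0.0, start - buffer_time)
        end = end + buffer_time
        target = kept if min_len <= end - start <= max_len else rejected
        target.append((start, end, text))
    return kept, rejected
```

Printing the rejected list for a file is a quick way to see how much audio a given range throws away before committing to it.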

At this point you should use the curate.ipynb notebook within this repo. Make a copy of the output.txt file and format it following the outline in the notebook.

Phonemization

  1. Run the script (python makeDataset/tools/phonemized.py --language en-us). The --language argument refers to an espeak-ng voice, such as 'fr-fr' for French (default is en-us). Check the espeak-ng identifier for your language here.
  2. This script will create the train_list.txt and val_list.txt files.

Notes on LibriTTS and the OOD list:

  1. The LibriTTS dataset has poor punctuation, and its spoken and unspoken pauses do not always match the transcripts. This is a common oversight in many datasets.
  2. It also lacks variety in punctuation. In the field, you may encounter texts with creative use of dashes, pauses, and combinations of quotes and punctuation. LibriTTS lacks those cases, but the model can learn them.
  3. Additionally, LibriTTS has stray quotes in some texts, or begins a sentence with a quote. These things reduce quality a little (or sometimes a lot). You will want to filter those out.
  4. Creating your own OOD_list.txt is an option. I need to experiment with it more; the only real requirements should be good punctuation and text the model has not seen. I'm not sure what the ideal size of this list is, though.
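
A simple filter along the lines of point 3 might look like the sketch below. It works on plain text lines and makes no assumption about the OOD file layout. The rules (reject lines that start with a quote character or contain unbalanced double quotes) and the function names are illustrative; you may want stricter or looser checks for your own texts.

```python
QUOTE_CHARS = "\"'\u201c\u201d\u2018\u2019"  # straight and curly quotes


def is_clean_ood_line(text):
    """Return True if the line has no leading quote and no stray double quote."""
    text = text.strip()
    if not text:
        return False
    if text[0] in QUOTE_CHARS:  # sentence begins with a quote
        return False
    if text.count('"') % 2 != 0:  # odd number of straight double quotes
        return False
    if text.count("\u201c") != text.count("\u201d"):  # unmatched curly quotes
        return False
    return True


def filter_ood(lines):
    """Keep only the lines that pass is_clean_ood_line."""
    return [line.strip() for line in lines if is_clean_ood_line(line)]
```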

Fine-Tuning with StyleTTS2

  1. Clone the StyleTTS2 repository and navigate to its directory:

    • git clone https://github.com/yl4579/StyleTTS2.git
    • cd StyleTTS2
  2. Install the required packages:

    • pip install -r requirements.txt
    • sudo apt-get install espeak-ng
  3. Prepare the data and model:

    • Clear the wavs folder in the Data directory and replace its contents with your segmented WAV files.
    • Replace the val_list.txt and train_list.txt files in the Data folder with yours. Keep the OOD_list.txt file.
    • Adjust the parameters in the config_ft.yml file in the Configs folder according to your needs.
  4. Download the StyleTTS2-LibriTTS model and place it in the Models/LibriTTS directory.
  5. If the language of your dataset is not English, you will need to modify the PL-BERT model used by StyleTTS2. If this is your case, refer to this repository (don't forget to check if your language is supported).
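
Before starting a run, it can save time to confirm that every file named in your lists actually exists in the wavs folder. The sketch below assumes each list line starts with the WAV file name followed by a | separator, as in the filename.wav|transcription|speaker layout; the function name and the example paths are hypothetical.

```python
from pathlib import Path


def missing_wavs(list_path, wav_dir):
    """Return (line_number, filename) for every list entry with no WAV file."""
    missing = []
    text = Path(list_path).read_text(encoding="utf-8")
    for line_no, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # ignore blank lines
        name = line.split("|", 1)[0].strip()
        if not (Path(wav_dir) / name).is_file():
            missing.append((line_no, name))
    return missing
```

For example, missing_wavs("Data/train_list.txt", "Data/wavs") should return an empty list when the data is in place; adjust both paths to match what you set in config_ft.yml.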

Run

Finally, you can start the fine-tuning process with the following command: