F5-TTS: Diffusion Transformer with ConvNeXt V2, faster training and inference.
E2 TTS: Flat-UNet Transformer, closest reproduction.
Sway Sampling: Inference-time flow step sampling strategy, greatly improves performance.
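To make the idea of sway sampling more concrete, here is a minimal sketch that warps uniformly spaced flow steps with the function t = u + s(cos(πu/2) − 1 + u). The formula and the coefficient s = -1 are recalled from the F5-TTS paper, not taken from this README, so treat both as assumptions and check them against the paper and the repository code. `sway_steps` is a hypothetical helper name.

```python
import math
from typing import List


def sway_steps(num_steps: int, s: float = -1.0) -> List[float]:
    """Return num_steps + 1 flow times in [0, 1], warped by the sway function.

    Assumed form (to be verified against the paper): t = u + s * (cos(pi/2 * u) - 1 + u)
    With s < 0 the steps are concentrated near t = 0; s = 0 keeps them uniform.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    steps = []
    for i in range(num_steps + 1):
        u = i / num_steps  # uniformly spaced position in [0, 1]
        steps.append(u + s * (math.cos(math.pi / 2 * u) - 1 + u))
    return steps


if __name__ == "__main__":
    for uniform, swayed in zip(sway_steps(8, s=0.0), sway_steps(8)):
        print(f"u = {uniform:.3f}  ->  t = {swayed:.3f}")
```

With a negative coefficient, more of the steps fall early in the flow, which is the behaviour the strategy relies on; no retraining is involved because only the inference-time schedule changes.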
Clone the repository:
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
Install packages:
pip install -r requirements.txt
Install torch with your CUDA version, e.g.:
pip install torch==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
Note: numpy must be older than 2.0 (for example, pip install numpy==1.22.0).
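After installation, a quick check like the following can confirm which versions ended up in the environment. This is a sketch using only the standard library; the helper names are hypothetical, and the only rule it enforces is the numpy constraint stated in the note above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def numpy_is_compatible(version: str) -> bool:
    """True when the major version is below 2, as the note above requires."""
    return int(version.split(".")[0]) < 2


if __name__ == "__main__":
    for package in ("torch", "torchaudio", "numpy"):
        print(f"{package}: {installed_version(package) or 'not installed'}")
    numpy_version = installed_version("numpy")
    if numpy_version is not None and not numpy_is_compatible(numpy_version):
        print("numpy 2.x detected; install a version below 2.0")
```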
Example data processing scripts are provided for Emilia and Wenetspeech4TTS. You may write your own script, along with a matching Dataset class in model/dataset.py.
# prepare a custom dataset according to your needs
# download the corresponding dataset first, and fill in its path in the scripts
# Prepare the Emilia dataset
python scripts/prepare_emilia.py
# Prepare the Wenetspeech4TTS dataset
python scripts/prepare_wenetspeech4tts.py
Once your datasets are prepared, you can start the training process.
# set up the accelerate config, e.g. multi-GPU DDP, fp16
# the config will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
accelerate launch test_train.py
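Before launching training, it can help to confirm that the accelerate config file exists. The sketch below assumes the default location quoted above; custom cache settings may place the file elsewhere, and `accelerate_config_path` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def accelerate_config_path(home: Optional[Path] = None) -> Path:
    """Build the default accelerate config path under the given home directory."""
    base = home if home is not None else Path.home()
    return base / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"


if __name__ == "__main__":
    path = accelerate_config_path()
    if path.is_file():
        print(f"Found accelerate config at {path}")
    else:
        print(f"No config at {path}; run `accelerate config` first")
```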
Initial guidance on fine-tuning is available in #57.
To run inference with pretrained models, download the checkpoints from 🤗 Hugging Face.
Generation currently supports up to 30s, which is the TOTAL length of the prompt audio plus the generated audio. Batch inference with chunks is now supported by the Gradio app.
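Because the 30s limit covers the prompt and the generated audio together, the room left for generation depends on the prompt length. The arithmetic can be sketched as follows; the function names are hypothetical and only the 30s figure comes from the text above.

```python
MAX_TOTAL_SECONDS = 30.0  # limit stated above: prompt audio + generated audio


def generation_budget(prompt_seconds: float, max_total: float = MAX_TOTAL_SECONDS) -> float:
    """Seconds of audio that can still be generated after the prompt is counted."""
    if prompt_seconds < 0:
        raise ValueError("prompt_seconds must be non-negative")
    return max(0.0, max_total - prompt_seconds)


def fits_in_one_pass(prompt_seconds: float, generated_seconds: float,
                     max_total: float = MAX_TOTAL_SECONDS) -> bool:
    """True when prompt plus generated audio stays within the total limit."""
    return prompt_seconds + generated_seconds <= max_total


if __name__ == "__main__":
    prompt = 8.0
    print(f"With a {prompt:.0f}s prompt, up to {generation_budget(prompt):.0f}s can be generated.")
```

Text that would exceed this budget needs to be split into chunks, which is what the chunked batch inference in the Gradio app is for.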
You can test single inference with the following command. Before running it, adjust the config to your needs.
# adjust the config to your needs, e.g.:
# fix_duration (the total length of prompt + generated audio, currently up to 30s)
# nfe_step (a larger value takes more time but solves the inference ODE more precisely)
# ode_method (switch to 'midpoint' for better compatibility with a small nfe_step,
#            though 'midpoint' is a 2nd-order ODE solver and slower than 1st-order 'Euler')
python test_infer_single.py
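The trade-off between the two solvers mentioned in the config comments can be illustrated on a toy problem. The sketch below integrates dy/dt = y from t = 0 to t = 1 (exact answer: e) with hand-written Euler and midpoint steps. It does not use the repository's solver or model, and the `steps` argument only stands in for the role that nfe_step plays.

```python
import math
from typing import Callable

Derivative = Callable[[float, float], float]


def euler(f: Derivative, y0: float, steps: int) -> float:
    """1st-order solver: one derivative evaluation per step."""
    h = 1.0 / steps
    t, y = 0.0, y0
    for _ in range(steps):
        y = y + h * f(t, y)
        t += h
    return y


def midpoint(f: Derivative, y0: float, steps: int) -> float:
    """2nd-order solver: two derivative evaluations per step."""
    h = 1.0 / steps
    t, y = 0.0, y0
    for _ in range(steps):
        y_half = y + 0.5 * h * f(t, y)       # estimate at the middle of the step
        y = y + h * f(t + 0.5 * h, y_half)   # full step using the midpoint slope
        t += h
    return y


def growth(t: float, y: float) -> float:
    """Toy ODE dy/dt = y, whose exact solution at t = 1 is e * y0."""
    return y


if __name__ == "__main__":
    for steps in (4, 8, 16, 32):
        err_euler = abs(euler(growth, 1.0, steps) - math.e)
        err_mid = abs(midpoint(growth, 1.0, steps) - math.e)
        print(f"steps={steps:2d}  euler error={err_euler:.5f}  midpoint error={err_mid:.5f}")
```

At the same number of steps the midpoint method is more accurate but costs twice as many derivative evaluations, which mirrors the note that 'midpoint' suits a small nfe_step while being slower than 'Euler'.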
To test speech editing capabilities, use the following command.
python test_infer_single_edit.py
You can launch a Gradio app (web interface) as a GUI for inference.
First, make sure the base dependencies are installed (pip install -r requirements.txt). Then install the Gradio app dependencies:
pip install -r requirements_gradio.txt
After installing the dependencies, launch the app (it loads the checkpoint from Hugging Face by default; you may set ckpt_path to a local file in gradio_app.py):
python gradio_app.py
You can specify the port/host:
python gradio_app.py --port 7860 --host 0.0.0.0
Or launch a share link:
python gradio_app.py --share
test_infer_batch.py
To run batch inference for evaluations, execute the following commands:
# batch inference for evaluations
accelerate config # if not set before
bash test_infer_batch.sh
Some Notes
For faster-whisper with CUDA 11:
pip install --force-reinstall ctranslate2==3.24.0
(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:
pip install faster-whisper==0.10.1
Update the path to your batch inference results, then run the WER / SIM evaluations:
# Evaluation for Seed-TTS test set
python scripts/eval_seedtts_testset.py
# Evaluation for LibriSpeech-PC test-clean (cross-sentence)
python scripts/eval_librispeech_test_clean.py
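For readers unfamiliar with the two metrics, the sketch below shows what they measure in principle: WER as word-level edit distance divided by the reference length, and SIM as cosine similarity between two embedding vectors. It is not the repository's evaluation code, which relies on ASR transcription and a speaker model and may normalize text differently; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```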
@article{chen-etal-2024-f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
journal={arXiv preprint arXiv:2410.06885},
year={2024},
}
Our code is released under the MIT License.