MithrilMan / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching


F5-TTS: Diffusion Transformer with ConvNeXt V2, with faster training and inference.

E2 TTS: Flat-UNet Transformer, the closest reproduction of the paper.

Sway Sampling: An inference-time strategy for sampling flow steps that greatly improves performance.
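To make the Sway Sampling idea concrete, here is a minimal sketch of the flow-step remapping as we understand it from the F5-TTS paper: a uniformly spaced step u in [0, 1] is mapped to t = u + s * (cos(pi/2 * u) - 1 + u). The function names `sway_sample` and `flow_steps` are hypothetical, and the coefficient s = -1 is an illustrative assumption, not a statement about this repository's defaults.

```python
import math


def sway_sample(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step t.

    Assumed form: t = u + s * (cos(pi/2 * u) - 1 + u).
    With s < 0 the steps are pushed towards t = 0 (early flow steps);
    with s = 0 the mapping is the identity.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def flow_steps(n_steps: int, s: float = -1.0) -> list[float]:
    """Return n_steps + 1 swayed time points between 0 and 1."""
    uniform = [i / n_steps for i in range(n_steps + 1)]
    return [sway_sample(u, s) for u in uniform]


if __name__ == "__main__":
    # Compare uniform steps (s = 0) with swayed steps (s = -1).
    print([round(t, 3) for t in flow_steps(4, s=0.0)])
    print([round(t, 3) for t in flow_steps(4, s=-1.0)])
```

The endpoints stay fixed at 0 and 1 for any s, so only the spacing of the intermediate steps changes.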

Thanks to all the contributors!

News

Installation

# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n f5-tts python=3.10
conda activate f5-tts

# Install pytorch with your CUDA version, e.g.
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

Then you can choose from a few options below:

1. As a pip package (if just for inference)

pip install git+https://github.com/SWivid/F5-TTS.git

2. Local editable install (if you also want to train or finetune)

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
# git submodule update --init --recursive  # (optional, only if you need bigvgan)
pip install -e .

If you initialize the submodule, add the following code at the beginning of src/third_party/BigVGAN/bigvgan.py.

import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

3. Docker usage

# Build from Dockerfile
docker build -t f5tts:v1 .

# Or pull from GitHub Container Registry
docker pull ghcr.io/swivid/f5-tts:main

Inference

1. Gradio App

Currently supported features:

# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share

2. CLI Inference

# Run with flags
# Leaving --ref_text "" makes an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."

# Run with the default settings from src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml

# Multi voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
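When the CLI is driven from a script, it can help to build the argument list programmatically. The sketch below assembles the flag-based invocation shown above using only the flags that appear in this README (`--model`, `--ref_audio`, `--ref_text`, `--gen_text`). The helper names `build_infer_command` and `run_inference` are hypothetical and not part of the F5-TTS package.

```python
import shlex
import subprocess


def build_infer_command(model: str, ref_audio: str, ref_text: str, gen_text: str) -> list[str]:
    """Return the argv list for f5-tts_infer-cli, using only flags shown in this README."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        # An empty string asks the ASR model to transcribe the reference audio.
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


def run_inference(cmd: list[str]) -> None:
    """Run the command; requires F5-TTS to be installed in the active environment."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_infer_command(
        model="F5-TTS",
        ref_audio="ref_audio.wav",
        ref_text="",
        gen_text="Some text to synthesize.",
    )
    # Print the shell-quoted command; call run_inference(cmd) to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when the reference or generated text contains quotes or spaces.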

3. More instructions

Training

1. Gradio App

Read training & finetuning guidance for more instructions.

# Quick start with Gradio web interface
f5-tts_finetune-gradio

Evaluation

Development

Use pre-commit to ensure code quality (it runs linters and formatters automatically):

pip install pre-commit
pre-commit install

When making a pull request, before each commit, run:

pre-commit run --all-files

Note: Some model components have linting exceptions for E722 to accommodate tensor notation

Acknowledgements

Citation

If our work and codebase are useful to you, please cite as:

@article{chen-etal-2024-f5tts,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2410.06885},
      year={2024},
}

License

Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license because the training data, Emilia, is an in-the-wild dataset. Sorry for any inconvenience this may cause.