SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License
7.35k stars, 881 forks

Fine-tuning F5-TTS with BigVGAN checkpoint and vocoder #513

Open Mustaphajudi opened 6 hours ago

Mustaphajudi commented 6 hours ago

Question details

Hi @SWivid ,

I'm trying to fine-tune F5-TTS using the provided BigVGAN checkpoint and vocoder. I've followed the instructions in the README regarding setting up the BigVGAN submodule, but I'm unsure about the specific code modifications needed for both fine-tuning and inference.

Could you please provide more detailed guidance on the following:

What changes are required in the training scripts (train.py, finetune_cli.py) to use the BigVGAN vocoder and a BigVGAN-trained checkpoint? I'm particularly interested in how to correctly configure the mel spectrogram generation and handle the data type (FP32) requirements of BigVGAN. For example, should I pass mel_spec_type="bigvgan" to both the CFM model and the Trainer?

Are there any adjustments needed in the model definition files (cfm.py, modules.py) for BigVGAN compatibility during training?

Similarly, what changes are necessary in the inference scripts (infer_cli.py, infer_gradio.py, and utils_infer.py) to use BigVGAN for audio generation after fine-tuning?

Could you also elaborate on the advantages and disadvantages of using BigVGAN compared to Vocos for F5-TTS? For instance, are there differences in terms of:

  1. Audio quality (naturalness, clarity)?
  2. Computational cost (training time, inference speed, memory usage)?
  3. Model size?
  4. Ease of use/setup?

SWivid commented 6 hours ago

Audio quality (naturalness, clarity)?

Computational cost (training time, inference speed, memory usage)?

Model size?

Ease of use/setup?

  1. BigVGAN is slightly better in clarity; Vocos is slightly better in naturalness.
  2. We use a pretrained vocoder. Vocoder training is separate from TTS model training.
  3. Same as 2.
  4. Vocos is currently easier to set up and has a smaller model size (refer to the parameter count of each vocoder).

If you are interested in BigVGAN training or have further questions, @ZhikangNiu might help. I think just passing in mel_spec_type="bigvgan" is fine for training.
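
As a rough illustration of the reply above, the sketch below collects the mel-spectrogram settings in one dictionary so that the same `mel_spec_type` value can be handed to both the model and the trainer. The helper `build_mel_spec_kwargs` is hypothetical and not part of F5-TTS. The numeric defaults are assumptions: 24 kHz, 100 mel bands and hop length 256 are inferred from the vocoder name `bigvgan_v2_24khz_100band_256x`, while `n_fft` and `win_length` are guesses that should be checked against your own training config.

```python
# Hypothetical helper, not part of F5-TTS: keeps mel settings in one place.
SUPPORTED_MEL_SPEC_TYPES = ("vocos", "bigvgan")


def build_mel_spec_kwargs(mel_spec_type="vocos"):
    """Return mel-spectrogram settings for the chosen vocoder type."""
    if mel_spec_type not in SUPPORTED_MEL_SPEC_TYPES:
        raise ValueError(
            f"mel_spec_type must be one of {SUPPORTED_MEL_SPEC_TYPES}, got {mel_spec_type!r}"
        )
    return dict(
        n_fft=1024,                 # assumption, verify against your config
        hop_length=256,             # "256x" in the vocoder name
        win_length=1024,            # assumption, verify against your config
        n_mel_channels=100,         # "100band" in the vocoder name
        target_sample_rate=24000,   # "24khz" in the vocoder name
        mel_spec_type=mel_spec_type,
    )


mel_spec_kwargs = build_mel_spec_kwargs("bigvgan")
print(mel_spec_kwargs["mel_spec_type"])
```

How these values reach the model and the trainer depends on the version of the training script you are using, so treat the dictionary as a checklist rather than a drop-in argument.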

Mustaphajudi commented 6 hours ago

OK, I will wait for @ZhikangNiu for more information about fine-tuning F5 with BigVGAN. @SWivid, for inference I changed mel_spec_type = "vocos" to "bigvgan" in utils_infer.py, and changed the signature to def load_vocoder(vocoder_name="bigvgan", is_local=False, local_path="", device=device):, but I got an error. Note that I downloaded the F5 BigVGAN checkpoint. Maybe I missed something?

Here is the error:

    (venv) C:\newtts\F5-TTS>f5-tts_infer-gradio
    You need to follow the README to init submodule and change the BigVGAN source code.
    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "C:\newtts\F5-TTS\venv\Scripts\f5-tts_infer-gradio.exe\__main__.py", line 4, in <module>
      File "C:\newtts\F5-TTS\src\f5_tts\infer\infer_gradio.py", line 40, in <module>
        vocoder = load_vocoder()
                  ^^^^^^^^^^^^^^
      File "C:\newtts\F5-TTS\src\f5_tts\infer\utils_infer.py", line 116, in load_vocoder
        vocoder = bigvgan.BigVGAN.from_pretrained("nvidia/bigvgan_v2_24khz_100band_256x", use_cuda_kernel=False)
                  ^^^^^^^
    UnboundLocalError: cannot access local variable 'bigvgan' where it is not associated with a value
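
The `UnboundLocalError` in this traceback is a secondary symptom: the real failure is the import, which is caught and only printed. A minimal sketch of that pattern follows, using a deliberately nonexistent module name (`module_that_does_not_exist_xyz`, a placeholder) in place of the BigVGAN import. It is an illustration of the Python behaviour, not the actual F5-TTS code.

```python
def load_with_swallowed_import_error():
    try:
        # Placeholder module name standing in for the failing BigVGAN import.
        import module_that_does_not_exist_xyz as bigvgan
    except ImportError:
        # The error is reported but not re-raised, so execution continues.
        print("You need to follow the README to init submodule and change the BigVGAN source code.")
    # `bigvgan` is a local name that was never bound, so using it fails here.
    return bigvgan


def demo():
    try:
        load_with_swallowed_import_error()
    except UnboundLocalError as exc:
        return type(exc).__name__
    return "no error"


result = demo()
print(result)
```

Temporarily removing the `try`/`except` around the import makes Python show the underlying import error instead.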

SWivid commented 6 hours ago

You need to follow the README to init submodule and change the BigVGAN source code.

As mentioned in the error output, you need to check the README and copy and paste the code into the corresponding place.

Mustaphajudi commented 5 hours ago

I did all the steps mentioned in the README but got the same error:
```bash
git clone https://github.com/JarodMica/F5-TTS.git
cd F5-TTS

# I am using venv at py 3.11
py -3.11 -m venv venv
venv\Scripts\activate

git submodule update --init --recursive  # (optional, if need bigvgan)

pip install -e .
```

If you initialize the submodule, you should add the following code at the beginning of `src/third_party/BigVGAN/bigvgan.py`.
```python
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
```

SWivid commented 5 hours ago

load_vocoder(vocoder_name="bigvgan", is_local=False, local_path="", device=device): but i got error,notice i downloaded the f5

So have you passed in the path of the local BigVGAN directory and set is_local to True?

You may also comment out https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/infer/utils_infer.py#L117-L120 and just put `from third_party.BigVGAN import bigvgan` there to see what specific problem is encountered while importing.

Mustaphajudi commented 5 hours ago

Here is the updated function code:

    # load vocoder
    def load_vocoder(vocoder_name="bigvgan", is_local=True, local_path="C:/newtts/F5-TTS/ckpts/F5TTS_Base_bigvgan/model_1250000.pt", device=device):
        if vocoder_name == "vocos":
            if is_local:
                print(f"Load vocos from local path {local_path}")
                repo_id = "charactr/vocos-mel-24khz"
                revision = None
                config_path = hf_hub_download(repo_id=repo_id, cache_dir=local_path, filename="config.yaml", revision=revision)
                model_path = hf_hub_download(repo_id=repo_id, cache_dir=local_path, filename="pytorch_model.bin", revision=revision)
                vocoder = Vocos.from_hparams(config_path=config_path)
                state_dict = torch.load(model_path, map_location="cpu")
                vocoder.load_state_dict(state_dict)
                vocoder = vocoder.eval().to(device)
            else:
                print("Download Vocos from huggingface charactr/vocos-mel-24khz")
                vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)
        elif vocoder_name == "bigvgan":
            # try:
            from third_party.BigVGAN import bigvgan
            # except ImportError:
            #     print("You need to follow the README to init submodule and change the BigVGAN source code.")
            if is_local:
                """download from https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x/tree/main"""
                vocoder = bigvgan.BigVGAN.from_pretrained(local_path, use_cuda_kernel=False)
            else:
                vocoder = bigvgan.BigVGAN.from_pretrained("nvidia/bigvgan_v2_24khz_100band_256x", use_cuda_kernel=False)

            vocoder.remove_weight_norm()
            vocoder = vocoder.eval().to(device)
        return vocoder

And here is the error:

    (venv) C:\newtts\F5-TTS>f5-tts_infer-gradio
    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "C:\newtts\F5-TTS\venv\Scripts\f5-tts_infer-gradio.exe\__main__.py", line 4, in <module>
      File "C:\newtts\F5-TTS\src\f5_tts\infer\infer_gradio.py", line 40, in <module>
        vocoder = load_vocoder()
                  ^^^^^^^^^^^^^^
      File "C:\newtts\F5-TTS\src\f5_tts\infer\utils_infer.py", line 109, in load_vocoder
        from third_party.BigVGAN import bigvgan
      File "C:\newtts\F5-TTS\src\third_party\BigVGAN\bigvgan.py", line 19, in <module>
        import activations
    ModuleNotFoundError: No module named 'activations'

    (venv) C:\newtts\F5-TTS>
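
The `ModuleNotFoundError` above is what a bare sibling import produces when a file is loaded as part of a package: Python searches `sys.path`, not the directory of the importing file. The sketch below reproduces this in a throwaway directory and shows why the `sys.path.append(...)` lines from the README help. All package and module names here (`demo_vendor_plain`, `demo_vendor_patched`, `demo_activations`) are made up for the demonstration, with `demo_activations` standing in for `activations`.

```python
import importlib
import os
import sys
import tempfile

# The three lines the README asks to add at the top of bigvgan.py.
PATCH = (
    "import os\n"
    "import sys\n"
    "sys.path.append(os.path.dirname(os.path.abspath(__file__)))\n"
)


def make_demo_package(root, name, patched):
    """Create a package whose main.py imports a sibling module by bare name."""
    pkg_dir = os.path.join(root, name)
    os.makedirs(pkg_dir)
    open(os.path.join(pkg_dir, "__init__.py"), "w").close()
    with open(os.path.join(pkg_dir, "demo_activations.py"), "w") as f:
        f.write("NAME = 'demo_activations'\n")
    with open(os.path.join(pkg_dir, "main.py"), "w") as f:
        f.write((PATCH if patched else "") + "import demo_activations\n")


def try_import(module_name):
    """Import a module and report the outcome as a short string."""
    try:
        importlib.import_module(module_name)
        return "ok"
    except ModuleNotFoundError as exc:
        return f"ModuleNotFoundError: {exc}"


with tempfile.TemporaryDirectory() as root:
    sys.path.insert(0, root)
    make_demo_package(root, "demo_vendor_plain", patched=False)
    make_demo_package(root, "demo_vendor_patched", patched=True)
    importlib.invalidate_caches()
    # Unpatched first: the sibling module is not on sys.path, so this fails.
    plain_result = try_import("demo_vendor_plain.main")
    # Patched: main.py adds its own directory to sys.path before the import.
    patched_result = try_import("demo_vendor_patched.main")
    sys.path.remove(root)

print(plain_result)
print(patched_result)
```

If the error persists after patching, one thing worth checking is that the added lines sit above the `import activations` line and that the edited file is the one actually being imported.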

SWivid commented 4 hours ago

def load_vocoder(vocoder_name="bigvgan", is_local=True, local_path="C:/newtts/F5-TTS/ckpts/F5TTS_Base_bigvgan/model_1250000.pt", device=device):

The local_path for load_vocoder is the path of the vocoder, not of the TTS checkpoint.

if is_local: """download from https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x/tree/main"""

So it needs to be set to "xxxxx/xxxx/bigvgan_v2_24khz_100band_256x/".

I may have misunderstood you. If by "downloaded the f5 bigvgan checkpoint" you meant the TTS checkpoint rather than the vocoder, and you are able to connect to Hugging Face directly to pull the vocoder, then you can simply use the original load_vocoder(vocoder_name="bigvgan", is_local=False, local_path="", device=device)

and see what error it outputs.
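
To make the distinction above concrete, here is a small sketch that checks whether a path looks like a downloaded vocoder directory rather than a single TTS checkpoint file. The function `vocoder_path_problems` is hypothetical, and the expected file names (`config.json`, `bigvgan_generator.pt`) are an assumption about what the Hugging Face vocoder repository contains; adjust them to match what you actually downloaded.

```python
import os
import tempfile

# Assumed contents of a downloaded BigVGAN vocoder directory.
EXPECTED_FILES = ("config.json", "bigvgan_generator.pt")


def vocoder_path_problems(local_path):
    """Return a list of problems; an empty list means the path looks plausible."""
    if os.path.isfile(local_path):
        return [f"{local_path} is a single file, not a vocoder directory"]
    if not os.path.isdir(local_path):
        return [f"{local_path} does not exist"]
    return [
        f"missing {name}"
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(local_path, name))
    ]


with tempfile.TemporaryDirectory() as root:
    # A stand-in for the TTS checkpoint file passed by mistake.
    tts_ckpt = os.path.join(root, "model_1250000.pt")
    open(tts_ckpt, "wb").close()
    # A stand-in for a correctly downloaded vocoder directory.
    vocoder_dir = os.path.join(root, "bigvgan_v2_24khz_100band_256x")
    os.makedirs(vocoder_dir)
    for name in EXPECTED_FILES:
        open(os.path.join(vocoder_dir, name), "wb").close()
    ckpt_problems = vocoder_path_problems(tts_ckpt)
    dir_problems = vocoder_path_problems(vocoder_dir)

print(ckpt_problems)
print(dir_problems)
```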

Mustaphajudi commented 3 hours ago

from third_party.BigVGAN import bigvgan

Still the same error. Maybe BigVGAN needs a different Python version, lower than 3.11? I ran the BigVGAN Gradio demo and it works fine, but when I try to use it via F5 inference it shows this error:

      File "C:\newtts\F5-TTS\src\f5_tts\infer\utils_infer.py", line 109, in load_vocoder
        from third_party.BigVGAN import bigvgan
      File "C:\newtts\F5-TTS\src\third_party\BigVGAN\bigvgan.py", line 19, in <module>
        import activations
    ModuleNotFoundError: No module named 'activations'

SWivid commented 3 hours ago

Maybe you could provide the files you used in a zip for us?

ModuleNotFoundError: No module named 'activations'

What does your bigvgan.py look like, e.g.
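
One way to answer that question quickly is to check whether the patch line from the README appears in the file before the failing `import activations` line. The sketch below does that on source text; `patch_precedes_import` is a hypothetical helper, and the two sample sources are invented stand-ins, not the real contents of `bigvgan.py`.

```python
PATCH_LINE = "sys.path.append(os.path.dirname(os.path.abspath(__file__)))"
IMPORT_LINE = "import activations"


def patch_precedes_import(source):
    """True if the sys.path patch appears before the first bare `import activations`."""
    lines = [line.strip() for line in source.splitlines()]
    if PATCH_LINE not in lines:
        return False
    if IMPORT_LINE not in lines:
        return True
    return lines.index(PATCH_LINE) < lines.index(IMPORT_LINE)


# Invented examples standing in for an unpatched and a patched file.
unpatched_source = "import torch\nimport activations\n"
patched_source = (
    "import os\n"
    "import sys\n"
    + PATCH_LINE + "\n"
    + "import torch\n"
    + "import activations\n"
)

print(patch_precedes_import(unpatched_source))
print(patch_precedes_import(patched_source))
```

To run it against a real file, read the file's text and pass it to the function, for example with `open(path, encoding="utf-8").read()`.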