SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

How to learn new datasets additionally with my fine-tuned model? #432

Closed · airar-dev closed this issue 4 days ago

airar-dev commented 2 weeks ago

Checks

Environment Details

RunPod, 1x H100 SXM, 24 vCPU, 251 GB RAM; Python and CUDA per the F5-TTS default setup

Steps to Reproduce

My method:

  1. Start from my fine-tuned model: model_last.pt

  2. Add the new dataset

  3. Run finetune_gradio.py:

     1. New Project

     2. Transcribe Data

     3. Vocab Check

     4. Prepare Data

     5. Train Data, with "Path to the Pretrained Checkpoint" set to /workspace/F5-TTS/ckpts/my_new_dataset/model_last.pt

     6. Start Training

✔️ Expected Behavior

Continue training my fine-tuned model_last.pt (itself fine-tuned from the F5-TTS base model) on an additional dataset.

❌ Actual Behavior

I created my own fine-tuned model from the F5-TTS base model.

I want to train it further on additional data, but the following error occurs:

RuntimeError: Error(s) in loading state_dict for EMA: size mismatch for ema_model.transformer.text_embed.text_embed.weight: copying a param with shape torch.Size


Maybe it's a problem with which vocab.txt I'm using.

Which vocab.txt should I use?

Or is it something else?

I need guidance on further training my own fine-tuned model, not the F5-TTS base.

To be clear, my initial model was itself fine-tuned from the F5-TTS base.
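
The mismatch can be checked directly: the number of rows in the checkpoint's text-embedding weight must line up with the vocab size the new run uses (some configurations reserve an extra filler-token row, so a fixed off-by-one can be normal). A minimal diagnostic sketch, assuming the checkpoint layout implied by the error message; the vocab path is a placeholder:

```python
import torch

ckpt_path = "/workspace/F5-TTS/ckpts/my_new_dataset/model_last.pt"  # the fine-tuned checkpoint from the steps above
vocab_path = "vocab.txt"  # placeholder: the vocab.txt the new training run points at

with open(vocab_path, encoding="utf-8") as f:
    vocab_size = len(f.read().splitlines())

ckpt = torch.load(ckpt_path, map_location="cpu")
# The error message names an EMA state dict whose keys start with "ema_model.";
# the top-level key holding it may differ, so fall back to the checkpoint itself.
state = ckpt.get("ema_model_state_dict", ckpt)
embed = state["ema_model.transformer.text_embed.text_embed.weight"]

print(f"vocab.txt entries:     {vocab_size}")
print(f"checkpoint embed rows: {embed.shape[0]}")
# If these disagree, the run's vocab.txt does not match the checkpoint: either
# reuse the vocab.txt the checkpoint was trained with, or extend the embedding
# weights to the new size.
```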

Thank you.

SWivid commented 2 weeks ago

my initial model was itself fine-tuned from the F5-TTS base.

Use the same vocab.txt as the fine-tuned model, or extend the embedding weights using finetune_gradio; see the training README.

airar-dev commented 2 weeks ago

Thank you for your reply

[screenshot of my finetune_gradio setup]

Is the process in the screenshot above correct?

I'm following it, but I still get the same error:

RuntimeError: Error(s) in loading state_dict for EMA: size mismatch for ema_model.transformer.text_embed.text_embed.weight: copying a param with shape torch.Size

What am I missing?

I'd really appreciate any guidance.

Thank you again!

I initially fine-tuned the F5-TTS base model with 50 hours of data in my language. I plan to add another 1000 hours of training data. Due to GPU limitations, I will train in increments of 50 hours. Ultimately, my goal is to create a fully fine-tuned model_last.pt after completing the entire 1000 hours of training.
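
For each 50-hour round, the safe pattern is to keep the vocab as a growing superset: entries from the previous round keep their positions, and any new symbols are appended at the end, so only the appended embedding rows need initializing. A minimal sketch of that merge (file names are placeholders, not F5-TTS conventions):

```python
# Extend the previous round's vocab.txt with any new symbols, preserving the
# old entries' order so existing embedding rows keep their meaning.
old_path, new_path, merged_path = "vocab_old.txt", "vocab_new.txt", "vocab_merged.txt"  # placeholder paths

with open(old_path, encoding="utf-8") as f:
    old_vocab = f.read().splitlines()
with open(new_path, encoding="utf-8") as f:
    new_vocab = f.read().splitlines()

seen = set(old_vocab)
added = [tok for tok in new_vocab if tok not in seen]  # symbols only the new data uses

with open(merged_path, "w", encoding="utf-8") as f:
    f.write("\n".join(old_vocab + added) + "\n")

print(f"{len(added)} new symbols appended; extend the checkpoint's text embed by this count.")
```

Each round then resumes with "Path to the Pretrained Checkpoint" pointing at the previous round's (extended) model_last.pt.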

SWivid commented 2 weeks ago

The pretrained model needs to be extended along with the vocab; see https://github.com/SWivid/F5-TTS/blob/3fcdbc70b4a9d4299e1ecd0b5a1c35209f23fd69/src/f5_tts/train/finetune_gradio.py#L1059-L1115, in which the text embed weight is extended, and also https://github.com/SWivid/F5-TTS/blob/3fcdbc70b4a9d4299e1ecd0b5a1c35209f23fd69/src/f5_tts/train/finetune_gradio.py#L1112
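
The idea in the linked code can be sketched as follows. This is a simplified illustration, not a verbatim copy of finetune_gradio.py; the state-dict key names follow the error message above, so inspect your checkpoint's keys if the layout differs:

```python
import torch

def expand_text_embed(ckpt_path: str, new_ckpt_path: str, num_new_tokens: int) -> None:
    """Grow the text-embedding matrix by num_new_tokens rows, keeping trained rows."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    embed_key = "transformer.text_embed.text_embed.weight"
    # A checkpoint may hold both a raw and an EMA state dict; the EMA keys
    # carry an "ema_model." prefix (as seen in the error message).
    for sd_key, prefix in [("model_state_dict", ""), ("ema_model_state_dict", "ema_model.")]:
        state = ckpt.get(sd_key)
        if state is None:
            continue
        key = prefix + embed_key
        if key not in state:
            continue  # inspect state.keys() if the layout differs
        old = state[key]                                   # shape: (old_vocab_size, dim)
        new = torch.zeros(old.shape[0] + num_new_tokens, old.shape[1], dtype=old.dtype)
        new[: old.shape[0]] = old                          # keep the trained rows
        torch.nn.init.normal_(new[old.shape[0]:], std=0.02)  # init only the added rows
        state[key] = new
    torch.save(ckpt, new_ckpt_path)

# num_new_tokens should equal len(new vocab) - len(old vocab), e.g.:
# expand_text_embed("model_last.pt", "model_last_extended.pt", num_new_tokens=42)
```

Training then resumes from the extended checkpoint together with the extended vocab.txt, so the embedding shape and vocab size agree again.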

airar-dev commented 2 weeks ago

Thank you!

I am testing again after modifying finetune_gradio.py.

Thank you so much. I'll share the test results.

Thank you again!

SWivid commented 4 days ago

Will close this issue; feel free to reopen if there are further questions.