SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Training from scratch for English model, somewhat listenable after 180k updates. #548

Open Quentin1168 opened 4 days ago

Quentin1168 commented 4 days ago

Environment Details

PyTorch 2.4.0, Python 3.11, CUDA 12.4, Ubuntu 22.04

Steps to Reproduce

I am trying to train an English model from scratch. I have ~400 hours of audio from the LibriTTS-R dataset, combining the dev_clean, test_clean, and train_clean_360 subsets.

I am currently renting a single A100 SXM GPU with 80 GB of VRAM.

Training parameters:

- Initial LR: 0.000075
- Batch size: 38400 on 1 GPU, batch size type: frame
- Max samples: 64
- Gradient accumulation steps: 1
- Max gradient norm: 1
- Warmup updates: 10k
- Precision: bf16
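For context, here is a quick back-of-the-envelope sketch of what that frame-based batch size corresponds to in audio time, assuming the mel settings F5-TTS appears to use by default (24 kHz sample rate, hop length 256); treat those constants as assumptions and verify them against your preprocessing config:

```python
# Back-of-the-envelope check of what a frame-based batch size of 38,400 means.
# The mel settings below (24 kHz audio, hop length 256) are assumptions about
# the default F5-TTS preprocessing -- verify against your own config.
sample_rate = 24_000        # Hz (assumed)
hop_length = 256            # samples per mel frame (assumed)
frames_per_update = 38_400  # batch size in frames, single GPU

seconds_per_frame = hop_length / sample_rate                   # ~0.0107 s
audio_seconds_per_update = frames_per_update * seconds_per_frame
print(f"~{audio_seconds_per_update / 60:.1f} minutes of audio per update")
# -> roughly 6.8 minutes of audio per optimizer step
```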

✔️ Expected Behavior

Having seen other benchmarks that used much smaller datasets and different batch sizes, I expected my results to be better than what I currently have.

The audio is jumbled and gibberish in most parts, with some intelligible phrases heard clearly, mostly at the end.

An example. Text to be generated: "How is this still doing this badly? I don't know what is going on. I told my boss I'll get this to him by Monday but I don't think it's going to happen." Result: https://voca.ro/19IKC1qVps3k

Here are the loss and LR graphs: [screenshot: loss and learning-rate curves]

Would I need to train longer? I am currently at 190k updates and still training. The loss has stopped going down for a few hours; would that mean my model will cease improving? I am new to this and was only able to put this training setup together by following the guides here. I would greatly appreciate any help regarding this issue.

❌ Actual Behavior

No response

danielw97 commented 4 days ago

I'm by no means an expert, and it may be best to hear from the project authors, who I'm sure will also chime in. From my reading, this model requires quite a bit of training initially, particularly if you're not finetuning. As someone who's also interested in training from scratch, I'm curious how you get on with this. Issue #509 is where I got a lot of the information on the amount of training required. Hope this helps a bit.

SWivid commented 4 days ago

The loss is fine; it's normal that after warmup and several thousand steps it enters a phase where it decreases only very slowly.

We haven't tried LibriTTS-R, but we have tried LibriTTS.

We trained the small model (~155M) with 38400 frames × 8 GPUs; 200k updates gives pretty good results, and 500k is a reasonable point to stop, with impressive performance on the English test set (a certain zero-shot ability, but obviously inferior to our base model, which is trained on more diverse data).

So just train longer; it will get better.
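To put the single-GPU run in perspective against the setup described above, here is a small sketch using only numbers quoted in this thread; the gradient-accumulation idea at the end is a suggestion of mine, not something the authors prescribe:

```python
# Numbers taken from this thread: the authors used 38,400 frames per GPU on
# 8 GPUs; the run in question uses the same per-GPU batch on a single GPU.
frames_per_gpu = 38_400
author_frames_per_update = frames_per_gpu * 8     # 307,200 frames per update
single_gpu_frames_per_update = frames_per_gpu * 1

ratio = author_frames_per_update / single_gpu_frames_per_update
print(f"The authors' effective batch is ~{ratio:.0f}x larger per update.")
# A possible (untested) way to narrow the gap on one GPU is gradient
# accumulation, e.g. grad_accumulation_steps = 8, at the cost of each
# update taking roughly 8x longer in wall-clock time.
```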

Quentin1168 commented 4 days ago

> The loss is fine; it's normal that after warmup and several thousand steps it enters a phase where it decreases only very slowly.
>
> We haven't tried LibriTTS-R, but we have tried LibriTTS.
>
> We trained the small model (~155M) with 38400 frames × 8 GPUs; 200k updates gives pretty good results, and 500k is a reasonable point to stop, with impressive performance on the English test set (a certain zero-shot ability, but obviously inferior to our base model, which is trained on more diverse data).
>
> So just train longer; it will get better.

Hey, thank you so much for your reply. How would I change the size of my model? I haven't touched anything regarding the model size, so I assume I am training the large one. Thank you.

SWivid commented 4 days ago

If you are on the latest version, check the README for training; you can simply set this through the Hydra config YAMLs.

Otherwise, on an older version, change the hyperparameters in train.py. We mention the small model config in our paper, and you can also find the file with the model params under scripts/.
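For later readers, a rough sketch of the two architecture configs being discussed; the ~155M figure matches what SWivid quotes above, while the remaining values are my recollection from the paper and should be treated as assumptions to check against the config files in your version of the repo:

```python
# Illustrative only: the Base vs. Small DiT hyperparameters as I understand
# them from the F5-TTS paper. Treat these values as assumptions and confirm
# against the Hydra config YAMLs (or train.py on older versions) in your repo.
F5TTS_BASE_ARCH = dict(dim=1024, depth=22, heads=16, ff_mult=2,
                       text_dim=512, conv_layers=4)   # ~336M parameters
F5TTS_SMALL_ARCH = dict(dim=768, depth=18, heads=12, ff_mult=2,
                        text_dim=512, conv_layers=4)  # ~155M parameters
```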

ZhikangNiu commented 4 days ago

@Quentin1168 Based on my experimental results, I do not recommend using the LibriTTS-R dataset; instead, I suggest using the LibriTTS dataset.

Quentin1168 commented 4 days ago

> @Quentin1168 Based on my experimental results, I do not recommend using the LibriTTS-R dataset; instead, I suggest using the LibriTTS dataset.

I thought the R version would have clearer speech. May I ask why? I just want to understand.

Quentin1168 commented 4 days ago

> I'm by no means an expert, and it may be best to hear from the project authors, who I'm sure will also chime in. From my reading, this model requires quite a bit of training initially, particularly if you're not finetuning. As someone who's also interested in training from scratch, I'm curious how you get on with this. Issue #509 is where I got a lot of the information on the amount of training required. Hope this helps a bit.

Hey, since you are interested in the progress, here is how it is going at 280k updates.

https://voca.ro/1iQKdblP7fM4

It's a bit better than the one at 200k updates, but still a bit jumbled, with a lot of repeated words. I think it does better on very short sentences.

ZhikangNiu commented 4 days ago

> @Quentin1168 Based on my experimental results, I do not recommend using the LibriTTS-R dataset; instead, I suggest using the LibriTTS dataset.
>
> I thought the R version would have clearer speech. May I ask why? I just want to understand.

Enhanced audio may sound better to the human ear, but it can be detrimental to the model. The results I obtained using LibriTTS-R had quite poor WER and SIM.

leoiania commented 3 days ago

Thank you @Quentin1168 for sharing your results. Can I ask how long it took you to get to 280k updates on an A100 SXM with 80 GB of VRAM? It could be useful information.

ZhikangNiu commented 3 days ago

On 8× H100 with all the LibriTTS train sets, the F5 small model needs about 24 hours for 300k updates, so the time will roughly double on 8× A100.

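Extrapolating those figures to the single-A100 setup in this thread, under the (admittedly rough) assumptions spelled out in the comments:

```python
# Very rough wall-clock extrapolation from the figures quoted above.
updates = 300_000
hours_8x_h100 = 24                   # quoted: F5 small, full LibriTTS, 8x H100
a100_vs_h100_slowdown = 2.0          # quoted: "the time will double" on A100
sec_per_update_h100 = hours_8x_h100 * 3600 / updates       # ~0.29 s per update
# In a data-parallel run each GPU processes its own 38,400-frame batch, so a
# single A100 should take roughly as long per update as each GPU in an
# 8x A100 run -- it just sees an 8x smaller effective batch (sync overhead
# and dataloading differences ignored; this is an assumption, not a benchmark).
est_hours_1x_a100 = updates * sec_per_update_h100 * a100_vs_h100_slowdown / 3600
print(f"~{est_hours_1x_a100:.0f} h for 300k updates on a single A100")  # ~48 h
```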

Quentin1168 commented 1 day ago

> If you are on the latest version, check the README for training; you can simply set this through the Hydra config YAMLs.
>
> Otherwise, on an older version, change the hyperparameters in train.py. We mention the small model config in our paper, and you can also find the file with the model params under scripts/.

Here's an update. I switched to the small model and have been training with the same parameters and cfg_strength = 3.0 (modified in utils_infer).
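As a side note on cfg_strength, here is a minimal sketch of the classifier-free-guidance combination it scales; this illustrates the general technique rather than the exact code in utils_infer, whose parameterization you should check in your version:

```python
# Conceptual sketch of what cfg_strength scales during sampling. This is the
# general classifier-free-guidance idea, not the literal code in utils_infer;
# parameterizations differ between repos, so check the exact form in yours.
import torch

def apply_cfg(pred_cond: torch.Tensor, pred_uncond: torch.Tensor,
              cfg_strength: float) -> torch.Tensor:
    # Push the conditional prediction away from the unconditional one.
    # Larger values follow the text/reference more strongly, but setting it
    # too high can make the output sound strained or unnatural.
    return pred_cond + cfg_strength * (pred_cond - pred_uncond)
```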

Here is the loss graph: [screenshot: loss curve]

Here are some samples of the same sentence above:

https://voca.ro/1mBNluWVXkki https://voca.ro/1iahPbxu1vcr

I'll probably follow Zhikang's suggestion of using the normal LibriTTS, but I have limited resources with this hardware. Should I wait it out a bit more, or stop now and retrain with the normal LibriTTS later?

Edit: Here are some examples using samples from the dataset:

Ref: https://voca.ro/1CSkqi5GgpxJ Gen: https://voca.ro/1nkJB7ccZFWh

The samples from the dataset seem to do pretty well, but they are short sentences. The model seems to do well on short sentences in general, but falls apart on longer ones (two or more phrases).
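Since the model holds up on short sentences, one workaround worth trying is chunking long inputs at clause boundaries before inference and stitching the audio back together. The sketch below covers only the text-splitting half, with a hypothetical max_chars threshold; how each chunk is then synthesized depends on your inference setup and is not shown:

```python
# A workaround worth trying for long inputs: split the text at clause
# boundaries and synthesize each chunk separately with the same reference
# audio, then concatenate the results. Only the text splitting is shown here.
import re

def split_into_clauses(text: str, max_chars: int = 120) -> list[str]:
    """Split on sentence/clause punctuation, then merge short pieces."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?,;])\s+", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks

print(split_into_clauses(
    "How is this still doing this badly? I don't know what is going on. "
    "I told my boss I'll get this to him by Monday but I don't think "
    "it's going to happen."
))
```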