Training with longer audio samples (60s paragraphs)

SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"

https://arxiv.org/abs/2410.06885

MIT License


Training with longer audio samples (60s paragraphs) #478

Open zakmez opened 5 hours ago

zakmez commented 5 hours ago

Checks

[X] This template is only for question, not feature requests or bug reports.
[X] I have thoroughly reviewed the project documentation and read the related paper(s).
[X] I have searched for existing issues, including closed ones, no similar questions.
[X] I confirm that I am using English to submit this report in order to facilitate communication.

Question details

I'm fine-tuning F5-TTS for speech synthesis and need to handle longer, paragraph-level audio samples (~60 seconds) in the training dataset. Is this possible, and if so, how?

SWivid commented 4 hours ago

Change the limit here https://github.com/SWivid/F5-TTS/blob/333d99ab6c8a4ae9e945a6de012d0bcc2462f754/src/f5_tts/model/dataset.py#L143 to 60s, and add the following between lines 187 and 188:

desc_threshold = 30 * data_source.target_sample_rate / data_source.hop_length

and change line 191 to:

if (len(batch)+1) * frame_len <= self.frames_threshold / max(frame_len/desc_threshold, 1) and (max_samples == 0 or len(batch) < max_samples):

If training goes well (i.e. you see GPU memory usage evenly distributed), you're welcome to report back.
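To get a feel for what the modified condition on line 191 does to batch sizes, here is a minimal standalone sketch. It is not the repository's code: the function names `fits_in_batch` and `max_batch_size` are hypothetical, and the values used (24 kHz sample rate, hop length 256, a frames threshold of 38400) are assumptions chosen for illustration, so substitute the values from your own config. The idea appears to be that once a sample is longer than 30s, the allowed frame budget shrinks in proportion to its length, so long samples end up in smaller batches.

```python
def fits_in_batch(batch_size, frame_len, frames_threshold, desc_threshold, max_samples=0):
    """Mirror of the suggested condition, with len(batch) passed in as batch_size."""
    # Shrink the frame budget for samples longer than desc_threshold frames.
    effective_threshold = frames_threshold / max(frame_len / desc_threshold, 1)
    # Estimated batch cost if one more sample of this length is added.
    padded_frames = (batch_size + 1) * frame_len
    return padded_frames <= effective_threshold and (
        max_samples == 0 or batch_size < max_samples
    )


def max_batch_size(frame_len, frames_threshold, desc_threshold, max_samples=0):
    """Count how many equal-length samples the condition admits into one batch."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n = 0
    while fits_in_batch(n, frame_len, frames_threshold, desc_threshold, max_samples):
        n += 1
    return n


# Assumed values for illustration only (not taken from this thread).
target_sample_rate = 24000
hop_length = 256
frames_threshold = 38400

# Same formula as the line added between 187 and 188: frames in 30 seconds.
desc_threshold = 30 * target_sample_rate / hop_length

frames_60s = 60 * target_sample_rate // hop_length  # 5625 frames
frames_20s = 20 * target_sample_rate // hop_length  # 1875 frames

print("60s samples per batch:", max_batch_size(frames_60s, frames_threshold, desc_threshold))
print("20s samples per batch:", max_batch_size(frames_20s, frames_threshold, desc_threshold))
```

Under these assumed numbers, a 60s sample halves the frame budget (its length is twice the 30s threshold), while samples at or below 30s are batched against the full budget as before.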