Open zakmez opened 5 hours ago
Change line 143 here https://github.com/SWivid/F5-TTS/blob/333d99ab6c8a4ae9e945a6de012d0bcc2462f754/src/f5_tts/model/dataset.py#L143 to 60s, and add the following between lines 187 and 188:
desc_threshold = 30 * data_source.target_sample_rate / data_source.hop_length
Then change line 191 to:
if (len(batch)+1) * frame_len <= self.frames_threshold / max(frame_len/desc_threshold, 1) and (max_samples == 0 or len(batch) < max_samples):
If training goes well (check that GPU memory use is evenly distributed), you are welcome to report back.
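To get a feel for what the modified condition does to batch sizes, here is a minimal standalone sketch. It is not the actual sampler in `dataset.py`: the function name `make_batches` is hypothetical, the handling of a sample that does not fit is simplified, and the numbers used (24000 Hz sample rate, hop length 256, a frame budget of 38400) are assumptions chosen for illustration. It only reuses the condition and the `desc_threshold` formula quoted above.

```python
def make_batches(frame_lens, frames_threshold, desc_threshold, max_samples=0):
    """Group sample indices into batches using the condition quoted above.

    Hypothetical helper for illustration; not part of F5-TTS.
    """
    # Process samples from shortest to longest, so the current sample is
    # always the longest one in the batch being built.
    order = sorted(range(len(frame_lens)), key=lambda i: frame_lens[i])
    batches, batch = [], []
    for idx in order:
        frame_len = frame_lens[idx]
        # The frame budget shrinks once a sample is longer than desc_threshold.
        budget = frames_threshold / max(frame_len / desc_threshold, 1)
        fits = (len(batch) + 1) * frame_len <= budget
        under_cap = max_samples == 0 or len(batch) < max_samples
        if fits and under_cap:
            batch.append(idx)
        else:
            if batch:
                batches.append(batch)
            # Simplification: start a new batch with this sample if it fits
            # on its own, otherwise skip it.
            batch = [idx] if frame_len <= budget else []
    if batch:
        batches.append(batch)
    return batches


# Assumed values: 24000 Hz audio with hop length 256, so 30 s = 2812.5 frames.
desc_threshold = 30 * 24000 / 256
# Four short clips (1000 frames each) and three 60 s clips (5625 frames each).
frame_lens = [1000] * 4 + [5625] * 3
batches = make_batches(frame_lens, frames_threshold=38400, desc_threshold=desc_threshold)
print(batches)
```

Under these assumed numbers, a 60 s clip is twice `desc_threshold`, so its frame budget is halved to 19200 and at most three such clips share a batch, while the short clips keep the full budget.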
Question details
I'm fine-tuning F5-TTS for speech synthesis and need to handle longer, paragraph-level audio samples (~60 seconds) in the training dataset. Is this possible, and if so, how?