SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License
7.25k stars, 863 forks

16kHz training produces noisy audio #491

Open rumourscape opened 3 hours ago

rumourscape commented 3 hours ago

Environment Details

Ubuntu 22.04, CUDA 12.4, PyTorch 2.5.0. Vocoder: BigVGAN trained on 16 kHz audio with the same config as given below.

Steps to Reproduce

Train a model from scratch using the following configurations:

target_sample_rate = 16000
n_mel_channels = 80
hop_length = 160
win_length = 640
n_fft = 1024
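For reference, the time and frequency resolution implied by these settings can be worked out directly. The following is a minimal sketch in plain Python; the helper name `describe_mel_config` is hypothetical and is not part of F5-TTS or BigVGAN. It only does arithmetic on the five values listed above.

```python
def describe_mel_config(sample_rate, hop_length, win_length, n_fft, n_mels):
    """Derive basic time/frequency figures from STFT and mel settings.

    Hypothetical helper for illustration only; not an F5-TTS API.
    """
    if win_length > n_fft:
        # An analysis window longer than the FFT size cannot be used as-is.
        raise ValueError("win_length must not exceed n_fft")
    return {
        "frames_per_second": sample_rate / hop_length,
        "hop_ms": 1000.0 * hop_length / sample_rate,
        "window_ms": 1000.0 * win_length / sample_rate,
        "nyquist_hz": sample_rate / 2,
        "fft_bins": n_fft // 2 + 1,
        "bin_width_hz": sample_rate / n_fft,
        "n_mels": n_mels,
    }


# Values taken from the configuration in this issue.
info = describe_mel_config(
    sample_rate=16000, hop_length=160, win_length=640, n_fft=1024, n_mels=80
)
for name, value in info.items():
    print(f"{name}: {value}")
```

With the values from this issue, that is 100 mel frames per second, a 10 ms hop, a 40 ms window, and an 8 kHz Nyquist limit.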

✔️ Expected Behavior

Should produce clean speech audio, like the 24 kHz model does.

❌ Actual Behavior

The model produces very noisy speech, although it has learned to pronounce words well. Sample output from my 16 kHz model: https://asr.iitm.ac.in/cdac/gpu16/sample.wav

ZhikangNiu commented 2 hours ago

The pretrained BigVGAN doesn't support 16 kHz audio.

ZhikangNiu commented 2 hours ago

https://github.com/NVIDIA/BigVGAN/tree/main

rumourscape commented 2 hours ago

> The pretrained BigVGAN doesn't support 16 kHz audio.

I trained BigVGAN from scratch for 16 kHz, as mentioned in the environment details. I have also tried other 16 kHz vocoders such as HiFi-GAN with the same config, and I still get noisy audio.
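One way to double-check that the acoustic model and the vocoder really share the same mel settings is to compare the two configs key by key. This is a minimal sketch, assuming both configs are available as plain dicts using the key names from this issue. The function name `mel_config_mismatches` and the vocoder values below are hypothetical placeholders, not taken from the thread, and a real vocoder config may use different key names that would need mapping first.

```python
# Key names as written in this issue; other codebases may name them differently.
MEL_KEYS = (
    "target_sample_rate",
    "n_mel_channels",
    "hop_length",
    "win_length",
    "n_fft",
)


def mel_config_mismatches(acoustic_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every differing key.

    A key missing from either dict is reported with None on that side.
    """
    mismatches = {}
    for key in keys:
        acoustic_value = acoustic_cfg.get(key)
        vocoder_value = vocoder_cfg.get(key)
        if acoustic_value != vocoder_value:
            mismatches[key] = (acoustic_value, vocoder_value)
    return mismatches


# Acoustic model settings from this issue.
acoustic_cfg = {
    "target_sample_rate": 16000,
    "n_mel_channels": 80,
    "hop_length": 160,
    "win_length": 640,
    "n_fft": 1024,
}

# Hypothetical vocoder config with one deliberately different value,
# used only to show what a mismatch report looks like.
vocoder_cfg = dict(acoustic_cfg, win_length=1024)

print(mel_config_mismatches(acoustic_cfg, vocoder_cfg))
```

An empty result only says that these five values agree. Other settings that are not listed in the issue, such as the mel frequency range or the log compression, would need the same check.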

ZhikangNiu commented 2 hours ago

How many training steps and which dataset?

rumourscape commented 2 hours ago

> How many training steps and which dataset?

I used a 14-hour Indian English dataset (https://huggingface.co/datasets/SPRINGLab/smt_english) and trained for close to 2000 epochs, or about 500k steps.
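As a rough sanity check on these numbers, the amount of audio seen per step can be estimated. This sketch assumes that each epoch passes over all 14 hours exactly once and that the rounded figures quoted above (2000 epochs, 500k steps) are exact, so the result is only an order-of-magnitude estimate. The variable names are illustrative.

```python
# Figures quoted in this thread (rounded by the poster).
dataset_hours = 14
epochs = 2000
total_steps = 500_000

# From the posted config: 16000 Hz sample rate, hop_length of 160.
frames_per_second = 16000 / 160

dataset_seconds = dataset_hours * 3600
steps_per_epoch = total_steps / epochs

# Assumption: one epoch covers the whole dataset once.
audio_seconds_per_step = dataset_seconds / steps_per_epoch
mel_frames_per_step = dataset_seconds * frames_per_second / steps_per_epoch

print(f"steps per epoch:        {steps_per_epoch:.0f}")
print(f"audio per step (s):     {audio_seconds_per_step:.1f}")
print(f"mel frames per step:    {mel_frames_per_step:.0f}")
```

Under those assumptions this comes to about 250 steps per epoch and roughly 200 seconds of audio per step.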