16kHz training produces noisy audio

rumourscape commented 3 hours ago

Checks

[X] This template is only for bug reports, usage problems go with 'Help Wanted'.
[X] I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
[X] I have searched for existing issues, including closed ones, and couldn't find a solution.
[X] I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

Ubuntu 22.04, CUDA 12.4, PyTorch 2.5.0. Vocoder: BigVGAN trained on 16kHz audio with the same config as given below.

Steps to Reproduce

Train a model from scratch using the following configurations:

target_sample_rate = 16000
n_mel_channels = 80
hop_length = 160
win_length = 640
n_fft = 1024
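For reference, here is a minimal sketch of what these settings imply for the mel front end. The values are copied from the config above; the helper name `describe_mel_config` and the derived field names are made up for illustration and are not part of the project. Everything it reports is plain arithmetic on the config, so it says nothing about where the noise comes from; it only makes it easier to compare against whatever the vocoder was trained with.

```python
# Values copied from the issue. describe_mel_config is a hypothetical helper,
# not a function from F5-TTS or BigVGAN.
config = {
    "target_sample_rate": 16000,
    "n_mel_channels": 80,
    "hop_length": 160,
    "win_length": 640,
    "n_fft": 1024,
}


def describe_mel_config(cfg):
    """Derive frame timing and frequency resolution from an STFT/mel config."""
    sr = cfg["target_sample_rate"]
    return {
        # Number of mel frames produced per second of audio.
        "frames_per_second": sr / cfg["hop_length"],
        # Time step between consecutive frames, in milliseconds.
        "hop_ms": 1000 * cfg["hop_length"] / sr,
        # Duration of the analysis window, in milliseconds.
        "window_ms": 1000 * cfg["win_length"] / sr,
        # Number of one-sided FFT bins feeding the mel filterbank.
        "fft_bins": cfg["n_fft"] // 2 + 1,
        # Highest frequency representable at this sample rate.
        "nyquist_hz": sr / 2,
        # The window is zero-padded up to n_fft, so it must not exceed it.
        "window_fits_fft": cfg["win_length"] <= cfg["n_fft"],
    }


summary = describe_mel_config(config)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Running the same helper on the vocoder's config and comparing the two dictionaries is a quick way to confirm that both models really see the same mel representation.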

✔️ Expected Behavior

Should produce clean speech audio, like the 24kHz model does.

❌ Actual Behavior

The model produces very noisy speech, although it has learned to pronounce words well. Sample output from my 16kHz model: https://asr.iitm.ac.in/cdac/gpu16/sample.wav

ZhikangNiu commented 2 hours ago

Pretrained BigVGAN doesn't support 16kHz audio.

ZhikangNiu commented 2 hours ago

https://github.com/NVIDIA/BigVGAN/tree/main

rumourscape commented 2 hours ago

Pretrained BigVGAN doesn't support 16kHz audio.

I trained BigVGAN from scratch for 16kHz, as mentioned in Environment Details. I have also tried other 16kHz vocoders such as HiFi-GAN with the same config, and I still get noisy audio.

ZhikangNiu commented 2 hours ago

How many training steps and which dataset?

rumourscape commented 2 hours ago

How many training steps and which dataset?

I used a 14-hour Indian English dataset (https://huggingface.co/datasets/SPRINGLab/smt_english) and trained for close to 2000 epochs, or about 500k steps.
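As a rough back-of-envelope on those numbers, the sketch below works out how much audio each training step covers. It assumes that one epoch is a single pass over the full 14 hours and that the 16kHz / hop 160 config from the report is in use; the figures in the comment are approximate ("close to" and "about"), so the result is only an order-of-magnitude estimate.

```python
# Approximate figures quoted in the comment above.
hours_of_audio = 14
epochs = 2000
total_steps = 500_000

# From the reported config: 16000 samples/s with a hop of 160 samples.
frames_per_second = 16000 / 160

# Assumption: every epoch iterates over the whole dataset exactly once.
steps_per_epoch = total_steps / epochs
audio_seconds = hours_of_audio * 3600
seconds_per_step = audio_seconds / steps_per_epoch
frames_per_step = audio_seconds * frames_per_second / steps_per_epoch

print(f"steps per epoch:      {steps_per_epoch:.0f}")
print(f"audio per step (s):   {seconds_per_step:.1f}")
print(f"mel frames per step:  {frames_per_step:.0f}")
```

Under these assumptions each utterance has been seen roughly 2000 times, which is useful context when comparing against models trained on much larger corpora.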

SWivid / F5-TTS