KoeAI / LLVC


Issue in finetuning on new dataset #13

[Open] slimsushi opened this issue 5 months ago

slimsushi commented 5 months ago

Hi,

I tried to finetune the model using the G_500000.pt checkpoint provided in the repo, with a new discriminator trained from scratch (no pretrained checkpoint used for the discriminator). I used a German dataset containing voiced and unvoiced audio recordings, and I converted them to a new voice (with FreeVC) to make the model more robust in real-world scenarios. All audio recordings are resampled to 16 kHz.
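For anyone reproducing this setup, a minimal sanity check on the data is worth running first. This sketch assumes the clips are plain .wav files readable by soundfile; the dataset/ path is a placeholder for your own layout:

```python
import glob

import soundfile as sf

# Verify that every training clip really is 16 kHz mono before finetuning;
# a sample-rate or channel mismatch in the paired data is an easy way to
# end up with degenerate outputs.
for path in glob.glob("dataset/**/*.wav", recursive=True):
    info = sf.info(path)
    if info.samplerate != 16000 or info.channels != 1:
        print(f"{path}: {info.samplerate} Hz, {info.channels} channel(s)")
```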

Now my problem: when testing the conversion with the pretrained model, it converts my audio quite well (results could be better on the distorted examples, but they were OK). But when I use the model finetuned on the new voice, the converted audio examples have lost a lot of gain/signal-to-noise ratio. When listening to the files you cannot hear anything. When plotting the results you can see a waveform, but the amplitude only reaches about 0.2 * 10^-5, which is far lower than with the pretrained model without finetuning (there, the converted examples had an amplitude range of -0.5 to 0.5). When I tried to normalize the audio to the range -1 to 1, it sounded extremely noisy (no voice could be identified).

I tested checkpoints from 10,000 steps up to 100,000 steps and there was no improvement; all had the same problem. I trained for around 150 epochs, or 100,000 steps, with my data (around 10 hours of training, I guess). The same low-SNR problem also occurs when I train the generator from scratch instead of starting from the pretrained G_500000.pt checkpoint.

Is there a known reason why the outputs have such low SNR? Is there something I can try to improve the results? Or is some special preprocessing applied to your dataset that I haven't done? I used a sampling rate of 16 kHz for the originals, and I used FreeVC to create my 16 kHz converted examples in the new voice (all recordings from my dataset sound fine when listening to them). The file format of my recordings is .wav.
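For reference, the level check and normalization described above can be done with something like the following sketch (the filenames are placeholders):

```python
import numpy as np
import soundfile as sf

# Measure how far below full scale the converted output sits, then write
# a peak-normalized copy for listening tests.
audio, sr = sf.read("converted_example.wav")
peak = float(np.abs(audio).max())
print(f"peak = {peak:.2e} ({20 * np.log10(peak + 1e-12):.1f} dBFS)")
sf.write("converted_normalized.wav", audio / (peak + 1e-12) * 0.95, sr)
```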

thanks for your help and great work!

Yaodada12 commented 5 months ago


When I used my own trained RVC single-speaker model to convert the train-360 data into paired data and trained the llvc model from scratch, the loss did not converge at all. After about 55k steps, the loss shot up to NaN and stayed NaN from then on. Have you encountered similar problems? train.log
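One generic way to localize this kind of blow-up, sketched here as a toy PyTorch loop (not the llvc training code), is to fail fast on the first non-finite loss and clip gradient norms:

```python
import torch

# Toy training loop illustrating two standard NaN defenses: abort at the
# first non-finite loss (so the offending step/batch can be inspected)
# and clip gradient norms before each optimizer step.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
for step in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```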