fumiama / Retrieval-based-Voice-Conversion-WebUI

Easily train a good VC model with voice data <= 10 mins!
GNU Affero General Public License v3.0

net_g Synthesizer generates a horrible noise #87

Open AznamirWoW opened 1 month ago

AznamirWoW commented 1 month ago

To Reproduce

1) Train a new voice model on a single 3-second silent audio file for 1 epoch. Run inference on the same 3-second silent audio with the model you created. Analyze the inference result in a spectrogram. Notice the solid lines at several frequencies.

image

2) Modify the RVC code and comment out the line that loads weights from a voice model, leaving the Synthesizer as a blank slate with no trained weights loaded. https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI/blob/96e3d8af403ca695ba460c71169f4e4c5c901d72/rvc/synthesizer.py#L23

3) Run inference on the same 3-second audio file. Analyze the inference result in a spectrogram. Notice the solid lines at every 1 kHz.

image
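The silent input file used in the steps above can be produced with a few lines of Python. This is a minimal sketch, not part of the original report: the function name `write_silence`, the output path, and the 32 kHz mono 16-bit format are assumptions (32 kHz is chosen only because the thread later mentions the 32k pretrained models).

```python
import wave


def write_silence(path, seconds=3.0, sample_rate=32000):
    """Write a mono 16-bit PCM WAV file containing only zero samples.

    The 32 kHz default is an assumption; use the rate your model expects.
    Returns the number of frames written.
    """
    n_frames = int(seconds * sample_rate)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * n_frames)  # digital silence
    return n_frames


# Example call (hypothetical file name):
# write_silence("silence_3s.wav")
```

Because every sample is exactly zero, any energy visible in the spectrogram of the inference output must have been introduced somewhere after the input file.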

Expected behavior

One would expect that training a model from scratch does not introduce artifacts that are not present in the source audio files.

Screenshots

See above.

fumiama commented 1 month ago

Are you using the web UI? If so, check whether the program still loads the pretrained models (named like f0Dxx.pth). If it does, remove them all and try again.

AznamirWoW commented 1 month ago

> Are you using the web ui? In this case, you can check about whether the program still loads the pretrained model (named like f0Dxx.pth). If so, remove them all and try it again.

In my investigation I made two models from a silent file: one using the default pretrained f0D32k.pth + f0G32k.pth:

image

and one without the default pretrained models:

image

Then I ran inference with the resulting models (no index). Since the results were different, I'm sure no pretrained weights were used in the second one. Also, since I manually disabled loading weights into the synthesizer's state_dict, that test's result was different again, and much worse than the tests using trained models.

AznamirWoW commented 1 month ago

There is some difference in the spectrograms with different upsample_rates and upsample_kernel_sizes, but I could not quite eliminate the overlapping.

upsample_rates = [10, 8, 2, 2]

10x 100 Hz lines = 1 kHz band
8x 1 kHz lines = 8 kHz band
2x 8 kHz lines = 16 kHz band
2x = 32 kHz final upsample

So there are 10 faint lines within each 1 kHz band. They overlap at each kHz, so every 1 kHz there is a line twice as strong, and two 8 kHz bands overlap at 8 kHz, creating a line 4x as strong.
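The band arithmetic above can be reproduced with a short calculation. This is only a sketch of the arithmetic, assuming a 32 kHz output sample rate; `stage_rates` is a hypothetical helper, not a function from the RVC code base, and it does not model kernel sizes or how strong each line is.

```python
def stage_rates(sample_rate, upsample_rates):
    """Return the sample rate at the input of each upsampling stage,
    followed by the final output rate.

    Assumption: the product of the upsample rates is the total ratio
    between the frame rate and the output sample rate.
    """
    total = 1
    for r in upsample_rates:
        total *= r
    rates = [sample_rate / total]      # frame rate fed to the first stage
    for r in upsample_rates:
        rates.append(rates[-1] * r)    # rate after each upsampling stage
    return rates


print(stage_rates(32000, [10, 8, 2, 2]))
# [100.0, 1000.0, 8000.0, 16000.0, 32000.0]
```

The intermediate rates of 100 Hz, 1 kHz, 8 kHz and 16 kHz match the spacings listed in the comment above, which is consistent with each stage leaving a periodic pattern at its own input rate.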

some examples:

10,8,4,4 image

10,8,4,2 image

20,16,4,4 image

SayanoAI commented 1 month ago

How are you rendering those graphs in Audacity? I was not able to replicate this with Audacity's default mel spectrogram (voice model applied, original, blank net_g):

image

AznamirWoW commented 1 month ago

> How are you rendering those graphs in audacity? I was not able to replicate this using the default mel spec in audacity (voice model applied, original, blank net_g):

I'm using Spek https://www.spek.cc/p/download

AznamirWoW commented 1 month ago

> How are you rendering those graphs in audacity? I was not able to replicate this using the default mel spec in audacity (voice model applied, original, blank net_g):

Audacity also shows it

image
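Since the two viewers in this exchange render the same audio differently, a numeric check that does not depend on display settings may help. The sketch below uses the Goertzel algorithm to measure the power at a single frequency; it is an illustration rather than anything used in the thread, and the function name, the 32 kHz rate, and the synthetic 1 kHz test tone are all assumptions.

```python
import math


def goertzel_power(samples, sample_rate, freq):
    """Power of `samples` at the DFT bin nearest to `freq` (Goertzel)."""
    n = len(samples)
    k = round(n * freq / sample_rate)      # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2   # second-order recurrence
        s_prev2 = s_prev
        s_prev = s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


# Synthetic stand-in for an inference result: a pure 1 kHz tone at 32 kHz.
sr = 32000
tone = [math.sin(2.0 * math.pi * 1000 * i / sr) for i in range(3200)]

p_on = goertzel_power(tone, sr, 1000)    # bin on the tone
p_off = goertzel_power(tone, sr, 1500)   # bin away from the tone
print(p_on > p_off)
```

Applied to decoded samples from an actual inference result, comparing the power at multiples of 1 kHz with the power between them would show whether the lines are in the audio itself or only an effect of the viewer's settings.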

SayanoAI commented 1 month ago

Did you do anything special with the settings? Why does mine look black? I'm using the default spectrogram settings:

image

AznamirWoW commented 1 month ago

> Did you do anything special with the settings?

It looks like RVC (I used the RVC1006Nvidia.7z build) has an issue with inference on a short silent audio file: it does not do anything, so just like you I could not reproduce it with just a silent WAV. Maybe it automatically applies the index, so the index masks the noise.

But then I trained a model from scratch (one epoch, no pretrained model) and used it without an index to infer a longer audio file, and here it is:

image

Model in question: https://drive.google.com/file/d/142jVVY5194kMaxoa6Teayz9nS49DX2-X/view?usp=sharing
Audio: https://drive.google.com/file/d/1SmzHLol-ShTvcd-aK5kRP_Yp0q5gILaO/view?usp=sharing