AznamirWoW opened this issue 1 month ago
Are you using the web UI? If so, check whether the program still loads the pretrained models (named like f0Dxx.pth). If it does, remove them all and try again.
In my investigation I made two models from a silent file - one using the default pretrained f0D32k.pth + f0G32k.pth,
and one without any default pretrain.
Then I ran inference with both resulting models (no index). Since the results were different, I'm sure the pretrained weights were not being used. I also manually disabled loading the weights into the synthesizer's state_dict; that test's result was different again, and much worse, than the tests using the trained models.
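(A quick way to confirm that two renders actually differ - the file names below are just placeholders for the two inference outputs:)

```python
# Quick sanity check that two inference results really differ.
# File names are placeholders for the two renders being compared.
import numpy as np
import soundfile as sf

a, sr_a = sf.read("infer_no_pretrain.wav")
b, sr_b = sf.read("infer_with_pretrain.wav")
assert sr_a == sr_b, "sample rates should match"
n = min(len(a), len(b))
print("max abs difference:", np.max(np.abs(a[:n] - b[:n])))
```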
There's some difference in the spectrograms when using different upsample_rates and upsample_kernel_sizes, but I could not quite eliminate the overlapping lines.
upsample_rates = [10, 8, 2, 2]
10 × 100 Hz lines = 1 kHz band; 8 × 1 kHz lines = 8 kHz band; 2 × 8 kHz lines = 16 kHz band; 2 × = 32 kHz final upsample.
So there are 10 faint lines within each 1 kHz band, and they overlap at every 1 kHz, so at each 1 kHz multiple the line is twice as strong; the two 8 kHz bands overlap at 8 kHz, creating a 4× strong line.
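A rough sketch of that arithmetic (this is my own model of where the stage-boundary lines land, not RVC code; the 32 kHz output rate is taken from the 32k config above). It just counts how many upsampling stages put a line at each frequency - the frequencies hit by several stages are where the stronger lines show up:

```python
# Rough sketch: for a given upsample_rates list, count how many upsampling
# stages place a boundary line at each frequency. Frequencies hit by several
# stages are where the stronger spectrogram lines appear. Assumes 32 kHz output.
from math import prod
from collections import Counter

def stage_lines(upsample_rates, sr=32000):
    hop = prod(upsample_rates)        # 320 for [10, 8, 2, 2] -> 100 frames/s
    nyquist = sr / 2
    spacing = sr / hop                # 100 Hz: spacing of the finest lines
    hits = Counter()
    for r in upsample_rates:
        band = spacing * r            # this stage repeats its pattern every `band` Hz
        f = spacing
        while f <= nyquist:
            hits[round(f)] += 1       # count how many stages put a line here
            f += spacing
        spacing = band                # the next stage works on the coarser grid
    return hits

hits = stage_lines([10, 8, 2, 2])
print(hits[1000], hits[8000], hits[16000])  # lines at 1 kHz, 8 kHz, 16 kHz stack up
```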
some examples:
10,8,4,4
10,8,4,2
20,16,4,4
How are you rendering those graphs in audacity? I was not able to replicate this using the default mel spec in audacity (voice model applied, original, blank net_g):
I'm using Spek: https://www.spek.cc/p/download
Audacity also shows it
Did you do anything special with the settings? Why does mine look black? I'm using the default spectrogram settings:
It looks like RVC (I used the RVC1006Nvidia.7z build) has an issue with inferring short mute audio - it does not do anything, so just like you I could not reproduce this with just a mute wav. Maybe it automatically applies the index, so the index masks the noise.
But then I trained a model from scratch (one epoch, no pretrained) and used it without an index to infer a longer audio file, and here it is:
Model in question: https://drive.google.com/file/d/142jVVY5194kMaxoa6Teayz9nS49DX2-X/view?usp=sharing
Audio: https://drive.google.com/file/d/1SmzHLol-ShTvcd-aK5kRP_Yp0q5gILaO/view?usp=sharing
To Reproduce
1) Train a new voice model using a single 3-second silent audio file, for 1 epoch. Run inference on the same 3-second silent audio using the model you created. Analyze the inference result with a spectrogram (a minimal sketch follows these steps). Notice solid lines at several frequency levels.
2) Modify the RVC code and comment out the line that loads the weights from a voice model, keeping the Synthesizer as a blank slate: https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI/blob/96e3d8af403ca695ba460c71169f4e4c5c901d72/rvc/synthesizer.py#L23
3) Run inference on the same 3-second audio file and analyze the result with a spectrogram. Notice solid lines at every 1 kHz level.
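A minimal sketch of the spectrogram check in steps 1 and 3 (the libraries and the file name are just examples - Spek or Audacity show the same lines):

```python
# Minimal spectrogram check for steps 1 and 3: render the inference output and
# look for solid horizontal lines at roughly 1 kHz multiples.
import matplotlib.pyplot as plt
import soundfile as sf

audio, sr = sf.read("inference_output.wav")   # placeholder file name
if audio.ndim > 1:
    audio = audio.mean(axis=1)                # downmix to mono

plt.specgram(audio, NFFT=2048, Fs=sr, noverlap=1536, cmap="magma")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Check for solid horizontal lines at ~1 kHz multiples")
plt.show()
```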
Expected behavior: One would expect that training a model from scratch does not introduce artifacts that are not present in the source audio files.
Screenshots: see above.