Vits 48Khz quality - GitHub Issues

dutchsing009 commented 1 year ago

Hello, in your demo for DualCycleGAN, the quality of VITS when synthesizing 48 kHz audio was mostly worse than in the original demo https://jaywalnut310.github.io/vits-demo/index.html, noting that the original VCTK demo was at 22.05 kHz and was trained for 800k steps. What do you think is the reason? Is the author of VITS cherry-picking the results, or is it something else? How many steps did you train your 48 kHz model for? 1 million steps? Was it maybe a problem in the custom config you had? Can you share the 48 kHz VITS config if you can? The original VITS demo shows almost perfect results for multi-speaker VCTK and for its many-to-many voice conversion quality (1:1 with ground truth).

r9y9 commented 1 year ago

Hi, I am one of the co-authors and I performed TTS experiments.

Hello, in your demo for DualCycleGAN, the quality of VITS when synthesizing 48 kHz audio was mostly worse than in the original demo https://jaywalnut310.github.io/vits-demo/index.html, noting that the original VCTK demo was at 22.05 kHz and was trained for 800k steps. What do you think is the reason? Is the author of VITS cherry-picking the results, or is it something else?

First of all, it is difficult to judge which is better without performing subjective tests with enough subjects. It is easy to cherry-pick some samples and say one is better than the other. That being said, the following are possible factors behind the quality difference:

- Learning TTS models for full-band (i.e. 48 kHz) speech is more difficult than for 22.05 kHz speech.
- High-frequency noise in the VCTK corpus may have affected the TTS model, whereas that noise is negligible for (band-limited) 22.05 kHz signals.
- Our train/dev/eval split is different from the one used for the original VITS.
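
To put numbers on the band-limiting point: a signal sampled at a given rate can only represent content below half that rate (the Nyquist frequency). The short sketch below assumes the two rates discussed in this thread are 22,050 Hz and 48,000 Hz; the helper name `nyquist_hz` is made up for illustration and is not part of any VITS codebase.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


# Assumed sampling rates: 22,050 Hz (band-limited) and 48,000 Hz (full-band).
LOW_RATE_HZ = 22050
FULL_BAND_RATE_HZ = 48000

low_limit = nyquist_hz(LOW_RATE_HZ)
full_limit = nyquist_hz(FULL_BAND_RATE_HZ)

# Width of the frequency band that exists only in the full-band signal.
# Any recording noise in this band is invisible to the lower-rate model
# but has to be modelled by the 48 kHz one.
extra_band_hz = full_limit - low_limit

print(f"Nyquist at {LOW_RATE_HZ} Hz: {low_limit} Hz")
print(f"Nyquist at {FULL_BAND_RATE_HZ} Hz: {full_limit} Hz")
print(f"Band only present at 48 kHz: {extra_band_hz} Hz wide")
```

In other words, the 48 kHz model has to reproduce roughly 13 kHz of additional bandwidth, which is where high-frequency noise in the recordings would sit.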

How many steps did you train your 48 kHz model for? 1 million steps?

1000k steps (1 million). Please check our paper for details.

Can you share the 48 kHz VITS config if you can?

https://github.com/jaywalnut310/vits/blob/main/configs/vctk_base.json with the following changes:

data:

<     filter_length: 1024
<     win_length: 1024
<     hop_length: 256
---
>     filter_length: 2048
>     win_length: 1920
>     hop_length: 480

model:

<         upsample_rates: [8,8,2,2]
<         upsample_kernel_sizes: [16,16,4,4]
---
>         upsample_rates: [6,5,4,2,2]
>         upsample_kernel_sizes: [12,11,8,4,4]

Note that we are using our fork of VITS with YAML-based config (not JSON). You may need to tweak your config if you use the original VITS codebase.
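
For anyone adapting the original JSON config, here is a minimal sketch of how the changes above could be applied and sanity-checked. It relies on two constraints of this kind of setup: the product of `upsample_rates` has to equal `hop_length` so that the decoder turns each frame into exactly one hop of waveform samples, and `win_length` must not exceed `filter_length` (the FFT size). The helper names `apply_overrides`, `check_consistency` and `frame_ms` are hypothetical, only the keys from the diff are included, and the 48,000 Hz sampling rate is an assumption because the diff does not list a sampling-rate key.

```python
import copy
from math import prod

# Only the keys touched by the diff above; every other key of
# vctk_base.json is left out of this sketch.
BASE = {
    "data": {"filter_length": 1024, "win_length": 1024, "hop_length": 256},
    "model": {
        "upsample_rates": [8, 8, 2, 2],
        "upsample_kernel_sizes": [16, 16, 4, 4],
    },
}

# The 48 kHz values quoted in the diff.
OVERRIDES_48K = {
    "data": {"filter_length": 2048, "win_length": 1920, "hop_length": 480},
    "model": {
        "upsample_rates": [6, 5, 4, 2, 2],
        "upsample_kernel_sizes": [12, 11, 8, 4, 4],
    },
}


def apply_overrides(base, overrides):
    """Return a copy of `base` with each section updated from `overrides`."""
    merged = copy.deepcopy(base)
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged


def check_consistency(cfg):
    """Raise ValueError on an inconsistent config; return the total upsampling factor."""
    data, model = cfg["data"], cfg["model"]
    rates = model["upsample_rates"]
    kernels = model["upsample_kernel_sizes"]
    if len(rates) != len(kernels):
        raise ValueError("upsample_rates and upsample_kernel_sizes differ in length")
    total = prod(rates)
    if total != data["hop_length"]:
        raise ValueError(f"prod(upsample_rates)={total} != hop_length={data['hop_length']}")
    if data["win_length"] > data["filter_length"]:
        raise ValueError("win_length must not exceed filter_length")
    return total


def frame_ms(n_samples, sample_rate_hz):
    """Duration in milliseconds of `n_samples` at the given sampling rate."""
    return 1000.0 * n_samples / sample_rate_hz


cfg_48k = apply_overrides(BASE, OVERRIDES_48K)
print("total upsampling:", check_consistency(cfg_48k))
# Assumption: the model is trained on 48,000 Hz audio.
print("hop duration (ms):", frame_ms(cfg_48k["data"]["hop_length"], 48000))
```

With these values the hop works out to 10 ms and the window to 40 ms at 48 kHz, and 6 x 5 x 4 x 2 x 2 = 480 matches the new `hop_length`, just as 8 x 8 x 2 x 2 = 256 matches the original one.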

r9y9 commented 1 year ago

Shall we close this issue?

fmac2000 commented 1 year ago

@r9y9 thanks, this was useful - may I ask what machine spec you used? - I'd like to train this myself but I'm looking to figure out the pricing before I train.

Thanks for the paper, you guys are killing it right now

chomeyama / DualCycleGAN

Vits 48Khz quality #1