Comparative Analysis and Training Results of VITS2 with HifiGAN, iSTFT and BigVGAN

FENRlR / MB-iSTFT-VITS2

Application of MB-iSTFT-VITS components to vits2_pytorch

MIT License

Comparative Analysis and Training Results of VITS2 with HifiGAN, iSTFT and BigVGAN #2

Open shigabeev opened 1 year ago

shigabeev commented 1 year ago

Greetings,

First and foremost, I'd like to extend my commendations on developing such an outstanding model; its performance surpasses anything I have personally trained thus far. It's a noteworthy contribution to the field, and I applaud your work.

I've conducted a series of training experiments to validate the efficiency and efficacy of your model. For ease of reference, I've made the training results, model weights, and TensorBoard logs publicly accessible. You can review them via the following Google Drive link: Training Results and Model Weights

Moreover, I've prepared audio samples that compare the performance of your model with that of VITS2, HifiGAN, and BigVGAN. This will offer a comprehensive perspective on how your model stacks up against other state-of-the-art solutions in the domain. Comparative Audio Samples

Best wishes

FENRlR commented 1 year ago

A huge thank you for sharing the results. The main reason for using iSTFT here was the fast synthesis speed it showed in its original VITS variant. As such, I would say the result is far beyond my expectations. Magnificent.

shigabeev commented 1 year ago

@FENRlR do you know by chance the optimal configs for different sampling rates? I need 16kHz, 24kHz and 48kHz.

FENRlR commented 1 year ago

Currently, no. It seems there were some issues with 16kHz sampling rate in the original iSTFT repo. I've never seen the other two, however.

p0p4k commented 1 year ago

@FENRlR hi, can you add me on Discord and ping me? (id -> p0p4k) Thanks.

DavidNTompkins commented 1 year ago

Super neat! Was this on an A100? Looks like it took ~3 days?

Insensiblee commented 1 year ago

I downloaded the model from the drive link you provided and got this error when running inference. Do you know how to solve it? RuntimeError: Error(s) in loading state_dict for SynthesizerTrn: size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([155, 192]) from checkpoint, the shape in current model is torch.Size([205, 192]).

shigabeev commented 1 year ago

> I downloaded the model from the drive link you provided and got this error when running inference. Do you know how to solve it? RuntimeError: Error(s) in loading state_dict for SynthesizerTrn: size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([155, 192]) from checkpoint, the shape in current model is torch.Size([205, 192]).

Hey, it's possible that the repository has changed and some weight sizes no longer match the defaults. The easiest way to run it is to check out the commit from around the time of the original post, plug in the weights, and launch it from there.

FENRlR commented 1 year ago

@Insensiblee Before reverting to that commit, have you tried changing the symbols? The symbol set he used for Russian has exactly 155 entries, while the default symbol set has 205. So I'm 90% sure that you forgot to modify it.
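
For anyone hitting the same size mismatch, here is a minimal sketch of how to confirm the diagnosis before loading the weights. It compares the number of rows in `enc_p.emb.weight` (the parameter named in the error) with the length of the symbol list the model is built from. The helper name `vocab_mismatch` and the placeholder symbol lists are hypothetical and not part of the repository. A `SimpleNamespace` stands in for the real tensor so the snippet runs without PyTorch.

```python
from types import SimpleNamespace

# Parameter name taken from the RuntimeError quoted in this thread.
EMB_KEY = "enc_p.emb.weight"


def vocab_mismatch(state_dict, symbols, key=EMB_KEY):
    """Return (checkpoint_rows, symbol_count) if they differ, else None.

    Hypothetical helper: it only reads `.shape[0]` of the embedding weight,
    so any object with a `.shape` attribute works (e.g. a torch.Tensor).
    """
    ckpt_rows = state_dict[key].shape[0]
    if ckpt_rows != len(symbols):
        return ckpt_rows, len(symbols)
    return None


# Stand-in for the downloaded checkpoint: shape copied from the error message.
fake_state = {EMB_KEY: SimpleNamespace(shape=(155, 192))}

# Placeholder symbol lists with the sizes mentioned above (contents are dummies).
default_symbols = [f"sym{i}" for i in range(205)]
russian_symbols = [f"sym{i}" for i in range(155)]

print(vocab_mismatch(fake_state, default_symbols))  # (155, 205) -> mismatch
print(vocab_mismatch(fake_state, russian_symbols))  # None -> sizes agree
```

With a real checkpoint you would pass the loaded state dict instead of `fake_state`. In VITS-style repositories this is typically `torch.load(path, map_location="cpu")["model"]`, though the `"model"` key is an assumption about how the checkpoint was saved. The symbol list would be the one your text front end actually imports.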