MasayaKawamura / MB-iSTFT-VITS

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
Apache License 2.0

Training Time #4

Closed FanhuaandLuomu closed 1 year ago

FanhuaandLuomu commented 1 year ago

Hi, can you share your training speed on an A100, such as the time cost per 10k steps? I'm training MS-iSTFT-VITS and find it is slower than the original VITS.

MasayaKawamura commented 1 year ago

Hi! Even in my own training runs on an A100, I confirmed that the total training time of MS-iSTFT-VITS is slow... I have also confirmed that MS-iSTFT-VITS is slower than MB-iSTFT-VITS in training time.

leminhnguyen commented 1 year ago

Hi @MasayaKawamura, the VITS model suffers from mispronunciation, so it usually has a higher CER or WER than other models. Did you see any pronunciation improvements with this model? Anyway, thank you for your amazing work.

FanhuaandLuomu commented 1 year ago

Hi @MasayaKawamura, how many steps did the model need to reach relatively good quality in your experiments? I see in the paper that you trained for 800k steps.

MasayaKawamura commented 1 year ago

Hi @leminhnguyen, thank you for the question. I have not done any comparisons on WER, etc., so I don't know for sure. From the few samples (you can check the audio samples on the demo page), I think there are few critical word errors. However, WER depends on the input text length, so a more detailed analysis is needed.

This paper may be helpful regarding VITS and WER.

MasayaKawamura commented 1 year ago

Hi @FanhuaandLuomu, thank you for the question. In the paper, all models were trained for 800k steps to match experimental conditions. How many steps are needed to obtain relatively good quality is a difficult question, because it also depends on the hyperparameters and the dataset. This is just my opinion, but I think you can synthesize relatively good speech in under 800k steps (I would have to evaluate the specific number of steps with MOS to be sure).

FanhuaandLuomu commented 1 year ago

Hi, thanks for your great work. Could you open-source the model structure of your small version? Thanks again. @MasayaKawamura

MasayaKawamura commented 1 year ago

Hi @FanhuaandLuomu, I added the config files for the Mini-iSTFT-VITS and Mini-MB-iSTFT-VITS models described in our paper. Please check the configs.

guoyingying432 commented 1 year ago

@FanhuaandLuomu @MasayaKawamura I have a question: how long did it take you to train 800k steps? Maybe one week?

MasayaKawamura commented 1 year ago

Hi @guoyingying432, I think the computation time depends on the hyperparameters, GPU, etc. Under the conditions in the paper, it takes about one to two weeks.
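
For anyone who wants a rough answer to the "time per 10k steps" question above, here is a minimal back-of-envelope sketch based only on the figures mentioned in this thread (800k steps in roughly one to two weeks on an A100). These numbers are assumptions for illustration, not measurements from the repo:

```python
# Rough throughput estimate from the figures discussed in this thread.
# Assumed inputs: 800k total steps, 7-14 days of wall-clock time on an A100.

def seconds_per_step(total_steps: int, total_days: float) -> float:
    """Average wall-clock seconds per training step."""
    return total_days * 24 * 3600 / total_steps

def hours_per_10k_steps(total_steps: int, total_days: float) -> float:
    """Time cost per 10k steps, the unit asked about in this issue."""
    return seconds_per_step(total_steps, total_days) * 10_000 / 3600

# One-week run: 800k steps in 7 days
print(round(seconds_per_step(800_000, 7), 3))    # ~0.756 s/step
print(round(hours_per_10k_steps(800_000, 7), 1)) # ~2.1 h per 10k steps
```

So under the one-week assumption, each 10k-step chunk takes roughly two hours; a two-week run would double that. Actual numbers will vary with batch size, dataset, and the model variant (MS vs. MB vs. Mini).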