MasayaKawamura / MB-iSTFT-VITS

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
Apache License 2.0

Got strange durations #23

Open JohnHerry opened 10 months ago

JohnHerry commented 10 months ago

Thanks for the work. I am trying to train this project on another dataset, which also has 22.05 kHz samples. During training, however, the speech generated in evaluation has strange durations: it is much slower than its ground-truth ("gt") counterpart. The GT audio is about 5 seconds long, while the generated evaluation audio is about 10 seconds. My config is exactly the same as ljs_mb_istft_vits.json, except for my own filelist and text_cleaner. The same kind of config shows no such problem when training VITS. Can anybody give some suggestions? I also tried setting "add_blank" to false in the config; things got better, but still not good enough. In the original VITS, the add_blank option caused no trouble on my dataset, whether true or false.
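For context on the add_blank option mentioned above, here is a minimal sketch of what it is understood to do in VITS-style code: a blank token is placed before, between, and after the symbol IDs, so the text sequence roughly doubles in length. The function below mirrors the `intersperse` helper in the VITS `commons` module, but it is written from scratch for illustration. `frames_to_seconds` is a hypothetical helper, and the `hop_length=256` and `sampling_rate=22050` defaults are assumptions that should be checked against the actual config.

```python
def intersperse(tokens, blank_id=0):
    """Return tokens with blank_id inserted before, between, and after each item."""
    result = [blank_id] * (len(tokens) * 2 + 1)
    result[1::2] = tokens  # original tokens land on the odd positions
    return result


def frames_to_seconds(num_frames, hop_length=256, sampling_rate=22050):
    """Convert a count of spectrogram frames to seconds (hypothetical helper).

    hop_length and sampling_rate are assumed values, not read from any config.
    """
    return num_frames * hop_length / sampling_rate


if __name__ == "__main__":
    symbols = [5, 7, 9]
    print(intersperse(symbols))      # sequence length goes from 3 to 7
    print(frames_to_seconds(431))    # total predicted frames -> audio length
```

Summing the predicted per-token durations (in frames) and passing the total to `frames_to_seconds` gives a quick way to compare the predicted length against the ground-truth clip length, with and without add_blank.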

JohnHerry commented 10 months ago

Does the deterministic duration predictor (DDP) cost less than the stochastic duration predictor (SDP)? It seems that the durations from the DDP are worse than those from the SDP.