MoonInTheRiver / DiffSinger

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code
MIT License
4.3k stars · 714 forks

about some missing parts #5

Closed · keonlee9420 closed this issue 2 years ago

keonlee9420 commented 2 years ago

Hi, thanks for your work on DiffSinger, and also for mentioning my implementation; I only realized it yesterday :)

With your detailed documentation in the README and the paper, I was able to reproduce the training and inference procedure and the results with this repo. During that, however, I found some parts missing for full training of the shallow version: the current code seems to support only a forced K (which is 71) with the pre-trained FastSpeech 2 (especially its decoder). If I understood correctly, we need a boundary-prediction process and FastSpeech 2 pre-training before training DiffSpeech in the shallow mode. Maybe I missed it somewhere in the repo, but if it has not been pushed yet, I wonder whether you plan to provide that part soon.
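For concreteness, a minimal sketch of the shallow-mode inference I mean, with a forced K. `denoise_step` and the schedule indexing are illustrative stand-ins, not this repo's actual API:

```python
import torch

@torch.no_grad()
def shallow_infer(fs2_mel, alphas_cumprod, denoise_step, K=71):
    """Shallow diffusion sketch: diffuse the pre-trained FS2 (aux decoder)
    mel to step K via q(x_K | x_0), then run only K reverse steps instead
    of denoising from pure noise at step T.
    Assumes `alphas_cumprod` is the DDPM \\bar{alpha}_t schedule as a tensor."""
    a_K = alphas_cumprod[K]
    # q-sample: x_K = sqrt(a_K) * x_0 + sqrt(1 - a_K) * noise
    x = torch.sqrt(a_K) * fs2_mel + torch.sqrt(1.0 - a_K) * torch.randn_like(fs2_mel)
    for t in range(K, 0, -1):      # x_t -> x_{t-1}, t = K .. 1
        x = denoise_step(x, t)     # stand-in for p_theta(x_{t-1} | x_t)
    return x
```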

Thanks in advance!

Best, keon

MoonInTheRiver commented 2 years ago

We've described the new way to obtain k in readme.md (from Appendix B of our paper). We found the original boundary-prediction network quite cumbersome, with limited generalization. Reviewers at the previous venue also complained that "this network increases the model complexity...". Simply put, view k as a hyper-parameter: brute-force search it on the validation set, or follow our Appendix B to get a reasonably rational value.
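For illustration, here is a minimal sketch of that brute-force idea under the Gaussian forward process q(x_t | x_0) of DDPM; the function name, data handling, and threshold are assumptions, not this repo's actual code:

```python
import torch

def choose_k(gt_mels, fs2_mels, alphas_cumprod, threshold=0.1):
    """Return the smallest step t at which the diffused ground-truth and
    diffused FS2 mels become hard to tell apart, i.e. the mean per-bin KL
    between q(x_t | gt) and q(x_t | fs2) drops below `threshold`.
    `gt_mels` / `fs2_mels` are matched lists of [T, n_mels] tensors."""
    for t, a in enumerate(alphas_cumprod):
        # Both q(x_t | x_0) are Gaussians with variance (1 - a), so their
        # KL reduces to a * ||gt - fs2||^2 / (2 * (1 - a)) per element.
        kls = [a * ((g - f) ** 2).mean() / (2.0 * (1.0 - a))
               for g, f in zip(gt_mels, fs2_mels)]
        if torch.stack(kls).mean() < threshold:
            return t
    return len(alphas_cumprod) - 1
```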

As for fs2, it has been pushed in the latest version of the code, but note that it is not the official code for the FastSpeech 2 paper :)

Thanks for your attention to our work! Best wishes.

keonlee9420 commented 2 years ago

Thanks for the quick response!

Gotcha, I see. Then could you guide me on pre-training fs2? Suppose we want to train DiffSpeech on VCTK (e.g., by adding a conventional speaker embedding to the text hidden states before diffusion) and want a pre-trained FS2, including its decoder, for shallow training (assuming we found that we can fix K to the same value as on LJSpeech). How can I get it to work under these conditions?

MoonInTheRiver commented 2 years ago

Basic command for training fs2 (LJSpeech):

```bash
CUDA_VISIBLE_DEVICES=1 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_test --reset
```

If you want to train fs2/ds on a multi-speaker dataset, you should turn on the 'use_spk_id' or 'use_spk_embed' option and carefully check the related code. But I'm not sure about the performance on VCTK (I've never tried it); some hyper-parameters may need tuning.
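As a rough sketch of the "speaker embedding added to the text hidden states" setup discussed above (purely illustrative; `SpkConditioner`, `n_speakers`, and `spk_ids` are hypothetical names, not the repo's `use_spk_embed` code path):

```python
import torch
import torch.nn as nn

class SpkConditioner(nn.Module):
    """Add a learned per-speaker embedding to the encoder's text hidden
    states before they condition the diffusion decoder."""
    def __init__(self, n_speakers: int, hidden_size: int):
        super().__init__()
        self.spk_embed = nn.Embedding(n_speakers, hidden_size)

    def forward(self, text_hidden: torch.Tensor, spk_ids: torch.Tensor):
        # text_hidden: [B, T, H]; spk_ids: [B]
        # Broadcast one embedding per utterance across all frames.
        return text_hidden + self.spk_embed(spk_ids).unsqueeze(1)
```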

keonlee9420 commented 2 years ago

Thanks for your answer! I'll try it and hopefully open a PR if I get some results. Looking forward to the upcoming release :)

MoonInTheRiver commented 2 years ago

I've found that DiffGAN-TTS has implemented multi-speaker TTS on top of our DiffSpeech and obtained very good results 😀.

keonlee9420 commented 2 years ago

Thanks for sharing, @MoonInTheRiver! I'm taking a look, and if possible, I'd like to implement it and share the results soon :)

keonlee9420 commented 2 years ago

@MoonInTheRiver I just released my implementation of DiffGAN-TTS:

https://github.com/keonlee9420/DiffGAN-TTS

and I can see that it generates high-fidelity speech samples within only 4 denoising steps (and even a single step) on VCTK and LJSpeech. Interestingly, in my experiments comparing DiffGAN-TTS and DiffSpeech, I found that DiffSpeech with 4 steps also shows comparable performance on LJSpeech, so I think we can enjoy fast sampling with either of them! For those who are interested, please check my repo for more details. Any suggestions are always welcome :)
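For readers curious what few-step sampling looks like mechanically, here is a generic strided, DDIM-style sampler sketch. It is not DiffGAN-TTS's actual sampler (which relies on a GAN-trained denoiser), and `eps_model` is a stand-in for a trained noise predictor:

```python
import torch

@torch.no_grad()
def few_step_sample(x_T, eps_model, alphas_cumprod, steps=(99, 66, 33, 0)):
    """Deterministic DDIM-style updates over a short sub-sequence of
    timesteps (here 4 out of T=100, chosen for illustration).
    `eps_model(x, t)` predicts the injected noise at step t."""
    x = x_T
    for i, t in enumerate(steps):
        a_t = alphas_cumprod[t]
        eps = eps_model(x, t)
        # Predict x_0 from the current noisy sample and the noise estimate.
        x0 = (x - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
        if i + 1 == len(steps):
            x = x0                      # final step: output the x_0 estimate
        else:
            a_prev = alphas_cumprod[steps[i + 1]]
            # Jump directly to the next visited timestep (eta = 0 DDIM).
            x = torch.sqrt(a_prev) * x0 + torch.sqrt(1.0 - a_prev) * eps
    return x
```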