Closed qo4on closed 4 years ago
Best in terms of speed or quality? If you're talking about quality: according to https://arxiv.org/pdf/1910.10909.pdf, Transformer.v3 is slightly better than Tacotron2.v3. As far as I know, we have not run a MOS evaluation across our recent vocoders, but Parallel WaveGAN gives quite good quality and is very fast. So, Parallel WaveGAN is a reasonable option.
Let me add some information. Several factors determine the quality:
Input format (char vs. phoneme): Char- and phoneme-based models have almost the same quality, but the phoneme-based one is slightly better. In the paper Shinji mentioned, Transformer.v3 is slightly better than Taco2.v3, but I think this difference comes from the difference in input format.
Text2Mel model (Taco 2 vs. Transformer vs. FastSpeech): In terms of naturalness, Taco 2 ≒ Transformer > FastSpeech. In terms of speed, FastSpeech > Taco 2 > Transformer. In terms of stability of generation, FastSpeech >= Taco 2 (w/ constraint) > Transformer. Taco 2 can use an attention constraint in decoding, which lets it generate long sentences stably, while Transformer cannot. FastSpeech is non-autoregressive, so it is very fast and stable, but the current model is slightly worse than Taco 2 and Transformer. The quality may improve if we use more text to generate the training data.
Mel2Wav model (WaveNet vs. Parallel WaveGAN vs. MelGAN): In terms of naturalness, MoL-WaveNet > Parallel WaveGAN > MelGAN. In terms of speed, MelGAN > Parallel WaveGAN >>>>>>>>> WaveNet. PWG has a good balance of quality and speed.
Therefore, in terms of quality, the best one is phoneme-based tacotron.v3 (or v4) + MoL-WaveNet.
What do you mean by "stableness of generation"? Quality or speed?
A Text2Mel model sometimes deletes words or fails to stop generation, especially for very long input sentences. High stability means fewer of these kinds of problems.
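To make "fails to stop" concrete: autoregressive Text2Mel decoding ends when a predicted stop-token probability crosses a threshold; if it never does, decoding only halts at a hard length cap. This is a hedged, minimal sketch of that logic (the `threshold` and `maxlenratio` names mirror typical decoding-config parameters, not exact ESPnet internals):

```python
# Minimal sketch: autoregressive decoding with a stop token.
# If the stop probability never crosses the threshold, generation
# only halts at a hard cap proportional to the input length.
def decode_length(stop_probs, input_len, threshold=0.5, maxlenratio=10.0):
    max_steps = int(maxlenratio * input_len)
    for step, p in enumerate(stop_probs[:max_steps], start=1):
        if p > threshold:
            return step, "stopped"
    return max_steps, "hit max length"

# A well-behaved utterance stops early...
print(decode_length([0.0, 0.1, 0.2, 0.9], input_len=10))  # (4, 'stopped')
# ...a pathological one runs until the cap.
print(decode_length([0.1] * 1000, input_len=10))          # (100, 'hit max length')
```

A "deleted word" is the analogous failure on the attention side: the alignment skips over an encoder position instead of dwelling on it.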
Therefore, in terms of the quality, the best one is
- Phoneme-based tacotron.v3 (or v4) + MoL-WaveNet
Can you share links for this?
Phoneme-based Text2Mel models: https://github.com/espnet/espnet/blob/master/egs/ljspeech/tts1/RESULTS.md#v060-with-frequency-limit-transformer-and-tacotron-2
Compatible MoL-WaveNet: https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr
I found that the low robustness of Taco 2 and Transformer can greatly reduce "quality". So, can you share a link to the best FastSpeech model?
For Tacotron2, it is better to try `use-attention-constraint: true` in the decoding config.
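For intuition, here is a hedged sketch (not ESPnet's actual code) of what such an attention constraint does at each decoding step: the attention weights are masked to a small window around the previously attended encoder index, which enforces roughly monotonic alignment. The `backward_window`/`forward_window` names are illustrative:

```python
import numpy as np

def constrain_attention(att_w, prev_idx, backward_window=1, forward_window=3):
    """Zero out attention weights outside a window around the previously
    attended index, then renormalize -- forcing near-monotonic alignment."""
    masked = np.zeros_like(att_w)
    lo = max(0, prev_idx - backward_window)
    hi = min(len(att_w), prev_idx + forward_window + 1)
    masked[lo:hi] = att_w[lo:hi]
    return masked / masked.sum()

att = np.array([0.05, 0.1, 0.5, 0.2, 0.1, 0.05])
constrained = constrain_attention(att, prev_idx=2)
# weights outside indices [1, 5] are zeroed; the rest renormalized to sum to 1
```

This is why the constrained Taco 2 is far less likely to jump backward in the text or loop at the end of a long sentence.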
Sorry, I'm confused. There are three models there:
phn_train_no_dev_pytorch_train_tacotron2.v2_fastspeech.v4.single
https://drive.google.com/open?id=1ReWzefflfDfohan3r9t0s--ofYlJNQOt
phn_train_no_dev_pytorch_train_tacotron2.v3_fastspeech.v4.single
https://drive.google.com/file/d/1P9I4qag8wAcJiTCPawt6WCKBqUfJFtFp
phn_train_no_dev_pytorch_train_transformer.v3_fastspeech.v4.single_filter_fr_thres0.9
https://drive.google.com/file/d/1ggtkxpI67htyZ3st6jJOeBNwToy2itSp
What is the difference?
The teacher model used for knowledge distillation is different.
I don't understand. Different from what? How can they be tacotron2 and fastspeech at the same time?
FastSpeech requires a teacher model for training. Please read the FastSpeech paper for more details.
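For background on how the teacher is used: FastSpeech trains a duration predictor on phoneme durations extracted from the teacher's attention, which is why each pretrained model name lists both the teacher (e.g. `tacotron2.v3`) and the student (`fastspeech.v4`). A hedged sketch of that extraction (illustrative, not ESPnet code), counting how many decoder frames attend most to each encoder position:

```python
import numpy as np

def durations_from_attention(attn):
    # attn: (T_out, T_in) teacher attention weights.
    # For each output frame, find the most-attended phoneme, then
    # count frames per phoneme to get integer durations.
    argmax_per_frame = attn.argmax(axis=1)
    return np.bincount(argmax_per_frame, minlength=attn.shape[1])

attn = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.2, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8]])
print(durations_from_attention(attn))  # [2 1 1]
```

A better-aligned teacher therefore yields cleaner duration targets, which is one reason the choice of teacher affects FastSpeech quality.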
Compatible MoL-WaveNet: https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr
This WaveNet archive contains `.pth` and `.json` files, but your Colab example takes `.pkl` and `.yml`. Is there any example of using the WaveNet vocoder with FastSpeech? Or do I need another WaveNet for FastSpeech?
Update:
I see that your Colab example is hardcoded to use `ParallelWaveGAN`:
```python
# define neural vocoder
import yaml
import parallel_wavegan.models

with open(vocoder_conf) as f:
    config = yaml.load(f, Loader=yaml.Loader)
vocoder_class = config.get("generator_type", "ParallelWaveGANGenerator")
vocoder = getattr(parallel_wavegan.models, vocoder_class)(**config["generator_params"])
```
Thank you.
As I see, you are looking for `.mol.` in the name, and if it is found, a WaveNet is downloaded. I'm sorry, I'm not very familiar with bash, but I can't run WaveNet. I downloaded the model and changed the paths:
```python
if not os.path.exists("downloads/en/wavenet"):
    !./espnet/utils/download_from_google_drive.sh \
        https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr downloads/en/wavenet tar.gz
vocoder_path = "downloads/en/wavenet/ljspeech.wavenet.mol.v2/191108_mol_wavenet_mel7600_step001000000_ema.pth"
vocoder_conf = "downloads/en/wavenet/ljspeech.wavenet.mol.v2/hparams.json"
```
But I got an error during setup:
```
     26 config = yaml.load(f, Loader=yaml.Loader)
     27 vocoder_class = config.get("generator_type", "ParallelWaveGANGenerator")
---> 28 vocoder = getattr(parallel_wavegan.models, vocoder_class)(**config["generator_params"])
     29 vocoder.load_state_dict(torch.load(vocoder_path, map_location="cpu")["model"]["generator"])
     30 vocoder.remove_weight_norm()

KeyError: 'generator_params'
```
Do you know what I'm doing wrong?
By the way, you have a mistake in naming; if I understand correctly, it should be a WaveNet v2:
ljspeech.wavenet.mol.v1.limit.tar.gz
ljspeech.wavenet.mol.v2
https://drive.google.com/file/d/1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr
I'm not sure what you are doing. Parallel WaveGAN and WaveNet are different. You cannot use the code for Parallel WaveGAN shown in the Colab notebook. MoL-WaveNet is based on https://github.com/r9y9/wavenet_vocoder, as noted in the README. Please check it yourself and see my post https://github.com/espnet/espnet/issues/1752#issuecomment-606973232.
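The `KeyError: 'generator_params'` follows directly from this mismatch: the Parallel WaveGAN loading code indexes a `generator_params` section that a WaveNet `hparams.json` simply does not contain. A minimal illustration (the config contents below are made up for demonstration; the real files differ):

```python
# Parallel WaveGAN configs carry a "generator_params" section...
pwg_config = {
    "generator_type": "ParallelWaveGANGenerator",
    "generator_params": {"in_channels": 1, "out_channels": 1},
}
# ...while a WaveNet hparams.json uses a different, flat schema
# (keys here are illustrative only).
wavenet_config = {"out_channels": 30, "layers": 24}

def lookup_generator_params(config):
    # mirrors the Colab line: (**config["generator_params"])
    return config["generator_params"]

lookup_generator_params(pwg_config)  # fine
try:
    lookup_generator_params(wavenet_config)
except KeyError as err:
    print(f"KeyError: {err}")  # KeyError: 'generator_params'
```

So pointing `vocoder_conf` at `hparams.json` cannot work; the MoL-WaveNet checkpoint needs the loading code from its own library.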
We are not a support desk. Please try to figure out this kind of issue by yourself as much as possible.
@kan-bayashi I am also facing this issue where the Tacotron2 model sometimes deletes words or fails to stop generation for longer and unseen texts. Is there any solution to this problem?
@kan-bayashi I don't see MoL-WaveNet in the espnet2 colab, can it be added?
Sorry for a noob question: what is the best (in terms of quality) English TTS available pretrained today? Is it the following combination, or is there something better?
Tacotron2 | char_train_no_dev_pytorch_train_pytorch_tacotron2.v3 | https://drive.google.com/open?id=1Jo06IbVlq79lMA5wM9OMuZ-ByH1eRPkC
Parallel WaveGAN | ljspeech.parallel_wavegan.v1.limit | https://drive.google.com/open?id=1Grn7X9wD35UcDJ5F7chwdTqTa4U7DeVB