espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

What is the best english TTS available pretrained for today? #1752

Closed qo4on closed 4 years ago

qo4on commented 4 years ago

Sorry for a noob question: what is the best (in terms of quality) English TTS pretrained model available today? Is it the following combination, or is there something better?

  1. Tacotron2 | char_train_no_dev_pytorch_train_pytorch_tacotron2.v3 | https://drive.google.com/open?id=1Jo06IbVlq79lMA5wM9OMuZ-ByH1eRPkC

  2. Parallel WaveGAN | ljspeech.parallel_wavegan.v1.limit | https://drive.google.com/open?id=1Grn7X9wD35UcDJ5F7chwdTqTa4U7DeVB

sw005320 commented 4 years ago

Best in terms of speed or quality? If you're talking about quality: according to https://arxiv.org/pdf/1910.10909.pdf, Transformer.v3 is slightly better than Tacotron2.v3. As far as I know, we have not run MOS evaluations across our recent vocoders, but Parallel WaveGAN gives quite good quality and is very fast. So Parallel WaveGAN is a reasonable option.

kan-bayashi commented 4 years ago

I will add some information. Several conditions determine the quality.

  1. Input format (char vs phoneme): Char- and phoneme-based models are almost the same quality, but the phoneme-based one is slightly better. In the paper Shinji mentioned, Transformer.v3 is slightly better than Taco2.v3, but I think this difference comes from the difference in input format.

  2. Text2Mel model (Taco 2 vs Transformer vs FastSpeech): In terms of naturalness, Taco 2 ≒ Transformer > FastSpeech. In terms of speed, FastSpeech > Taco 2 > Transformer. In terms of stability of generation, FastSpeech >= Taco 2 (w/ attention constraint) > Transformer. Taco 2 can use an attention constraint method in decoding, which generates long sentences stably, while Transformer cannot. FastSpeech is non-autoregressive, so it is very fast and stable, but the current model is slightly worse than Taco 2 and Transformer. The quality may improve if we use more text to generate the training data.

  3. Mel2Wav model (WaveNet vs Parallel WaveGAN vs MelGAN): In terms of naturalness, MoL-WaveNet > Parallel WaveGAN > MelGAN. In terms of speed, MelGAN > Parallel WaveGAN >>>>>>>>> WaveNet. Parallel WaveGAN has a good balance of quality and speed.
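All the Text2Mel and Mel2Wav models above share the same mel-spectrogram interface, so the two stages can be mixed and matched. A minimal sketch of the pipeline shape (placeholder functions and illustrative dimensions, not ESPnet's actual API):

```python
# Two-stage TTS pipeline: Text2Mel produces a mel spectrogram,
# Mel2Wav (the vocoder) upsamples it into a waveform.
# Both functions below are stand-ins for real models.

def text2mel(text):
    # stand-in: pretend to emit one 80-bin mel frame per character
    return [[0.0] * 80 for _ in text]

def mel2wav(mel, hop_size=256):
    # stand-in: a vocoder expands each mel frame into hop_size samples
    return [0.0] * (len(mel) * hop_size)

mel = text2mel("hello")        # 5 frames for "hello"
wav = mel2wav(mel)             # 5 * 256 = 1280 samples
print(len(mel), len(wav))      # -> 5 1280
```

Because the interface between the stages is just the mel spectrogram, swapping Tacotron 2 for FastSpeech (or Parallel WaveGAN for WaveNet) only requires that both models were trained with compatible mel extraction settings.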

Therefore, in terms of the quality, the best one is

  • Phoneme-based tacotron.v3 (or v4) + MoL-WaveNet

qo4on commented 4 years ago

What do you mean by stability of generation? Quality or speed?

kan-bayashi commented 4 years ago

The Text2Mel model sometimes deletes words or fails to stop generation, especially for very long input sentences. Higher stability means fewer of these problems.
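The "fails to stop" failure mode comes from autoregressive decoding: generation continues until the model predicts a stop token, so a missed stop prediction runs on until a step cap. A toy sketch of that loop (hypothetical names and thresholds, not ESPnet's decoder):

```python
# Autoregressive Text2Mel decoding halts when the predicted stop
# probability crosses a threshold; if it never does, only the
# max_steps safeguard ends generation.
def decode(stop_probs, max_steps=1000, threshold=0.5):
    frames = []
    for step, p in enumerate(stop_probs):
        if step >= max_steps:
            break
        frames.append(step)   # stand-in for one generated mel frame
        if p > threshold:     # model predicted end of utterance
            break
    return frames

# Stop token fires at step 3 -> generation halts early.
print(len(decode([0.0, 0.1, 0.2, 0.9])))  # -> 4
# Stop token never fires -> runs on (here until the toy input ends).
print(len(decode([0.0] * 10)))            # -> 10
```

Non-autoregressive models like FastSpeech predict the output length up front, which is why they avoid this class of failure entirely.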

qo4on commented 4 years ago

Therefore, in terms of the quality, the best one is

  • Phoneme-based tacotron.v3 (or v4) + MoL-WaveNet

Can you share links for this?

kan-bayashi commented 4 years ago

Phoneme-based Text2Mel models: https://github.com/espnet/espnet/blob/master/egs/ljspeech/tts1/RESULTS.md#v060-with-frequency-limit-transformer-and-tacotron-2

Compatible MoL-WaveNet: https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr

qo4on commented 4 years ago

I found that the low robustness of Taco 2 and Transformer can greatly reduce "quality". So, can you share a link to the best FastSpeech model?

kan-bayashi commented 4 years ago

See https://github.com/espnet/espnet/blob/master/egs/ljspeech/tts1/RESULTS.md#v061-knowledge-distillation-fastspeech

For Tacotron2, it is better to try use-attention-constraint: true in the decoding config.
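The idea behind the attention constraint is to restrict each decoder step's attention to a small window around the previously attended input position, which keeps the alignment monotonic on long sentences. A hedged sketch of that masking step (window sizes and names are illustrative, not ESPnet's implementation):

```python
# Constrain attention weights to a window around the last attended
# input index, then renormalize. This forbids the alignment from
# jumping backward or skipping far ahead mid-sentence.
def constrain_attention(weights, prev_idx, backward=1, forward=3):
    lo = max(0, prev_idx - backward)
    hi = min(len(weights), prev_idx + forward + 1)
    masked = [w if lo <= i < hi else 0.0 for i, w in enumerate(weights)]
    total = sum(masked) or 1.0
    return [w / total for w in masked]

w = constrain_attention([0.1, 0.1, 0.5, 0.1, 0.1, 0.1], prev_idx=2)
print(max(range(len(w)), key=w.__getitem__))  # -> 2 (stays near prev_idx)
```

Transformer-TTS uses multiple attention heads without a single monotonic alignment, which is why the same trick cannot be applied there.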

qo4on commented 4 years ago

Sorry, I'm confused. There are three models there:

phn_train_no_dev_pytorch_train_tacotron2.v2_fastspeech.v4.single
https://drive.google.com/open?id=1ReWzefflfDfohan3r9t0s--ofYlJNQOt

phn_train_no_dev_pytorch_train_tacotron2.v3_fastspeech.v4.single
https://drive.google.com/file/d/1P9I4qag8wAcJiTCPawt6WCKBqUfJFtFp

phn_train_no_dev_pytorch_train_transformer.v3_fastspeech.v4.single_filter_fr_thres0.9
https://drive.google.com/file/d/1ggtkxpI67htyZ3st6jJOeBNwToy2itSp

What is the difference?

kan-bayashi commented 4 years ago

Teacher model for knowledge distillation is different.

qo4on commented 4 years ago

I don't understand. Different from what? How can they be both tacotron2 and fastspeech at the same time?

kan-bayashi commented 4 years ago

FastSpeech requires a teacher model for training. Please read the FastSpeech paper for more details.
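Concretely, the autoregressive teacher (a Tacotron 2 or Transformer model, which is the name in each filename) supplies per-phoneme durations extracted from its attention alignments, and the non-autoregressive FastSpeech student expands the phoneme sequence by those durations in one pass. A minimal sketch of the two pieces (toy data, illustrative names):

```python
# Teacher side: count how many decoder frames attend to each input
# phoneme in the teacher's attention alignment -> per-phoneme durations.
def durations_from_alignment(alignment):
    counts = [0] * (max(alignment) + 1)
    for idx in alignment:
        counts[idx] += 1
    return counts

# Student side: FastSpeech's length regulator repeats each phoneme
# embedding by its duration, fixing the output length up front.
def length_regulate(phonemes, durations):
    return [p for p, d in zip(phonemes, durations) for _ in range(d)]

align = [0, 0, 1, 1, 1, 2]               # frame -> attended phoneme index
durs = durations_from_alignment(align)
print(durs)                              # -> [2, 3, 1]
print(length_regulate(["a", "b", "c"], durs))
# -> ['a', 'a', 'b', 'b', 'b', 'c']
```

This is why the three checkpoints differ only in the teacher: a different teacher yields different duration targets (and, with distillation, different mel targets) for the same FastSpeech architecture.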

qo4on commented 4 years ago

Compatible MoL-WaveNet: https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr

This WaveNet archive contains a .pth and a .json, but your Colab example takes a .pkl and a .yml. Is there any example of using the WaveNet vocoder with FastSpeech? Or do I need another WaveNet for FastSpeech?

Update: I see that your Colab example is hardcoded to use ParallelWaveGAN:

# define neural vocoder
import yaml
import parallel_wavegan.models
with open(vocoder_conf) as f:
    config = yaml.load(f, Loader=yaml.Loader)
vocoder_class = config.get("generator_type", "ParallelWaveGANGenerator")
vocoder = getattr(parallel_wavegan.models, vocoder_class)(**config["generator_params"])
kan-bayashi commented 4 years ago

See this part. https://github.com/espnet/espnet/blob/95ecf5f15fe2577ec649363e9700cd5721885c02/utils/synth_wav.sh#L317-L329

qo4on commented 4 years ago

Thank you. As I see, you look for .mol. in the name and, if it is found, download a WaveNet. I'm sorry, I'm not very familiar with bash, but I can't run the WaveNet. I downloaded the model and changed the paths:

if not os.path.exists("downloads/en/wavenet"):
    !./espnet/utils/download_from_google_drive.sh \
        https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr downloads/en/wavenet tar.gz

vocoder_path = "downloads/en/wavenet/ljspeech.wavenet.mol.v2/191108_mol_wavenet_mel7600_step001000000_ema.pth"
vocoder_conf = "downloads/en/wavenet/ljspeech.wavenet.mol.v2/hparams.json"

But got an error in setup:

     26     config = yaml.load(f, Loader=yaml.Loader)
     27 vocoder_class = config.get("generator_type", "ParallelWaveGANGenerator")
---> 28 vocoder = getattr(parallel_wavegan.models, vocoder_class)(**config["generator_params"])
     29 vocoder.load_state_dict(torch.load(vocoder_path, map_location="cpu")["model"]["generator"])
     30 vocoder.remove_weight_norm()

KeyError: 'generator_params'

Do you know what I'm doing wrong?

By the way, you have a naming mistake; if I understand correctly, it should be wavenet v2:

ljspeech.wavenet.mol.v1.limit.tar.gz
ljspeech.wavenet.mol.v2
https://drive.google.com/file/d/1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr
kan-bayashi commented 4 years ago

I'm not sure what you are doing. Parallel WaveGAN and WaveNet are different; you cannot use the Parallel WaveGAN code shown in the Colab notebook. MoL-WaveNet is based on https://github.com/r9y9/wavenet_vocoder, as noted in the README. Please check it yourself and see my post https://github.com/espnet/espnet/issues/1752#issuecomment-606973232.
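The KeyError above follows directly from this: the Colab loader assumes an ESPnet/ParallelWaveGAN-style YAML config with a generator_params section, while the MoL-WaveNet hparams.json from r9y9/wavenet_vocoder is a flat hyperparameter dict with no such key. A hedged sketch of telling the two apart before choosing a loader (the key names in the sample dict are illustrative, not the real hparams schema):

```python
# A ParallelWaveGAN-style config nests the generator arguments under
# "generator_params"; an r9y9 WaveNet hparams file does not, so indexing
# config["generator_params"] on it raises KeyError.
def detect_vocoder_family(config: dict) -> str:
    """Guess which vocoder loader a config dict belongs to."""
    if "generator_params" in config:
        return "parallel_wavegan"   # load via parallel_wavegan.models
    return "wavenet_vocoder"        # load via r9y9/wavenet_vocoder instead

pwg_config = {"generator_type": "ParallelWaveGANGenerator",
              "generator_params": {"in_channels": 1}}
wavenet_hparams = {"name": "wavenet_vocoder", "out_channels": 30}

print(detect_vocoder_family(pwg_config))       # -> parallel_wavegan
print(detect_vocoder_family(wavenet_hparams))  # -> wavenet_vocoder
```

In other words, pointing vocoder_conf at hparams.json is not enough: the MoL-WaveNet checkpoint needs the wavenet_vocoder package's own loading and synthesis code, as the synth_wav.sh excerpt linked earlier does.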

We are not a support desk. Please try to figure out this kind of issue by yourself as much as you can.

nvadigauvce commented 4 years ago

@kan-bayashi I am also facing this issue where the Tacotron2 model sometimes deletes words or fails to stop generation for longer and unseen texts. Is there any solution to this problem?

turian commented 3 years ago

@kan-bayashi I don't see MoL-WaveNet in the espnet2 colab, can it be added?