Kyubyong / tacotron

A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
Apache License 2.0

Difference Between Current Code and Original Paper #29

Open candlewill opened 7 years ago

candlewill commented 7 years ago
  1. Learning rate decay. In the original paper, the learning rate starts at 0.001 and is reduced to 0.0005, 0.0003, and 0.0001 after 500K, 1M, and 2M global steps, respectively, whereas the code uses a fixed learning rate of 0.001 (see the schedule sketch after this list).

  2. No batch normalization for conv1d in the encoder (https://github.com/Kyubyong/tacotron/issues/12); see the conv1d sketch after this list.

  3. Wrong conv1d size in the CBHG of the post-processing net (https://github.com/Kyubyong/tacotron/issues/13).

  4. The CBHG structure in the post-processing net does not use a residual connection. This may be a compromise, because a residual can only be added when the input and output dimensions match; the original paper is unclear on this point.

  5. The last layer of the decoder uses a fully connected layer to predict the mel spectrogram. The paper says that predicting r frames at each decoder step is an important trick. It is unclear whether T = T' or T ≠ T' in the mapping [N, T, C] -> [N, T', C * r]. The code keeps T = T', but it is also possible that T' = T / r with frame reduction (see the reshape sketch after this list).

  6. Decoder input problem. The paper says that, in inference, only the last frame of the r predictions is fed into the decoder (except for the last step). However, the code uses all of the r frames. During training there is the same problem: the paper specifies that every r-th ground-truth frame is fed into the decoder, rather than all of the r frames (see the slicing sketch after this list).
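On item 1, the paper's schedule can be written as a piecewise-constant learning rate. A minimal TF 1.x sketch; the variable names are illustrative, not the repository's actual code:

```python
import tensorflow as tf

# Sketch of the paper's schedule: 0.001 until 500K steps,
# then 0.0005, 0.0003, and 0.0001 after 500K, 1M, and 2M global steps.
step = tf.cast(tf.train.get_or_create_global_step(), tf.int32)
learning_rate = tf.train.piecewise_constant(
    step,
    boundaries=[500000, 1000000, 2000000],
    values=[1e-3, 5e-4, 3e-4, 1e-4])
optimizer = tf.train.AdamOptimizer(learning_rate)
```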
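On item 2, the paper applies batch normalization to all conv1d layers. A minimal sketch of such a layer in TF 1.x; the helper name and arguments are assumptions for illustration:

```python
import tensorflow as tf

def conv1d_bn(inputs, filters, kernel_size, is_training):
    """conv1d followed by batch normalization and ReLU, as the paper prescribes."""
    outputs = tf.layers.conv1d(inputs, filters=filters, kernel_size=kernel_size,
                               padding="same", activation=None)
    outputs = tf.layers.batch_normalization(outputs, training=is_training)
    return tf.nn.relu(outputs)
```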
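On item 5, under the T' = T / r reading, the mel targets would be grouped into non-overlapping blocks of r frames. A sketch of that reshape, assuming T has been padded to a multiple of r (the function name and the n_mels default are hypothetical):

```python
import tensorflow as tf

def group_frames(mel_targets, r, n_mels=80):
    """[N, T, C] -> [N, T // r, C * r]: one decoder step predicts r frames."""
    n = tf.shape(mel_targets)[0]
    t = tf.shape(mel_targets)[1]
    return tf.reshape(mel_targets, [n, t // r, n_mels * r])
```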
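On item 6, feeding only every r-th ground-truth frame during training amounts to a strided slice over the frame axis. A sketch under the same padding assumption as above:

```python
import tensorflow as tf

def decoder_inputs_from_targets(mel_targets, r):
    """Select every r-th ground-truth frame as the next decoder input,
    matching the paper's description instead of feeding all r frames.

    mel_targets: [N, T, C] with T a multiple of r.
    """
    return mel_targets[:, r - 1::r, :]
```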

onyedikilo commented 7 years ago

What about the pre-emphasis 0.97?
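For reference, pre-emphasis with a coefficient of 0.97 is the first-order filter y[t] = x[t] - 0.97 * x[t-1] applied to the waveform before computing spectrograms. A minimal SciPy sketch; the function names are illustrative:

```python
from scipy import signal

def preemphasis(wav, coeff=0.97):
    """Apply y[t] = x[t] - coeff * x[t-1] to boost high frequencies."""
    return signal.lfilter([1.0, -coeff], [1.0], wav)

def inv_preemphasis(wav, coeff=0.97):
    """Invert the pre-emphasis filter when reconstructing audio."""
    return signal.lfilter([1.0], [1.0, -coeff], wav)
```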