Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Implementation Status and planned TODOs #4

Closed Rayhane-mamah closed 6 years ago

Rayhane-mamah commented 6 years ago

This umbrella issue tracks my current progress and discusses the priority of planned TODOs. It has been closed since all objectives were met.

Goal

Model

Feature Prediction Model (Done)

Wavenet vocoder conditioned on Mel-Spectrogram (Done)

Scripts

Extra (optional):

Notes:

All models in this repository will be implemented in Tensorflow as a first stage; if you want to use a Wavenet vocoder implemented in Pytorch, you can refer to this repository, which shows very promising results.

ferintphilipose commented 6 years ago

@Rayhane-mamah, hi. I would like to understand how this experiment runs. To clarify my understanding of the training process: during training, is each input trained on 100 times, and at the 100th iteration is its prediction saved as ljspeech-mel-prediction-step-(n*100).npy?

The predicted mel spectrograms have more frames than their corresponding ground truth. Could this be due to the difference in the data processing method I use for deriving the initial ground truth?

The predicted mel spectrograms also seem to have negative values, in contrast to the entirely positive-valued ground-truth mel spectrograms.

If you could shed some light into this, it would be great. Thanks a lot. :)

Rayhane-mamah commented 6 years ago

Hello @ferintphilipose, thanks for reaching out!

Actually, the training data is always mostly random. Using a tensorflow feeder, we pick a set of random samples, create batches based on data length (to minimize padding) and feed the data to the model (all with shuffling).

So every 100 steps, we have actually trained the model on 100 batches created randomly from the training data. The number of samples in each batch depends on the batch_size parameter (currently set to 64).
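For illustration, here is a minimal sketch of that batching idea (plain Python, not the repo's actual feeder.py; the function name and grouping factor are made up):

    import random

    def make_batches(examples, batch_size=64, groups_per_shuffle=32):
        """examples: list of (text, mel) pairs. Returns randomly ordered batches of
        length-sorted samples, so padding inside each batch stays small."""
        random.shuffle(examples)
        chunk = examples[:batch_size * groups_per_shuffle]   # take a large random chunk
        chunk.sort(key=lambda ex: len(ex[1]))                # sort it by mel length
        batches = [chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size)]
        random.shuffle(batches)                              # shuffle batch order too
        return batches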

I used to save mel plots, alignments and the griffin lim inverted wav of the first sample in the batch every N training steps, where N is a multiple of 100 (N = k * 100). In this latest commit (0330bd0161e4530acbb7bc2db00a1db95b2dc107) I changed N to be a multiple of 500 and only save summaries to tensorboard every 100 steps.

About the ground truth frames, I find it very weird that your predicted mels have a different number of frames from the ground truth wavs, considering the TacoTrainingHelper stops the decoding exactly when the ground truth is finished. But out of curiosity, are you using a different preprocessing than ours? Or are you actually talking about a difference in the number of frames when doing a natural synthesis? That would depend on how well your model learned to output the stop token.

It is natural that the model sometimes outputs negative values, especially if you do not impute finished decoder steps. This is because the model makes its predictions with a linear projection layer with no restriction on the outputs, so values can be negative.

Additionally, in our feeder.py we explicitly set the padding frames to -(hparams.max_abs_value + .1), so if you are using a different preprocessing and your lowest mel value is 0, then please set symmetric_mels = False in hparams.py (it is set to True by default); padding values will then become -0.1.
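As a small illustration of the padding rule described above (parameter names follow hparams.py, but this is a sketch, not the repo's exact code):

    def mel_pad_value(symmetric_mels=True, max_abs_value=4.0):
        if symmetric_mels:
            # data lies in [-max_abs_value, max_abs_value] -> pad just below the minimum
            return -(max_abs_value + 0.1)
        # data lies in [0, max_abs_value] -> pad slightly below zero
        return -0.1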

Hopefully this answered your questions? If there is anything else I can assist you with, please let me know.

MXGray commented 6 years ago

@Rayhane-mamah Awesome work! Just wanted to kindly ask when you'd be able to implement wavenet training and synthesis? :) I have decent GPU resources available right now, and I'd love to train both the feature prediction model and the wavenet vocoder ...

Rayhane-mamah commented 6 years ago

Hello @MXGray, thanks for reaching out.

I am working on it, it shouldn't take longer than a week. I'll try to take care of it this weekend if I find the opportunity.

In the meantime, you can start by training the feature prediction model and doing GTA synthesis.

MXGray commented 6 years ago

Thanks, @Rayhane-mamah! Yeah, started with the feature prediction model a few days ago. Looking forward to wavenet vocoder training and synthesis. :)

keremsozugecer commented 6 years ago

Hello @Rayhane-mamah! Thanks for this, it's very helpful. With some minor issues, I was able to get it up and running on 1 GPU. It looks like it's not working on multiple GPUs, is that correct? Or did I miss something? I tried running on a 4-GPU machine and only 1 was being used. Thanks!

Rayhane-mamah commented 6 years ago

Hello @keremsozugecer, Thanks for reaching out!

Yeah, I actually had not thought about adding multi-GPU support :) I will add it in the next commit, so stay tuned. Keep in mind that I do not have multiple GPUs, so you'll have to provide feedback to let me know if things are working properly :)

keremsozugecer commented 6 years ago

@Rayhane-mamah!, thanks! we would be happy to provide feedback...

Rayhane-mamah commented 6 years ago

Quick notes about (d28cfa9a77afc87902100bd5b2113fbb8541227e):

Please note that it is essential you restart the preprocessing in order to train a new model!

atreyas313 commented 6 years ago

@Rayhane-mamah Hi, thank you so much for sharing your work, and thanks also for the previous comments. I am very eager to try Tacotron. Unfortunately, I don't have decent GPU resources available right now. Would it be possible for you to provide a pretrained model?

Rayhane-mamah commented 6 years ago

Hello @atreyas313, thank you for reaching out!

I actually have a few pretrained models ready for upload locally, I just want to make sure I pick the optimal one before releasing it to the public. It should take a couple more days, so stay tuned :)

The first pretrained model will be trained on the LJ Speech dataset, so I recommend you download the dataset and run the preprocessing to be able to generate GTA samples in case you want to train a Wavenet vocoder after this project.

If you encounter any problems, please let me know!

maozhiqiang commented 6 years ago

@Rayhane-mamah hello! I am training on d28cfa9, but the learning rate becomes 0 forever. (screenshot from 2018-04-18 19-17-27)

Rayhane-mamah commented 6 years ago

Hello, this is not 0, it's actually constant = 1e-3 until step 50k; tensorboard just displays it badly.
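For readers wondering what that plot represents, here is a hypothetical sketch of such a schedule: constant at 1e-3 until step 50k, then decaying (the decay constants below are placeholders, not necessarily what the repo uses):

    def learning_rate(step, init_lr=1e-3, start_decay=50000,
                      decay_rate=0.5, decay_steps=50000, min_lr=1e-5):
        # flat until start_decay, exponential decay afterwards
        if step < start_decay:
            return init_lr
        lr = init_lr * decay_rate ** ((step - start_decay) / decay_steps)
        return max(lr, min_lr)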


maozhiqiang commented 6 years ago

@Rayhane-mamah thank you! I'm still training!

jyegerlehner commented 6 years ago

@unwritten @Rayhane-mamah I get alignment by iteration ~8000 with outputs_per_step=1. All other hyperparams are the same as master except batch_size=32, which was required to avoid out-of-memory on a 12GB GPU. (alignment plots: step-6000, step-7000, step-8000, step-9000)

Rayhane-mamah commented 6 years ago

Hello @jyegerlehner, thanks for your contribution!

I am actually aware of this, I reported it here. I did however find out that the model tends to move forward a little faster than ground truth (when synthesizing naturally, without teacher forcing), which makes the model read sentences faster than ground truth. I am trying to figure out why this is happening with outputs_per_step=1.

I'll let you know how it goes :)

jyegerlehner commented 6 years ago

I am actually aware of this, I reported it here.

Oops, sorry, I missed that.

SynthAether commented 6 years ago

@Rayhane-mamah thanks for this repo, good to see it keeps getting better and better.

@jyegerlehner good to know you got an alignment at 8K steps, that was fast. A question: was this trained on LJ-Speech? What batch size did you use to get this running on your machine?

I am using the latest update from April 20 and set batch size to 16, so far no alignment after 25K.

jyegerlehner commented 6 years ago

@shaunmayberry Yes LJ dataset. My batch size was 32. With batch size 64 I got OOM errors (12GB GPU). Possibly those latest code changes broke something? I'll pull the latest master and go fire up an instance on the other machine starting from ground zero and see if/when I get good alignments.

SynthAether commented 6 years ago

@jyegerlehner thanks for your reply. I went to an earlier code dating from April 18 and I was able to set the batch size to 32 without OOM issues. I now got an alignment at 9K steps.

So possibly something in the latest code from April 20 prevents obtaining an alignment, or I didn't wait long enough; I stopped it after 30K.

Rayhane-mamah commented 6 years ago

@shaunmayberry, I think it is all due to outputs_per_step: reducing it consumes more memory but improves quality.

By the way, alignment does not seem to get learned with a batch size lower than 32.. With batch size 64 and outputs_per_step=5 you can even get alignments at step 1k.

I would personally recommend finding the lowest outputs_per_step that supports a batch size of 32. outputs_per_step=3 should work with 8GB of memory.

If I ever find a way to reduce memory usage I'll update the repo.

Thank you for your contribution!
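To make the advice above concrete, a possible hparams.py combination for a 12GB GPU might look like this (illustrative values only, tune for your hardware):

    # hparams.py (excerpt, illustrative)
    outputs_per_step = 2   # lowest value that still fits batch_size = 32 on your GPU
    batch_size = 32        # alignments rarely appear with a smaller batch size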


jyegerlehner commented 6 years ago

By the way, alignment does not seem to get learned with a batch size lower than 32..

I'm at 10K steps and there's no hint of alignments. This is current master with no changes. I suspect something was broken in the last set of changes.

Rayhane-mamah commented 6 years ago

The latest commit works fine, I am currently running it..

Depending on the model's initial state, alignments might show up a bit late (I once had them around 20k). I will think about adding some seeds in future versions :)

jyegerlehner commented 6 years ago

You are right. Alignments showed up in this run around 20K.

unwritten commented 6 years ago

@Rayhane-mamah, do we have any data comparing the impact of different reduction factors on voice quality? Or is a reduction factor of 1 better?

thanks

Rayhane-mamah commented 6 years ago

Hello,

I have not conducted the proper experiments and plots for a comparison, but just by naive reasoning, it should be easier for the model to predict only the next frame than to predict 3, 4 or 5 frames at the same time, especially during a transition between two characters/tokens.

The overall quality of r=1 is however much better and less noisy than r=5. But considering that we train our model using MSE, spectrograms will always be somewhat blurry, so there will always be some slight noise to be suppressed by Wavenet.

If I ever do an in-depth study and experiment, I'll let you know.

Note: r=2 should be better than r=5 too.


Ondal90 commented 6 years ago

@Rayhane-mamah, thanks for this repo. I am currently training the r=1 model on the Korean dataset. I'm not sure why the model tends to move forward a little faster than ground truth, but in my case the issue seems to be solved by applying zoneout to the decoder RNN (LSTM layers) at inference time too.

Rayhane-mamah commented 6 years ago

Hello @Ondal90 thanks for reaching out!

Great catch! But doesn't setting zoneout at inference time cause prosody and audio quality to become much worse? Also, it doesn't seem right to use zoneout at inference, considering it's a regularization technique..

It is however nice to know that it is related to zoneout in some way, and this will for sure help us improve the model. Thank you very much for this information!

My first thought is that zoneout at inference time causes the decoder to sometimes keep previous hidden states and cell states, causing the RNN to make the same prediction a few times consecutively and thus slowing down the speech. That's a personal interpretation; I will have to look into it in depth.

unwritten commented 6 years ago

@Rayhane-mamah why use max_abs_value for normalization, and why is max_abs_value 4?

Rayhane-mamah commented 6 years ago

@unwritten, it is mainly to widen the output distribution, which in my opinion gives the model more detail to work with.

Here is a deeper explanation of the reason. I'm planning on testing the model with the default normalization too, but hey, I only have one machine..
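For reference, the idea boils down to mapping the dB-scale mels onto a wider symmetric range; a rough sketch (assuming min_level_db = -100, details may differ from the repo's audio.py):

    import numpy as np

    def normalize(S_db, max_abs_value=4.0, min_level_db=-100.0):
        scaled = (S_db - min_level_db) / (-min_level_db)            # -> roughly [0, 1]
        return np.clip(2 * max_abs_value * scaled - max_abs_value,  # -> [-4, 4]
                       -max_abs_value, max_abs_value)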

Ondal90 commented 6 years ago

@Rayhane-mamah The author of the ZONEOUT paper has released his code: ZONEOUT: REGULARIZING RNNS BY RANDOMLY PRESERVING HIDDEN ACTIVATIONS https://github.com/teganmaharaj/zoneout/blob/master/zoneout_tensorflow.py

In this code, he uses previous information at inference time.

    # Inference time
    new_state = state_part_zoneout_prob * state_part + (1 - state_part_zoneout_prob) * new_state_part

I wonder why zoneout_LSTM.py uses c = c_temp, h = h_temp at inference time. I think this code needs to be changed like this:

    # Inference time
    h = h_temp * (1 - self.zoneout_factor_output) + h_prev * self.zoneout_factor_output
    c = c_temp * (1 - self.zoneout_factor_cell) + c_prev * self.zoneout_factor_cell

In my test, it works well... Please let me know if I'm wrong.
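A compact sketch of the behaviour being proposed here, for a single state tensor (tensorflow 1.x style; an illustration, not the repo's zoneout_LSTM.py):

    import tensorflow as tf

    def zoneout(new_part, prev_part, zoneout_prob, is_training):
        if is_training:
            # training: randomly keep some units from the previous state (binary mask)
            keep_prev = tf.cast(
                tf.random_uniform(tf.shape(new_part)) < zoneout_prob, tf.float32)
            return keep_prev * prev_part + (1.0 - keep_prev) * new_part
        # inference: use the expectation of the mask instead of c = c_temp / h = h_temp
        return zoneout_prob * prev_part + (1.0 - zoneout_prob) * new_part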

candlewill commented 6 years ago

@Ondal90 Interesting findings. Does your modified code work better than the current version? And after changing the code, do we need to re-train the model?

Ondal90 commented 6 years ago

wavfile_modified.zip

@candlewill In my experiment, the modified code eliminates the problem of reading sentences faster than ground truth. No, it only changes inference time, so you can use existing trained models.

candlewill commented 6 years ago

@Ondal90 Thanks. It really solved the speech speed problem when using the mel spectrogram. However, the speed of the wave from the linear spectrogram is still very fast.

Ondal90 commented 6 years ago

@candlewill Congratulations. I don't know, because I only used the mel spectrogram. If you use WaveNet, it doesn't matter. However, I wonder why it doesn't work..

Rayhane-mamah commented 6 years ago

@Ondal90, Once again, thanks a lot for these contributions. I suppose I somehow managed to misunderstand the paper :)

Your changes will be applied for future repo version, thank you very much for your help!

@candlewill, don't forget to make the appropriate changes in the post-processing network as well. (We use the same architecture as the encoder for the post-processing net, so if my zoneout implementation is wrong, it should affect the linear outputs as well..) Let me know how it goes ;)

jyegerlehner commented 6 years ago

@Ondal90 and all,

So do you think the training time binary mask scheme that this project uses:

    h = binary_mask_output * h_prev + binary_mask_output_complement * h_temp

is equivalent to the ZONEOUT author's code:

    new_state = (1 - state_part_zoneout_prob) * tf.python.nn_ops.dropout(
        new_state_part - state_part, (1 - state_part_zoneout_prob),
        seed=self._seed) + state_part

At first glance it looks to me like there's at least a difference of a factor of (1 - state_part_zoneout_prob), though I'm still trying to wrap my head around what each is doing exactly.

candlewill commented 6 years ago

New samples (Chinese), based on the code @Ondal90 mentioned, are here: https://goo.gl/YVDBdX

neverjoe commented 6 years ago

I find that if outputs_per_step < 3, it's hard to learn alignment. @candlewill is your outputs_per_step >= 3?

candlewill commented 6 years ago

@neverjoe No. I use the default value outputs_per_step = 1.

Rayhane-mamah commented 6 years ago

@jyegerlehner, considering tensorflow's dropout, zoneout in the author's code and in this project are exactly the same during training (dropout scales the kept inputs by 1/keep_prob). The other "difference" in the author's code is that instead of building a mask and its complement, they use their tricky mathematical formulation, which gives the same result.
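A quick per-element check of that equivalence (plain Python, just to illustrate the scaling argument):

    # With zoneout prob p, tf dropout keeps a value with prob (1 - p) and scales it
    # by 1 / (1 - p), so (1 - p) * dropout(new - old, keep_prob=1 - p) + old gives:
    p, old, new = 0.1, 2.0, 5.0
    kept    = (1 - p) * ((new - old) / (1 - p)) + old   # unit kept    -> 5.0 == new
    dropped = (1 - p) * 0.0 + old                       # unit dropped -> 2.0 == old
    # which matches the binary-mask form: mask * old + (1 - mask) * new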

I believe the main thing @Ondal90 wanted to bring attention to is what the Zoneout paper refers to as "As in dropout, we use the expectation of the random noise at test time". I had actually never paid attention to that detail..

Oh, another "mistake"(?) I was also making is that zoned-out states are only meant to be propagated internally to the next RNN state, while in this project I am also using them as the RNNCell output.. With these in mind I will correct the zoneout.

Great work @Ondal90, thanks for sharing!

unwritten commented 6 years ago

@Rayhane-mamah

In your code, bw_cell and fw_cell are the same self._cell? Shouldn't there be 2 cells: a bw and a fw cell?

    class EncoderRNN:
        ...

        # Create LSTM Cell
        self._cell = ZoneoutLSTMCell(size, is_training,
            zoneout_factor_cell=zoneout,
            zoneout_factor_output=zoneout)

        def __call__(self, inputs, input_lengths):
            with tf.variable_scope(self.scope):
                outputs, (fw_state, bw_state) = tf.nn.bidirectional_dynamic_rnn(
                    self._cell,
                    self._cell,
                    inputs,
                    sequence_length=input_lengths,
                    dtype=tf.float32)

                return tf.concat(outputs, axis=2)  # Concat and return forward + backward outputs

neverjoe commented 6 years ago

@unwritten I think it should be different.

Rayhane-mamah commented 6 years ago

@unwritten I also noticed that a few days ago; separating the cells didn't make any noticeable change in memory, speed, or even training loss. Training a model with the original code, then separating the cells and reloading the saved model checkpoint, causes tensorflow to raise a missing-parameters error. I am wondering how a single cell managed to make a bidirectional representation of the inputs up until now.. Actually, we are using two cells, but they share parameters.. If someone has seen something similar somewhere, I would love to know the explanation!

In any case, creating two different cells is usually how we perform a bidirectional reading of a sequence; Bahdanau also mentioned that the cells in both directions should be independent in his attention paper. So please make sure to create a _fw_cell and a _bw_cell separately (see the sketch below). Thank you for your remark! :)
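For clarity, the fix amounts to something like the following (mirroring the snippet quoted above; exact signatures may differ from the repo's code):

    # Create two independent LSTM cells instead of reusing self._cell twice
    self._fw_cell = ZoneoutLSTMCell(size, is_training,
        zoneout_factor_cell=zoneout,
        zoneout_factor_output=zoneout)
    self._bw_cell = ZoneoutLSTMCell(size, is_training,
        zoneout_factor_cell=zoneout,
        zoneout_factor_output=zoneout)

    outputs, (fw_state, bw_state) = tf.nn.bidirectional_dynamic_rnn(
        self._fw_cell,
        self._bw_cell,
        inputs,
        sequence_length=input_lengths,
        dtype=tf.float32)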

ferintphilipose commented 6 years ago

Hi @Rayhane-mamah, I am conditioning my WaveNet on log mel values computed as follows:

    import numpy as np
    import scipy.signal
    import librosa

    def get_spectrograms(sound_file):
        y, sr = librosa.load(sound_file, sr=16000)
        stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=200, win_length=800,
                                   window=scipy.signal.hanning, center=True))**2
        mel = librosa.feature.melspectrogram(S=stft, sr=16000, n_fft=2048,
                                             n_mels=80, fmin=125, fmax=7600)
        mel = np.log10(mel.T + 1)
        mel = mel.T.astype(np.float32)  # (T, n_mels)
        return mel

The dataset I am using is the VCTK corpus.

The trained model works well enough during evaluation with the mel spectrogram computed in this way. However, it does not work with the mel spectrogram computed by the method used in your data processing script. I had assumed this was due to normalization, but even after de-normalizing the mel values it doesn't work for conditioning. It would be really great if you could guide me on how to alter the data preprocessing in your script so that the predicted log-mel values match those obtained with the method above. Thanks.

HallidayReadyOne commented 6 years ago

Hi @Rayhane-mamah, after model training is complete, I synthesize the same sentence multiple times using the model, but the result of each synthesis is not exactly the same. Although the synthesized speech sounds similar, there are some differences in the waveform. Do you know why this happens? The griffin lim algorithm may be one reason. Anything else? Thanks!

Rayhane-mamah commented 6 years ago

@ferintphilipose Hi, from what I understood, I believe you are looking for a feature interpolation like the one used by r9y9? I also noticed that other parameters (fft_size, hop_size, etc.) are different in your preprocessing, which can also cause problems.

Wavenet cannot recreate speech correctly if it is trained on mels of one scale and tested on mels with a different distribution. My advice? Train the Tacotron on the same mels you used to train the WaveNet; things should go smoother.

@HallidayReadyOne hello! The variation in synthesis is due to the use of pre-net dropout even at inference time. It can be modified within these lines of code. I am confident that the following lines in the T2 paper refer to using dropout at inference time: "In order to introduce output variation at inference time, dropout with probability 0.5 is applied only to layers in the pre-net of the autoregressive decoder." Let's say it's an "extra" that gives the model some sense of "creativity".
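A minimal pre-net sketch showing that choice (tensorflow 1.x layers API; an illustration, not the repo's exact modules.py):

    import tensorflow as tf

    def prenet(x, layer_sizes=(256, 256), drop_rate=0.5):
        for i, size in enumerate(layer_sizes):
            x = tf.layers.dense(x, size, activation=tf.nn.relu,
                                name='prenet_dense_%d' % i)
            # training=True on purpose: dropout stays active at inference time,
            # which is what introduces the output variation between runs
            x = tf.layers.dropout(x, rate=drop_rate, training=True,
                                  name='prenet_dropout_%d' % i)
        return x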

ferintphilipose commented 6 years ago

@Rayhane-mamah, first and foremost, thanks for your input. I was looking for a way to train the Tacotron using log mel values computed via my method, but I am a bit confused about how to run your Tacotron script with them. For instance, when I tried to turn off the silence trimming, audio rescaling and signal normalization while preprocessing the data, and then trained on the resulting dataset, I ran into an exploding loss error. It would be great if you could shed some insight into this problem. Thanks once again.

HallidayReadyOne commented 6 years ago

@Rayhane-mamah Thank you! My problem, I ignored this detail:

    def _griffin_lim(S):
        ...
        angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
        ...

Should this be a reason too?

Rayhane-mamah commented 6 years ago

Hello again @ferintphilipose, sorry for the late reply. I personally recommend that you use our preprocessing while only changing the parameters in hparams.py. If you want to use your own preprocessing, keep in mind that dropping signal normalization will cause loss explosion in this repository, because I set a maximum loss value of 100, which can be exceeded when the data is not normalized (mel values will range from -100 to 20, so the squared error will be much bigger).

Audio rescaling is mainly for wavenet; I believe it is necessary to keep it to make sure the wavs are in [-1, 1].

@HallidayReadyOne, sorry for the late reply. I believe that does not affect the variation in the output generation, considering that griffin lim is an iterative algorithm that converges after a few iterations (60 in our case); the initial value will probably have no great effect. It's a little like gradient descent, where you pick random initial values but usually, when the function is convex, you converge to the same minimum.
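For reference, a bare-bones Griffin-Lim loop looks like this (librosa-based sketch; the hop/win values are placeholders, not necessarily the repo's hparams):

    import numpy as np
    import librosa

    def griffin_lim(S, n_iters=60, n_fft=2048, hop_length=275, win_length=1100):
        # random initial phase, then alternate istft / stft to refine it
        angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
        for _ in range(n_iters):
            y = librosa.istft(S * angles, hop_length=hop_length, win_length=win_length)
            angles = np.exp(1j * np.angle(librosa.stft(
                y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)))
        return librosa.istft(S * angles, hop_length=hop_length, win_length=win_length)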

Finally, @ferintphilipose, wavenet is coming in 10 minutes; maybe seeing the entire Tacotron-2 project in one piece will help you solve your issue. If you need anything else, please let me know!