Understanding blocks for Keras implementation

Ananas120 commented 5 years ago

Hi, First, i have a little question on my training alignment result, is it normal ? I have this king of graph at step ~3k and the loss decreases and quality of audio is not so bad (a bit metallic) but after only 8k steps i think it's not bad ?

step-8000-align

I tried to find Keras implementation of Tacotron-2 + Wavenet but i only find one (without Wavenet) and i am not sure it's a good implementation, so i would try to do it to better understand this model, can anyone help me ? I don't understand how to implement and what do certain blocks...

Encoder

input : text sequence [batch_size, max_len_char] output : text encoded [batch_size, max_len_char, bidirectional_lstm_units] ? Layers : a block of Conv1D layers with Dropout and, at the end, a Bidirectional-LSTM

Attention

Input : text encoded (from encoder) + ? Output : ?? ?? not understand what

Decoder : Inputs : only attention output or attention output + text_encoded ? Output : linear_spectrogram (first) + mel_spectrogram (second) + ? Layers : a block LSTM + Dense (spectrogram prediction) + block of convolution + LSTM (?) + Dense (mel prediction) ?

Wavenet : i don't know yet i will take a look this day's

Thank you and if you wants the final project i can send you (if i achieve it :') ) !

ps : sorry for my English if i did mistake, don't hesitate to correct me so that i learn ;)

freds0 commented 5 years ago

@Ananas120 in which repository did you find an implementation of Tacotron-2? I just saw this:

https://github.com/Stevel705/Tacotron-2-keras

but I realized that this is the Tacotron-1. I am also studying and trying to understand the Tacotron implementations.

Ananas120 commented 5 years ago

Yes it’s the first link but the implementation seems incorrect... i have 400M parameters with the implementation x) I think the part encoder-decoder is not difficult to understand and the implementation in the Tacotron-2-keras repo is not bad but not sure of the Attention mecanism and the output layers. I wil try to understand what the model takes as inputs and outputs during training (if it produces all mel/linear-spectrogram in 1 time (Dense and Reshape) or if it predicts 1 time-step of spectrogram and pass all precedents outputs as next decoder-inputs...

Good luck and if you find a good attention implementation in Keras, can you share it plizz ?

Ananas120 commented 5 years ago

Sorry for my mistake i am a newbie in github it’s my first post ^^' @freds0 is there many differences between Tacotron 1 and 2 ? Couldn’t we reuse parts of Tacotron 1 to build Tacotron 2 (without Wavenet at firrst) ?

Ananas120 commented 5 years ago

What is ‘self.keys’ in the LocationSensitiveAttention ? It is used as W_key in _location_sensitive_score, is it the encoder output ? I think after the attention mechanism, i finish to code the Tacotron-part of the model.

For the training,how batches are made ? I think the input_mel is the x previous ‘predictions’ (for training, the real spectrogram) but the batch is 32 and all input_shape must have same dimensions to form a batch (in keras, is this the same in tensorflow ?) then, is it 32 files at the same time_step (then same number of previous time_step) or is it 1 (or 32) files with random time_step and padded at max_batch_timesteps ? Or all batches have same max_timesteps and all are padded ? Thank you !

freds0 commented 5 years ago

Sorry for my mistake i am a newbie in github it’s my first post ^^' @freds0 is there many differences between Tacotron 1 and 2 ? Couldn’t we reuse parts of Tacotron 1 to build Tacotron 2 (without Wavenet at firrst) ?

In Tacotron-1, there is the CBHG module, which in Tacotron-2 was replaced by a stack of 3 convolutional layers. There are other more specific differences, which can be seen in the paper. Stevel705 repository is based on Packt's "Hands-On Natural Language Processing with Python" book. Unfortunately I have not yet had access.

freds0 commented 5 years ago

What is ‘self.keys’ in the LocationSensitiveAttention ? It is used as W_key in _location_sensitive_score, is it the encoder output ? I think after the attention mechanism, i finish to code the Tacotron-part of the model.

For the training,how batches are made ? I think the input_mel is the x previous ‘predictions’ (for training, the real spectrogram) but the batch is 32 and all input_shape must have same dimensions to form a batch (in keras, is this the same in tensorflow ?) then, is it 32 files at the same time_step (then same number of previous time_step) or is it 1 (or 32) files with random time_step and padded at max_batch_timesteps ? Or all batches have same max_timesteps and all are padded ? Thank you !

I did not analyze the code so deeply. But I remember there is a zero padding in the input files, so that all wav files are of the same size, filling in the spaces with zero. In the Stevel705 code, this is done in preprocessing. Also in preprocessing it generates a file containing only the last frame of the previous step, which is sent as input to the decoder.

Ananas120 commented 5 years ago

@freds0 it’s okay i finally understand all blocks and how training works ! I just wrote a LocationSensitiveAttention layer (in keras) based on this code and it works (i don’t know if the result is the good one but it compiles successfully) (i posted it on my github) now. i try to code the rest of the model but it seems basic (current layers) except for ZoneoutLSTM but i think i can put LSTM at the to test the model

freds0 commented 5 years ago

@Ananas120 great! May I take a look on your code?!

Ananas120 commented 5 years ago

@freds0 yes of course i will post it on my github in a few minutes but i have a problem and i don’t know why In the attention mechanism, (call to _location_sensitive_score(...)) W_query, W_fill and W_query have dimension (batch_size, ?, attention_dim) but in my code only W_fill has attention_dim and, to my understanding, it’s logic because :

W_query comes from processed_query and processed_query comes from DecoderRNN output and this is (logically) of shape (batch_size, ?, hp.decoder_lstm_units) (because the size of the cells are hp.decoder_lstm_units) and query_layer is None (in attention_mechanism) so Processed_query = self.query_layer(query) if self.query_layer else query logically returns query without changing it’s dimension
W_keys comes from self.keys and self.keys comes from self.values (because memory_layer is None) (cf tensorflow/.../attention_wrapper.py) and self.values comes from memory and memory is encoder_outputs with shape (batch_size, max_time, hp.encoder_lstm_units) (logic because it’s the output of the decoder and it’s shape is not changed (or if it is, where it is changed ??)

The solution i can find is to make all these dimensions equal in hparams but it seems bad ^^' anyone has an idea of where the dimensions of query and memory are changed to match with attention_dim ? Thank’s a lot !

Ananas120 commented 5 years ago

Here is my repo : https://github.com/Ananas120/Tacotron2_in_keras And here is my google colab test file : https://colab.research.google.com/drive/1qE9WnKm4LLrXwihQzGNApfWXAlrAWcZ0 new_layers 1, 2, 3 it's just tests for LocationSensitiveAttentions and new_layers 4 it's the final working implementation in tacotron 2 model part, it's all my code from file (github) put here (easier to test and change it)

Ananas120 commented 5 years ago

Oh no i understand why ! query and memory layers are created by the BahdanauAttentionMechanism based on num_units !! i will change it and the code should work fine with that, yes °-°

Ananas120 commented 5 years ago

The model on my github repo compiles ! But with my implementation i have 12M parameters Wi th the original implementation, it says i have 29M parameters so i don’t know if my implementation is correct or if it’s because in my implementation num_outputs is 1 and the original i put 2 (i can’t put > 1 in my implementation, too difficult to make training data, i will perhaps add it in a future version)

freds0 commented 5 years ago

@Ananas120 thx! I'm testing a lot of code, but I haven't gotten deep to the source code yet. I will review your code, I think will help me.

Ananas120 commented 5 years ago

@freds0 I will improve my code in the next days, now i have well 28M parameters (it was just my AttentionMechanism which doesn’t take the rnn_cell weights but now i think the model is right. I do the prediction and training but i don’t know if i can make a callback (in keras) which use the prediction to plot alignments, i think i can do this with a custom training based on « train_on_batch » but more difficult so i will do this later (after basic training (fit_generator) Also i have another problem with the original model because the linear / mel spectrogram is a good representation but the inversion is not really good quality (little metallic) so i use a variant where i replace linear spectrogram by just the STFT (without pre-emphasize / magnitude computation / ...) it’s not as good as linear but easier to inverse and then the quality of sound is perfect and, at the end, it just changes the model should learn how to transform mel to stft and not mel to linear (which is stft plus other transformations) so... i will try and share results ;) I will update my github in 1hour so you can have the complete right model (i think i can achieve basic training and prediction tomorrow)

Ananas120 commented 5 years ago

The model is complete and it’s well compiling and running but the weights become ‘nan’ after 40-50 epochs and i don’t understand why... All the code is on my github

Ananas120 commented 5 years ago

@freds0 why did'nt you go into the code ? it's the best way to understand a model and rewrite it ! :p So my implementation is (i hope) finished but... i think my attention mechanism is really strange (here is weights after 150 step) x) (and it doesn't change) and mel spectrogram is -4 constant value (or +4) but i verify my inputs / outputs generated (by BatchGenerator) and they are ok so i don't understand why it does it alignments_0 my code is on my github (but not the main objec, if you want it, i can post it tomorrow) if you have an idea on what's wrong in my implementation ^^'

freds0 commented 5 years ago

Sorry @Ananas120, I started working on my own project, based on https://github.com/Stevel705/Tacotron-2-keras, and ended up running out of time. But your code is on my to-do list! In fact, my priority right now is to train a model with the dataset I have, but from all the code I tested, I still can't get quality samples.

Perhaps the 'nan' problem will be solved by normalizing the audio/spectrograms files.

Ananas120 commented 5 years ago

@freds0 no problem but you can’t get result with this repo ? It’s in tensorflow but you can easily use it with your dataset (i did it and get fun results) So my code is now fully working and i just have my first result (after only around 1200 steps) and it’s really promizing the voice is understandable but metallic (really like the results i get with the tensorflow implementation in fact) (i just replaced my ZoneoutLSTMCell by LSTMCell...) I will post my model in few hours and the full code this week-end (i must clean it before posting) The training is not as complex as in this repo but it’s working and i will improve it this week ;) I tried to improve the code on the github you use but it’s really simple and does’nt implement right the attention mechanism so not very useful :/ in fact the Tacotron-2 architecture is really classic (except wavenet part, of course) it’s just the attention mechanism that doesn’t exist as Keras core layers

Ananas120 commented 5 years ago

It was well my ZoneoutLSTMCell because in this implementation, ‘tf.nn.dropout’ is used and the argument is the ‘keep_prob’ (probability to keep element) and in ‘K.dropout’, the argument is dropout_prob so i had 1 - prob (0.1 in real case so i had 0.9 to drop)... but now it’s okay all my layers works but i have 3 « issues » :

my attention is never learned but after the same number of epochs, it’s learned with this implementation so... what’s different ?
the results are good but only when predicting with target as input but when i pass last outputs as inputs, it’s so bad (no voice in fact) so i think i pass the wrong « start sequence » (is there a specific « start of sequence » during training ? Now i put as first input a full zero line but is it necessary ? May i pass full constant value (where constant is - max_abs_value (if symmetric_mels) ? Or no specific start of sequence ?)
my model is really slow when generating from itself because i must pass all previous outputs to get new output because i can’t get attention state (i don’t know how to make ‘Recurrent layer’ with undefined state size) so my astuce is to use K.rnn(...) into a normal Layer instance but when i try to put last state as additionnal output, it doesn’t works (why ?) If anyone have time to look at my code, i will post it on my github in few hours... Thank’s a lot !

Rayhane-mamah / Tacotron-2

Understanding blocks for Keras implementation #407