CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pytorch synthesizer #447

Closed ghost closed 4 years ago

ghost commented 4 years ago

Splitting this off from #370, which will remain for tensorflow2 conversion. I would prefer this route if we can get it to work. Asking for help from the community on this one.

One example of a pytorch-based tacotron is: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

Another option is to manually convert the code and pretrained models which would be extremely time-consuming, but also an awesome learning experience.

ghost commented 4 years ago

Some other pytorch-based tacotrons are Mozilla's TTS and this repo from Tomiinek which does voice cloning with multilingual capability.

Does anyone know if the method proposed here is viable? Supposedly you can convert a tensorflow model to pytorch, and refactor the code in the process. https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28

ghost commented 4 years ago

In light of #53 we should try and use fatchord's tacotron (the one bundled with WaveRNN). Main advantages being code simplicity*, no gaps in spectrograms, and known vocoder compatibility. I'll give it a shot and try to integrate it with this repo.

*Cannot be overstated. From an aesthetic point of view this is the nicest Tacotron I've ever seen. It will be a nice addition if we can get it to work.

ghost commented 4 years ago

@sberryman Did you ever figure out an answer to this question?: https://github.com/fatchord/WaveRNN/issues/139

If you can share anything you learned working with fatchord's tacotron it would save me a lot of time.

sberryman commented 4 years ago

I never pursued it when I didn't get a response. My assumption is that you have to concat the speaker embedding to every time step. So basically you take the speaker embedding, expand a dimension with unsqueeze(dim=0), and then duplicate it based on the number of timesteps you are feeding forward for tacotron. You then concat that duplicated embedding to the spectrogram and that becomes the input to the network. I'm writing this as my brain has already shut down for the day, but hopefully that is enough to get you started?

If not let me know and I can try and help some more.
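In code, that tile-and-concatenate step looks roughly like this (a minimal PyTorch sketch with assumed shapes; whether you concat to the spectrogram frames or to the encoder output, the mechanics are the same):

import torch

def add_speaker_embedding(features, speaker_embed):
    # features: (batch, time, feat_dim), speaker_embed: (batch, embed_dim)
    batch, time, _ = features.shape
    # Repeat the embedding once per time step, then concatenate along the feature axis
    tiled = speaker_embed.unsqueeze(1).expand(batch, time, speaker_embed.size(-1))
    return torch.cat([features, tiled], dim=-1)   # (batch, time, feat_dim + embed_dim)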

ghost commented 4 years ago

Thank you for the prompt response, that gives me enough to work with for now. It looks like Corentin left plenty of clues in the code as to what changed for SV2TTS, like the following which describes what you mention. I find your explanation helpful. I'll let you know if I can't figure this part out.

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/054f16ecc186d8d4fa280a890a67418e6b9667a8/synthesizer/models/tacotron.py#L151-L160

Edit: Reviewed all differences between Corentin's and Rayhane-mamah's repo and concluded this is the only significant one. Now to implement it in pytorch.

ghost commented 4 years ago

Peer review? Here's the updated code in the fatchord version of tacotron.py. That was the easy part... now to get training and synthesize to work.

sberryman commented 4 years ago

It looks like you added the speaker embeddings to the Encoder. I would have guessed you concat after the encoder. If you look at Corentin's code you'll see he concatenates the speaker embeddings to the output of the encoder.

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/054f16ecc186d8d4fa280a890a67418e6b9667a8/synthesizer/models/tacotron.py#L158

ghost commented 4 years ago

Thanks for pointing that out @sberryman ! Pushed a couple of commits that should fix it. It is now done at the very end of the encoder to avoid duplicating the code for both training and synthesis modes. I ran WaveRNN with the bundled pretrained model to confirm that the dimensions are still correct.

https://github.com/blue-fish/Real-Time-Voice-Cloning/compare/2f59298...blue-fish:4e8d89b

sberryman commented 4 years ago

@blue-fish exciting! I can't wait to hear how the training goes!

ghost commented 4 years ago

Here is the general plan:

Check for compatibility of fatchord synthesizer with our vocoder

  1. In the original WaveRNN code, modify the spectrograms to have a range of [-1, 1] instead of [0, 1]
  2. Save off a mel during generation with a WaveRNN pretrained model
  3. Run it through pretrained vocoder in Real-Time-Voice-Cloning repo to check for compatibility
  4. Implement any needed changes into the WaveRNN synthesizer code that we are importing.

Refactor code

  1. I don't have any training data in fatchord's preferred format, so we will make it accept the format in /SV2TTS/synthesizer.
  2. This is a large enough undertaking that everything else can be updated at the same time.

Train a new synthesizer model

  1. I will need help reviewing and modifying the hparams from fatchord's tacotron
    • Here are the current hparams, modified for 16,000 Hz. I may move this file up a level to have a unified hyperparameter set, at least for synthesizer and vocoder.
  2. See how well LibriTTS-based model in #449 works for cloning
    • If at least as good as the current model, then train on LibriTTS.
    • If not, then will train on LibriSpeech to remain consistent with the training process
  3. Test the pytorch-based model and adjust as needed
    • If voice cloning performance is roughly equal or better, then push the pytorch synthesizer to master and publish the new pretrained synthesizer model
    • If not as good, then push it to a branch of the main Real-Time-Voice-Cloning repo for further development if a few more iterations do not close the performance gap
ghost commented 4 years ago

I am taking a break from this. It's one thing to get it to work, but refactoring the existing code is a tedious chore, not to mention keeping the code quality on par with the rest of the repo.

For anyone who wants to attempt this, I suggest using the vocoder as a template since that's already fatchord-style.

ghost commented 4 years ago

Please see #472 for the pull request. There is still much work to do. I am training a model at 16000 Hz with fatchord's default settings, and have found that it does not work well with our vocoder, so I expect that needs to be retrained too. Will release some results when they are more respectable but have already proved to myself that voice cloning will work.

ghost commented 4 years ago

Update: after fixing a bug that I introduced, the torch-based synth works well with the pretrained vocoder.

The output mels from tacotron are scaled to max_abs_value as a result of preprocessing, so they will have a range of [-4, 4] by default. Speech is still intelligible with the erroneous clipping but very degraded. The max_abs_value is used to normalize the mels in the functions where they are consumed.
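Roughly, the symmetric scaling looks like this (a sketch with assumed hparam values; the real helpers live in the synthesizer preprocessing/audio code and differ in details):

import numpy as np

max_abs_value = 4.0      # assumed default hparam
min_level_db = -100.0    # assumed default hparam

def normalize(mel_db):
    # Map mel values in [min_level_db, 0] dB onto the symmetric range [-4, 4], with clipping
    scaled = 2 * max_abs_value * ((mel_db - min_level_db) / -min_level_db) - max_abs_value
    return np.clip(scaled, -max_abs_value, max_abs_value)

def denormalize(mel):
    # Invert the mapping before handing the mel to a component that expects dB values
    clipped = np.clip(mel, -max_abs_value, max_abs_value)
    return (clipped + max_abs_value) * -min_level_db / (2 * max_abs_value) + min_level_db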

ghost commented 4 years ago

The tacotron loss function in the WaveRNN repo does not match the source paper 1703.10135 (see section 3.4 and figure 4). There is a postnet which is supposed to predict linear spectrograms from the mels, but fatchord keeps it in mel scale so it doesn't do much. See https://github.com/fatchord/WaveRNN/issues/123

I tried to work around this by converting the mels back to linear but it is a lossy transformation. A better way of doing it is to preprocess the training wavs into linear spectrograms for the loss calculation. Something to experiment with at a later date, it already works quite well without this feature.

ghost commented 4 years ago

First audio samples with pytorch synthesizer: samples_tf278k_pt67k.zip

Samples for pytorch synth, trained on LibriTTS for an equivalent of 67k steps with batch size of 36. (My actual batches were smaller due to limited GPU memory.) I also synthesize the same utterances with tensorflow (LibriSpeech_278k) for comparison. I use a random seed of 1, enhance vocoder output to trim silences (pytorch did not need it but tf did). The original vocoder (428k) is used in all cases.

This is a 16 kHz model using fatchord's default settings, which are optimized for single speaker. The model is 140 MB (half the size of the current one); most of that is the optimizer state, retained so training can be stopped and restarted safely. (Thanks to @CorentinJ for finding this issue in https://github.com/fatchord/WaveRNN/issues/87#issuecomment-499456228 and @TheButlah for implementing it.)

For lack of a better term, the tensorflow versions have more "depth" to the voice. Otherwise I find the result to be similar. Neither model works particularly well on these speakers (VCTK p240 and p260).

The pytorch model fails to align for longer inputs, resulting in gaps and stuttering speech. We will see if additional training fixes it. Edit: Still the case with 92k equivalent steps. I have discarded this model and started training a new one.

ghost commented 4 years ago

Restarted training with adjusted hparams to make the tacotron model more similar to this repo. Switched back to Corentin's version of symbols.py including the use of the EOS symbol "~" even though fatchord's tacotron doesn't require it for a stop prediction. See here for more details: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/449#issuecomment-671071445

The pytorch fork is cleaned up and usable for anyone who wishes to test it. If anyone can contribute hardware to train a model that will help bring this to fruition faster.

Use the following command to clone it, or download a zip file.

git clone -b 447_pytorch_synthesizer --depth 1 https://github.com/blue-fish/Real-Time-Voice-Cloning
ghost commented 4 years ago

The synthesizer gaps (#53) will be mostly fixed by switching to pytorch. I think it is an artifact of the training data more than the implementation. Reducing max_mel_frames to 500 has made a big impact. Also now one can synthesize really long texts if the attention keeps up. (fatchord stops synthesizing whenever the synthesizer predicts empty frames, so the model may "quit" in the middle of a text)

Rate of speech (#347) may be an issue but it at least seems to be consistent and affects short text inputs too. (Edit: I also think this is from the training data which has similar speech rate.) We might add an option to the toolbox UI to postprocess the output with a speed multiplier that doesn't affect pitch.
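One possible way to implement that option (just a sketch, not in the toolbox yet; the filename and rate below are placeholders) is to time-stretch the synthesized wav with librosa, which changes speed without affecting pitch:

import librosa
import soundfile as sf

wav, sr = librosa.load("synthesized.wav", sr=16000)     # placeholder path
faster = librosa.effects.time_stretch(wav, rate=1.15)   # >1.0 speeds up, <1.0 slows down
sf.write("synthesized_fast.wav", faster, sr)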

At my level of compute power, punctuation causes more problems than it solves, but I haven't given up yet. My model is currently on par with the tensorflow model if I clean the input of all punctuation. Maybe slightly worse in terms of voice similarity and quality (it's quite similar to the samples previously shared), but better in terms of gaps and other weird behavior. It still has a few more days to go before completing its training schedule.

Ananas120 commented 4 years ago

Hello @blue-fish, I just discovered this repo and saw your different posts. I can't help you implement it in pytorch because I code in tensorflow 2.0, but if you want, I have already recoded Tacotron-2 in tensorflow-2 based on the NVIDIA/DeepLearningExamples repo and retrained it in French (with transfer learning; I converted the weights from the NVIDIA pretrained model). I also have a hardware limitation with my GPU memory. A little trick I used was to train on the whole spectrogram (of any length) but in parts: for example, for a spectrogram of length 500, I run a for loop where I optimize the model every 50 frames and keep track of the last RNN state to pass it to the next 50-frame training step (I know it's not a good idea, but it solved my memory limitation).
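A minimal sketch of that chunked training (hypothetical model interface and placeholder loss, written in PyTorch style; my actual code is tensorflow but the idea is the same):

import torch
import torch.nn.functional as F

def train_on_chunks(model, optimizer, mel, chunk=50):
    # Optimize on fixed-size slices of one spectrogram, carrying the RNN state forward
    state = None
    for start in range(0, mel.size(1), chunk):
        piece = mel[:, start:start + chunk]
        pred, state = model(piece, state)      # assumed: model returns prediction and new RNN state
        loss = F.mse_loss(pred, piece)         # placeholder loss for the sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Detach so gradients do not flow across chunk boundaries
        state = tuple(s.detach() for s in state)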

Tomorrow I will try to re-implement the encoder part in tf2.0 and convert the weights (I made scripts to convert weights from tf to pytorch and from pytorch to tf, if that can help you).

For the tacotron-2 part, I will see how to do it, because I would like to retrain it in French but I have only 1 GPU, and Corentin trained it in 1 week on 4 GPUs with more memory than mine, so... If you have any implementation and checkpoint of the model in pytorch, it would be easier for me to convert it to tf2.0 (I don't understand tf1.x and my implementation seems to be a bit different from the implementation of this repo).

For the vocoder, I use the Waveglow model from NVIDIA, which is really good (I find) because it converts spectrograms (in any language, etc.), so if the quality of the spectrogram is good, the audio will be good too (without retraining). For example, audio -> mel -> Waveglow gives an audio that sounds exactly like the original one. The only problem is that it is trained on 22050 Hz and not 16 kHz.

ghost commented 4 years ago

For the tacotron-2 part, I will see how to do it, because I would like to retrain it in French but I have only 1 GPU, and Corentin trained it in 1 week on 4 GPUs with more memory than mine, so...

Hi @Ananas120 , I used to think synthesizer training was impractical on my basic GPU until I tried it. It is slow but as long as you do not require a perfect model it is quite usable after 2 or 3 days of nonstop training. I use short utterances mainly because the prosody is more consistent, otherwise the model makes long gaps in the middle of sentences when synthesizing. Smaller memory footprint and faster training are nice side effects. For smaller datasets you may have no choice but to split utterances like you are doing.

For the vocoder, I use the Waveglow model from NVIDIA, which is really good (I find) because it converts spectrograms (in any language, etc.), so if the quality of the spectrogram is good, the audio will be good too (without retraining)

Yes, a multi-speaker vocoder can also be thought of as a universal vocoder. It generalizes to new speakers and even languages without retraining. That is also the case with the pretrained WaveRNN from this repo. In my opinion, the spectrogram quality is not that good and would not benefit from a better vocoder. I might try out Waveglow if we train a 22,050 Hz synthesizer model (someone is working on it in #449).

You should read the new paper on the HooliGAN vocoder by Corentin's colleague fatchord: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-670367298 Faster vocoding + higher quality have me very intrigued. I don't think I can implement it properly though so I will wait for someone to write an open-source version.

I made scripts to convert weights from tf to pytorch and from pytorch to tf, if that can help you

Please share those. I might convert the weights from the tensorflow synthesizer over to pytorch in case my model doesn't work out.

If you have any implementation and checkpoint of the model in pytorch, it can be easier for me to convert it to tf2.0 (i don’t understand tf1.x and my implementation seems to be a bit different of the implementation of this repo)

You can find a working pytorch implementation on the 447_pytorch_synthesizer branch of my fork. The pretrained model is still in work.

ghost commented 4 years ago

@Ananas120 You might also be interested in my tf2.0 fork, on branch 370_tf2_compat

I ran the tf1.x --> tf2.0 conversion scripts and fixed most of the obvious problems. I spent quite a few hours on the conversion but couldn't get past an error message (look at history of this link: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/370#issuecomment-648798434). If you can figure it out, it could make your job easier to get a tf2.0 version of the synthesizer.

The hard part is not making a tf2.0 tacotron (there are plenty of those out there), but rather the integration of tacotron with the rest of the repo. There are quite a few things to consider if you care about preserving the functionality and UI.

Ananas120 commented 4 years ago

@blue-fish thank you for the information, I will post my conversion code on my GitHub today (I will let you know when it is posted).

In fact, my first target is to integrate this repo into my own project (I copy many existing models, experiment with them to learn how they work, and make fun applications of them).

[Moderator note: Moved question about speaker encoder to #484]

Ananas120 commented 4 years ago

@blue-fish it's done, I pushed my converter code to my GitHub. It's not perfect and is mostly based on the layer weight order, but in most cases it works well (I transferred Tacotron-2 weights from NVIDIA to my tf model with this code, but I had to create the layers in the same order as the pytorch model).

ghost commented 4 years ago

@Ananas120 Thank you! So it looks like after constructing the pytorch model with layers in the same order and size as the tf model, the weights should transfer over.

I'll need to study the Rayhane Tacotron-2 (tensorflow) and Fatchord Tacotron-1 (pytorch) and see how far off they are in terms of model structure.
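For reference, an order-based transfer would look roughly like this sketch (hypothetical; it assumes both models declare their layers in the same order and that 2-D kernels only need a transpose, while convolution kernels would need their own permutation):

import numpy as np
import torch

def transfer_by_order(pt_model, tf_arrays):
    # tf_arrays: list of numpy arrays read from the TF model, in matching declaration order
    state = pt_model.state_dict()
    for (name, param), array in zip(state.items(), tf_arrays):
        if array.ndim == 2:
            array = array.T   # TF Dense kernels are (in, out); torch.nn.Linear stores (out, in)
        assert tuple(array.shape) == tuple(param.shape), (name, array.shape, param.shape)
        state[name] = torch.from_numpy(np.ascontiguousarray(array))
    pt_model.load_state_dict(state)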

Ananas120 commented 4 years ago

If you want, I can share my Tacotron-2 tf2.0 implementation. I think they are really similar, but the big difference is that he uses a Zoneout-LSTM and I use a normal LSTM (the difference is essentially a dropout, if I understand the layer correctly), so it doesn't matter in terms of weights. I will investigate the variables in the checkpoint to see if I can use them in my model or if there is really a difference in the architecture...

Edit: I began training my siamese network and discovered that when I L2-normalize my embedding before calculating the distance, the loss is 0.6933 at around step 500 and doesn't change up to step 1000, where I stopped training (not interesting). Without normalization, the loss drops from 7 to 1 (from step 1 to 1000), which is much more interesting, but I don't know if it can serve as input to the tacotron model (because of the non-normalization)... Re-edit: I tried with normalization but with an embedding size of 64 and got good results; the loss is at 0.33 after only 3k steps (1 epoch is 2k steps) on a mix of the LibriSpeech (en) and Common Voice (fr) datasets.

I will try to convert the encoder (3-layer RNN) to tf and convert the weights, to see if I can obtain interesting results with that architecture and my siamese architecture, and also to compare the loss and embedding plots between my model and the 3-layer RNN encoder.

Edit 2: I converted the RNN encoder to tf2.0; it was really easy. The only little trick needed to use my weight-converter script was to rename the LSTM weights (pytorch uses lstm.weight_xx_yy to name the weights of a stacked LSTM, but my script gets the layer name by removing the last part, everything after the last '.', so I had 12 weights for "lstm"). I renamed the first 4 weights lstm_0, weights 4 to 8 lstm_1 and weights 8 to 12 lstm_2, and everything works fine! I tested and the difference between the 2 models is 5e-6 (max) and 5e-7 (mean) for input np.ones((10, 1000, 40)), so really good!

Ananas120 commented 4 years ago

First of all, my GE2E loss implementation is really bad because it uses far too much memory... That's because a tensorflow graph accepts only tf operations (neither numpy functions nor indexing), so I used matrix tricks with a mask to select the right centroid, but I then have matrices of shape (N, M, N, embedding_dim) three times, which is really expensive, especially for a 256-dim embedding... If you find a better open-source implementation, please let me know.
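For reference, a memory-lighter formulation computes the centroids once and patches the same-speaker entries with a mask, so nothing bigger than (N, M, N) is materialized; a rough PyTorch sketch (assumed input shape: N speakers, M utterances, D dims):

import torch
import torch.nn.functional as F

def ge2e_loss(embeds, w=10.0, b=-5.0):
    # embeds: (N, M, D), L2-normalized utterance embeddings
    N, M, D = embeds.shape
    centroids = F.normalize(embeds.mean(dim=1), dim=-1)                                    # (N, D)
    # Exclusive centroids: each utterance is left out of its own speaker's centroid
    excl = F.normalize((embeds.sum(dim=1, keepdim=True) - embeds) / (M - 1), dim=-1)       # (N, M, D)

    sim_all = torch.einsum("nmd,kd->nmk", embeds, centroids)                               # (N, M, N)
    sim_same = torch.einsum("nmd,nmd->nm", embeds, excl)                                   # (N, M)
    same_mask = torch.eye(N, dtype=torch.bool, device=embeds.device).unsqueeze(1)          # (N, 1, N)
    sim = w * torch.where(same_mask, sim_same.unsqueeze(-1), sim_all) + b                  # (N, M, N)

    target = torch.arange(N, device=embeds.device).repeat_interleave(M)                    # true speaker per row
    return F.cross_entropy(sim.reshape(N * M, N), target)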

But I compared my siamese encoder model and the 3-layer RNN encoder and it's really fun because I get a 2-times better loss (0.8 for me and 1.5 for the RNN). (I calculated the loss as the mean categorical cross-entropy of the similarity matrix.)

ghost commented 4 years ago

For anyone who's following, I am still disappointed in the model that has been training (currently at 78k steps, equivalent batch size 36). It still has these issues:

  1. Voice quality is not as good
  2. Gaps and voice drops off when synthesizing longer utterances (a semi-frequent occurrence)
  3. Does not handle punctuation well

If it is a matter of additional training (which we do not know for sure), it will take over a week of nonstop training to match the LibriTTS_200k model in #449. As much as I want to see #472 merged it can wait until we have a good pretrained model to ship with it.

To keep myself occupied while it is training I am going to test out a pytorch-based Tacotron2 and see if I can convert the weights from the existing synth over.

Ananas120 commented 4 years ago

Tomorrow I will build the pipeline to train the model too (with my tf2.0 architecture code) and my siamese encoder, and I will tell you if I get better results.

To get better results, you could try partial transfer learning from a pretrained model like the NVIDIA pytorch implementation (I say partial transfer because not all layers will have the same shape, due to the additional embedding vector), but as I have tested it for other models, a partial transfer works well too (the code for tensorflow partial transfer is in my converter code on my GitHub; you can easily adapt it to pytorch, I think).

If you prefer to use a tf implementation, I can share my tf2.0 implementation with you and you can transfer the weights from the pretrained NVIDIA model.

PS: you use a batch size of 32 :o What is your GPU memory and the max_length of the mel spectrogram? With my implementation I can't do batch 32... (I will investigate because I think the tensorflow CuDNNLSTM is very memory efficient)

Edit: I just stopped training my siamese network (3 sec fixed-length audio) at step 15 and it is still improving! 95% binary accuracy and 0.02 BCE loss, it's so good!

ghost commented 4 years ago

We already have a pretrained tensorflow 1.x synthesizer that is based on Rayhane-mamah's Tacotron2. I want to convert those weights to pytorch. My current pt model is Tacotron1 so it is incompatible. I'll find a torch-based Tacotron2 code, add the speaker embedding part and make it line up with the tf-based Tacotron2. Then hope for the best when I run your conversion script.

Edit: I only have 4 GB, everyone else trains at batch=36 so I convert my own numbers to make them comparable. At this point in the training schedule I'm doing batch=6 with max_length=500. Not limited by memory. This is my first synth from scratch and I am experimenting to see what works well.

Ananas120 commented 4 years ago

Here is the pytorch-based implementation I used as a basis to recode it in tensorflow: https://github.com/NVIDIA/tacotron2

I already trained the (pretrained) model to retrain it in French but forgot my training parameters... I will let you know when I find my training code again... Good luck! You can also look at the issues in the NVIDIA or other tacotron repos; I have already seen issues where the model speaks "too fast".

ghost commented 4 years ago

Thank you @Ananas120 . Currently I am trying to get the tacotron2 in Mozilla TTS to work, there will be easier integration for global style tokens (#230) since they have already implemented the feature. They also have a more active community where I can request help if needed. The tacotron2 you linked is my next choice.

Ananas120 commented 4 years ago

I will investigate the tf1.x checkpoint today and see if it can be loaded by my tf2.0 model; if so, I can convert the weights to the NVIDIA implementation and send you the pytorch checkpoint. Note: the results will not be exactly the same because of the non-use of the Zoneout-LSTM (I don't know if it will change anything, in fact, because I think it's just a difference in training dropout, but we never know), but for the rest of the model, I think they are really similar.

ghost commented 4 years ago

@Ananas120 If you do that, it would be greatly appreciated. In that case I'll try to get the NVIDIA tacotron2 working with our repo. Supposing it works, I will still try to converge to Mozilla TTS because it is actively developed.

Ananas120 commented 4 years ago

No problem. If you want to compare your model and the checkpoint, you can try to load the tf checkpoint variables individually (with a combination of tf.train.list_variables(ckpt_path) and tf.train.load_variable(ckpt_path, name), or something like that) and inspect the variables by name, by shape and also by order to see where your variables differ from the checkpoint (my get_tf_layers code extracts "layer variables" from a list of all variables based on name, so you can also compare based on layers).

Ananas120 commented 4 years ago

@blue-fish Here is a list of variables in the checkpoint and a list of my variables

TF1.x checkpoint (this repo):

Tacotron_model/inference/decoder/Location_Sensitive_Attention : [[128], [128]]
Tacotron_model/inference/decoder/Location_Sensitive_Attention/location_features_convolution : [[32], [31, 1, 32]]
Tacotron_model/inference/decoder/Location_Sensitive_Attention/location_features_layer : [[32, 128]]
Tacotron_model/inference/decoder/Location_Sensitive_Attention/query_layer : [[1024, 128]]
Tacotron_model/inference/decoder/decoder_LSTM/multi_rnn_cell/cell_0/decoder_LSTM_1 : [[4096], [2048, 4096]]
Tacotron_model/inference/decoder/decoder_LSTM/multi_rnn_cell/cell_1/decoder_LSTM_2 : [[4096], [2048, 4096]]
Tacotron_model/inference/decoder/decoder_prenet/dense_1 : [[256], [80, 256]]
Tacotron_model/inference/decoder/decoder_prenet/dense_2 : [[256], [256, 256]]
Tacotron_model/inference/decoder/linear_transform_projection/projection_linear_transform_projection : [[160], [1792, 160]]
Tacotron_model/inference/decoder/stop_token_projection/projection_stop_token_projection : [[2], [1792, 2]]
Tacotron_model/inference/encoder_LSTM/bidirectional_rnn/bw/encoder_bw_LSTM : [[1024], [768, 1024]]
Tacotron_model/inference/encoder_LSTM/bidirectional_rnn/fw/encoder_fw_LSTM : [[1024], [768, 1024]]
Tacotron_model/inference/encoder_convolutions/conv_layer_1_encoder_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/encoder_convolutions/conv_layer_1_encoder_convolutions/conv1d : [[512], [5, 512, 512]]
Tacotron_model/inference/encoder_convolutions/conv_layer_2_encoder_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/encoder_convolutions/conv_layer_2_encoder_convolutions/conv1d : [[512], [5, 512, 512]]
Tacotron_model/inference/encoder_convolutions/conv_layer_3_encoder_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/encoder_convolutions/conv_layer_3_encoder_convolutions/conv1d : [[512], [5, 512, 512]]
Tacotron_model/inference : [[66, 512]]
Tacotron_model/inference/memory_layer : [[768, 128]]
Tacotron_model/inference/postnet_convolutions/conv_layer_1_postnet_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_1_postnet_convolutions/conv1d : [[512], [5, 80, 512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_2_postnet_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_2_postnet_convolutions/conv1d : [[512], [5, 512, 512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_3_postnet_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_3_postnet_convolutions/conv1d : [[512], [5, 512, 512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_4_postnet_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_4_postnet_convolutions/conv1d : [[512], [5, 512, 512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_5_postnet_convolutions/batch_normalization : [[512], [512], [512], [512]]
Tacotron_model/inference/postnet_convolutions/conv_layer_5_postnet_convolutions/conv1d : [[512], [5, 512, 512]]
Tacotron_model/inference/postnet_projection/projection_postnet_projection : [[80], [512, 80]]
 : [[]]

My implementation (based on NVIDIA):

encoder_embeddings : [(148, 512)]
encoder_conv_1 : [(5, 512, 512), (512,)]
batch_normalization : [(512,), (512,), (512,), (512,)]
encoder_conv_2 : [(5, 512, 512), (512,)]
batch_normalization_1 : [(512,), (512,), (512,), (512,)]
encoder_conv_3 : [(5, 512, 512), (512,)]
batch_normalization_2 : [(512,), (512,), (512,), (512,)]
encoder_lstm/forward_lstm/lstm_cell_1 : [(512, 1024), (256, 1024), (1024,)]
encoder_lstm/backward_lstm/lstm_cell_2 : [(512, 1024), (256, 1024), (1024,)]
tacotron2/decoder/prenet/prenet_layer_1 : [(80, 256)]
tacotron2/decoder/prenet/prenet_layer_2 : [(256, 256)]
tacotron2/decoder/attention_rnn : [(768, 4096), (1024, 4096), (4096,)]
tacotron2/decoder/location_attention/query_layer : [(1024, 128)]
tacotron2/decoder/memory_layer : [(512, 128)]
tacotron2/decoder/location_attention/value_layer : [(128, 1)]
tacotron2/decoder/location_attention/location_layer/location_conv : [(31, 2, 32)]
tacotron2/decoder/location_attention/location_layer/location_dense : [(32, 128)]
tacotron2/decoder/decoder_rnn : [(1536, 4096), (1024, 4096), (4096,)]
tacotron2/decoder/linear_projection : [(1536, 80), (80,)]
tacotron2/decoder/gate_output : [(1536, 1), (1,)]
postnet_conv_1 : [(5, 80, 512), (512,)]
batch_normalization_3 : [(512,), (512,), (512,), (512,)]
postnet_conv_2 : [(5, 512, 512), (512,)]
batch_normalization_4 : [(512,), (512,), (512,), (512,)]
postnet_conv_3 : [(5, 512, 512), (512,)]
batch_normalization_5 : [(512,), (512,), (512,), (512,)]
postnet_conv_4 : [(5, 512, 512), (512,)]
batch_normalization_6 : [(512,), (512,), (512,), (512,)]
postnet_conv_6 : [(5, 512, 80), (80,)]
batch_normalization_7 : [(80,), (80,), (80,), (80,)]

Main differences:

  • Some biases are used in mine but not in the other (or the other way around)
  • The order and names are not the same (really problematic for converting them...)
  • He uses 2 frames per step (I use 1 per step)
  • He uses a linear layer as the projection for the postnet (mine is a convolution)
  • I don't find the "value layer" in the attention layer; instead I have this:
    Tacotron_model/inference/decoder/Location_Sensitive_Attention : [[128], [128]]
    The full names of the variables are:
    Tacotron_model/inference/decoder/Location_Sensitive_Attention/attention_bias
    Tacotron_model/inference/decoder/Location_Sensitive_Attention/attention_variable_projection
    And I don't have the equivalent in my layer, very strange

So... there are a lot of changes in the architecture. Do you think it is worth trying to convert the weights to my model (if I make the changes) or not? Because I can make the changes in terms of architecture, but I can't be sure the forward pass will be the same...

I think I will not try to convert the model and will directly retrain one from my existing model (because, in the end, my target is to retrain it, so retraining one model or another comes to the same thing). But if you want to try, I can share my tf2.0 implementation and you can modify it to have the same architecture... after that, rearranging the variables is boring but not difficult.

Note: converting from tensorflow to pytorch is much easier because the state dict is based on names (and not on a list based on the structure), so if you name your modules and weights like the checkpoint, it's very easy to match them; you just need to create a model with these modules / sub-modules and the corresponding shapes. (How do you create a 'code block' on GitHub? That way I could post my code to get the variables of the checkpoint and load them; it could help you get an idea of the variables to create.)

Note 2: my siamese encoder achieves 96% binary accuracy with 0.1 BCE loss at step 20, it's awesome! (and I think it can improve more) and the plot is really nice too

ghost commented 4 years ago

Main differences:

  • Some biases are used in mine but not in the other (or the other way around)
  • The order and names are not the same (really problematic for converting them...)
  • He uses 2 frames per step (I use 1 per step)
  • He uses a linear layer as the projection for the postnet (mine is a convolution)
  • I don't find the "value layer" in the attention layer; instead I have this: Tacotron_model/inference/decoder/Location_Sensitive_Attention : [[128], [128]] (the full names of the variables are Tacotron_model/inference/decoder/Location_Sensitive_Attention/attention_bias and Tacotron_model/inference/decoder/Location_Sensitive_Attention/attention_variable_projection). And I don't have the equivalent in my layer, very strange

So... there are a lot of changes in the architecture. Do you think it is worth trying to convert the weights to my model (if I make the changes) or not? Because I can make the changes in terms of architecture, but I can't be sure the forward pass will be the same...

Thank you for that analysis. The differences are too significant for a model conversion to work at this time. It is going to be a 3-step process:

  1. Integrate a pytorch-based tacotron2 with this repo.
  2. Modify tacotron2 to match the tensorflow-based implementation in this repo to the extent possible.
  3. Convert the weights.

Or a 2-step process:

  1. Rewrite this repo's tacotron2 in pytorch
  2. Convert the weights

Since you found significant differences, the 2-step process is likely the better approach (a lot like https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28 ).

That way I could post my code to get the variables of the checkpoint and load them; it could help you get an idea of the variables to create.

It would be very helpful if you could share the code. You can make code blocks by wrapping the code with ```

Ananas120 commented 4 years ago

Here is the code (it's just a modified version of my get_tf_layers that uses checkpoint variables instead of model variables):

import tensorflow as tf

def get_ckpt_layers(ckpt):
    layers = {}
    for name, shape in ckpt:
        layer_name = '/'.join(name.split('/')[:-1])
        # Skip duplicate copies of variables in the checkpoint
        if len(layer_name.split('/')) > 3 and layer_name.split('/')[2] == 'inference': continue
        # print(name) # uncomment to show the full name of variable
        if layer_name not in layers: layers[layer_name] = []
        layers[layer_name].append(shape)
    return layers

path = 'your_path_to_ckpt_dir'

ckpt = tf.train.list_variables(path) # get variables in the checkpoint as list of tuple [(name, shape), ...]

layers = get_ckpt_layers(ckpt)
for name, shapes in layers.items():
    print("{} : {}".format(name, shapes))

### To load variables
variables = {n: tf.train.load_variable(path, n) for n, _ in ckpt}

I am adapting my model to accept the embedding vector. I think I will only launch the first training tomorrow or this weekend, because I also have to create all the embedding vectors for my 2 datasets, and before that I would like to train my siamese network some more.

2 questions to be sure before my training:

  • CommonVoice fr has many accents; I think it could be better to use only the French accent, no?
  • In fact, the only difference from the original Tacotron-2 model is that the output of the encoder is concatenated with the speaker embedding, with no other changes to the architecture itself?

Good luck with your conversion to pytorch!

ghost commented 4 years ago

Thank you so much for the code! I am still learning so that saves me a lot of time.

2 questions to be sure before my training :

  • CommonVoice fr has many accents; I think it could be better to use only the French accent, no?

You're talking about spoken accent (and not text accents like è)? Ideally you would label the various accents using a style token (#230) but we do not have the feature yet. That would enable you to synthesize a specific accent later on. I think it is preferable to not mix accents in your training data. But the training process should be robust and still produce a usable result even if you have multiple accents in your data.

As for text accents make sure your synthesizer/utils/symbols.py has all of the symbols that are used in your transcripts (at least the ones you want to train on). This is what I'm talking about (other tacotron implementations also use the file): https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/3eb96df1c6b4b3e46c28c6e75e699bffc6dd43be
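For a French dataset, that might look roughly like this (a hypothetical sketch following the pattern of that file, not its exact contents):

# Hypothetical symbols.py variant for French transcripts
_pad = "_"
_eos = "~"
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!'(),-.:;? "
_french = "àâäçéèêëîïôöùûüÿœæÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŸŒÆ"

symbols = [_pad, _eos] + list(_characters + _french)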

  • In fact, the only difference from the original Tacotron-2 model is that the output of the encoder is concatenated with the speaker embedding, with no other changes to the architecture itself?

Yes, this is correct. That's how it is in the SV2TTS paper. I also painstakingly diffed Rayhane's tacotron2 and this one and came to the same conclusion.

Ananas120 commented 4 years ago

No problem, I am still learning too ;)

I updated my converter code to make it more robust. I also wrote the whole pipeline and the model, so in theory the only thing I need to add is the speaker embedding and I can run it! My training hyperparameters were 50 frames per step (splitting) with a batch size of 16 and a max text length of 150 characters, but I have 6 GB of GPU memory so... I suppose my model has more weights... I had a training time of 2-4 sec/step if I remember correctly (1 step = 1 full audio, so 8 optimization steps of 50 frames for a 400-frame mel spectrogram).

Ananas120 commented 4 years ago

I just launched the embedding process for my dataset but it takes a very long time... The most surprising thing is that it is not the embedding part that is slow but the audio loading (just the librosa.load function call), which is limited to around 10 loads/sec (and for 260,000 audios, that takes a lot of time...), so I think I can only run my first tacotron-2 training tomorrow.

I will also share my embeddings on my GitHub if it can help (it's for the CommonVoice (fr part) and SIWIS (fr) datasets). I embedded with the final model, which has 97.5% accuracy (training and test set) and a BCE loss of 0.02 on the validation set.

For the tacotron-2, I will use my pretrained model with partial transfer learning; I hope it can save me a lot of time (no need to learn the attention mechanism, for example). I hope everything will work correctly!

ghost commented 4 years ago

Tacotron2 implementation is canceled for now. Started on it and it is too much work to convert a pretrained model with known issues.

Speaking of known issues, this is the list for the pytorch synthesizer:

Ananas120 commented 4 years ago

I just launched my tacotron training but I have a really strange issue with the RNN and I don't know why... I relaunched with a small modification and will see if it works.

I launched with a 120,000-utterance training set and 40,000 in the validation set, with a batch size of 16 and 50-frame splitting (max 150 characters of input text and no limit on the mel). One step takes around 11 sec, so it's very slow... I am therefore launching for only 3 epochs and will see the results... Since I used a pretrained model, I already have a loss of 2.x at step 200 and the attention mechanism is already learned (I suppose), so I hope it will be OK with only 3 epochs.

The GTA is not necessary; you can train your model without using it. For dropout, make sure it is always active (look at the NVIDIA implementation to see how they do that, I can't help you there... in tensorflow I just call K.dropout directly without using the training argument, so it's active in inference too).

Good luck, and I will tell you my results when I have something... I also created a predictor callback to run a prediction and inference every 1000 steps, but it only generates the mel and not the corresponding audio, so it's not very meaningful, but I can still see the attention mechanism and the similarity between mels ^^

Also, thanks to the pretraining, the gate loss is approximately 0, so that's also a good thing and a great advantage.

ghost commented 4 years ago

For dropout, make sure it is always active (look at the NVIDIA implementation to see how they do that, I can't help you there... in tensorflow I just call K.dropout directly without using the training argument, so it's active in inference too)

Thanks for the suggestion @Ananas120 , I pushed a couple of commits to always enable the dropout in the prenet, and it is working as expected. The "random seed" function can be used to get deterministic behavior.
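For reference, keeping dropout active at inference boils down to calling the functional dropout with training=True; a rough prenet sketch (hypothetical layer sizes, not the exact code that was pushed):

import torch
import torch.nn.functional as F

class PreNet(torch.nn.Module):
    def __init__(self, in_dims=80, fc1_dims=256, fc2_dims=256, p=0.5):
        super().__init__()
        self.fc1 = torch.nn.Linear(in_dims, fc1_dims)
        self.fc2 = torch.nn.Linear(fc1_dims, fc2_dims)
        self.p = p

    def forward(self, x):
        # training=True keeps dropout on even when the module is in eval mode
        x = F.dropout(F.relu(self.fc1(x)), p=self.p, training=True)
        x = F.dropout(F.relu(self.fc2(x)), p=self.p, training=True)
        return x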

I will push a pretrained model to #472 tomorrow. My GPU is down for the moment so I can't continue the training but it is good enough for evaluation purposes. It still has gaps (due to punctuation) and leaves a lot to be desired.

Ananas120 commented 4 years ago

Good! As for me, I keep trying to run it but hitting bugs, and I really don't understand some strange things... For instance, tensorflow runs out of memory for some optimization steps but not all, which I don't understand because all my chunks are of size 45 max (because of my splitting). Another thing is a bug that crashes the whole process, but it appears randomly x) The error is "Internal error : failed to call ThenRNNBackward with model config ..." in case you have already faced this issue...

Now I am trying with a try/except and will see if it solves the problem. My training process code is so bad x) but if it works I will be happy haha

Edit: most of the time the error occurred at step 30, 50, ... and now that I added the try/except and want to test whether it works or not, the error doesn't appear! Step 188 now and nothing x) Loss of 2.8 at step 288, not bad I think (my loss is the sum of the BCE gate loss, the MSE mel loss and the MSE mel-postnet loss, so in fact I only have a mel loss of 1.4).

Ananas120 commented 4 years ago

It seems to be good! I downgraded tensorflow to version 2.1 and the error still has not occurred at step 660! (It makes sense, because I already trained this model with tf2.1 3 months ago without any issue...) Unfortunately I had to change my callbacks because versions 2.1 and 2.3 are not identical, so I have no progress bar and can't see the loss before it ends... (I normally get predictions every 1k steps to check the quality of the predicted spectrogram, but that's all.) To see progress I use the tqdm progress bar, but no metrics are shown, so... I hope it improves well :3 (in the last tries with the bugs, the loss decreased to 2.9 around step 200, so not so bad). Results in... 3 days I think!

Ananas120 commented 4 years ago

@blue-fish so... my results are bad... After 1-2k steps my loss was around 1.0 and now after 5-6k steps my loss is around 1.2, so... I don't really understand why it increases... I will train until tomorrow, and if the loss is not below 1.0 after the first full epoch, I will stop training and try to optimize my model to run much faster, because 15 s/step is really too slow. I don't know if the result is bad because I use a different speaker encoder, or because I am impatient and just need to wait more, or because my splitting training method works badly for this architecture.

I think it can have an impact because I make 6 or more optimization steps on the same audio (different frames) but the same speaker, so perhaps it would be better to do only 1 optimization step on the whole spectrogram.

So I have a question: how are gradients computed in an RNN? Because something I can try is to get the gradients for each sub-part of the spectrogram and then sum them or take their mean (I don't know which I should do), and then make a single optimization step with the summed sub-gradients (so only 1 optimization per step).

What do you think about that ?
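For reference, summing (or averaging) per-chunk gradients before a single update is plain gradient accumulation; a rough tf.GradientTape sketch (hypothetical model/loss interface, and it ignores the RNN state carry-over):

import tensorflow as tf

def accumulated_step(model, optimizer, loss_fn, mel, chunk=50):
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    n_chunks = 0
    for start in range(0, mel.shape[1], chunk):
        piece = mel[:, start:start + chunk]
        with tf.GradientTape() as tape:
            loss = loss_fn(piece, model(piece, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Accumulate; some gradients can be None if a variable is unused
        accum = [a if g is None else a + g for a, g in zip(accum, grads)]
        n_chunks += 1
    # One optimizer update for the whole utterance, using the averaged gradients
    optimizer.apply_gradients(
        [(a / n_chunks, v) for a, v in zip(accum, model.trainable_variables)])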

Note: I don't know if I can do this with tensorflow, but I can try if it can help my training...

I will also try to optimize my decoder because I think I do too many single operations that could be done at once, so I hope it will help it run faster and use less memory (to run on more than 50 frames per block).

Edit: I will also share my tf2.0 tacotron-2 implementation, so if you know a little tensorflow, don't hesitate to tell me if you see something that can be optimized! ;)

ghost commented 4 years ago

@Ananas120 It is my understanding that the decoder uses its previous state when generating the output for the next frame. For training we typically override that with the ground-truth mel spectrogram (teacher forcing), so the previous frame of the ground-truth mel plus the encoder output are used to predict the mel for the current frame. Are you doing this?
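In code, the teacher-forced loop looks roughly like this (a sketch with a hypothetical decoder interface):

import torch

def decode_with_teacher_forcing(decoder, encoder_out, target_mels):
    # target_mels: (batch, T, n_mels) ground-truth frames
    batch, T, n_mels = target_mels.shape
    prev = torch.zeros(batch, n_mels, device=target_mels.device)   # <GO> frame
    state, outputs = None, []
    for t in range(T):
        frame, state = decoder(prev, encoder_out, state)
        outputs.append(frame)
        prev = target_mels[:, t]   # teacher forcing: feed the ground truth, not `frame`
    return torch.stack(outputs, dim=1)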

I do not know the answer to your question about the RNN gradient calculation.

Are you aware of TensorflowTTS? You might want to try that out, or compare codes.

Ananas120 commented 4 years ago

Thank you, I didn't know about this repo! I think I will try to copy their implementation of Tacotron, so I hope it will work better.

Yes, I use the previous state to generate the next frame (I keep track of the new internal state during the optimization iterations).

ghost commented 4 years ago

Yes, I use the previous state to generate the next frame (I keep track of the new internal state during the optimization iterations).

Just to be sure, are you getting this previous state from the model, or the ground-truth spectrogram? (It should be from the spectrogram)

Ananas120 commented 4 years ago

It depends what you call "state". For me, the state is the internal state of the LSTMCell of the decoder, so I use the state of the model. But for you, the "state" refers (I suppose) to the decoder input (the spectrogram), and then yes, of course I use the true spectrogram as input and not the last decoder output.

In the end I am not copying the model from TensorflowTTS because their architecture looks like the architecture of this repo (and so is different from mine, so I can't use transfer learning with my model), but I managed to optimize my model to run in graph mode, which is much faster.

The drawback is that I have to reduce the maximum number of frames to 25 instead of 50, but even so the training time is divided by 2. I just have an error I don't understand, but I will solve it tomorrow and relaunch the training. I will also test taking the mean (or sum) of all the gradients to make 1 optimization step instead of n_frames / 25, which is not really good x)

I shared my model implementation with the training step on my github if you want to see it ^^

Edit: in the end I didn't manage to sum or average the gradients because of multiple errors (like OOM). So I am running it as-is with the model in graph mode (which multiplies the speed of the call() method by 4), with these parameters:

A training step is around 14 sec (like the old model) but with a 4x bigger batch size, so only around 8 h/epoch, which is really much better.

Edit 2: after around 4k steps I have a loss of 0.95 (mel_postnet_loss of 0.4 and gate_loss of 0.11), so attention is not yet learned and the audio result is bad, but it's only 4k steps and the loss keeps decreasing every 50 steps, so if it drops below 0.9 today it could be really interesting after a few days of training! Just one very strange thing: I get an OOM error after more than 1 epoch and I don't understand why, because I don't cache data and it already completed 1 epoch, so there is no "new bigger length" in my data (it has already trained on all the data) x) But no problem, I save the model every hour.

Ananas120 commented 4 years ago

@blue-fish have you continued to train your model or not? And if yes, do you have better results?

My loss is currently at 0.95 at the beginning of the epoch, but the mean loss of the last epoch was 0.89, so for tomorrow (after 4k more steps):

I hope it will decrease below 0.7; that would be really good!