NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

How many iterations do I need? #111

Closed hadaev8 closed 5 years ago

hadaev8 commented 5 years ago

I'm training on a Russian dataset. The output says I have less than 0.2 loss, and the default is 500 epochs. Now I'm on epoch 1333 and still get "Warning! Reached max decoder steps". Should I keep going, or is it screwed up? http://puu.sh/CgXpt/41886048cd.jpg

rafaelvalle commented 5 years ago

Make sure the model has learned attention during training. We have found that it is easier for the model to learn attention if silence is trimmed from the beginning and end of audio files.
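
A minimal sketch of such a trimming step, using librosa and soundfile (tooling and the top_db threshold are assumptions, not part of this repo; tune top_db per dataset):

```python
import librosa
import soundfile as sf

def trim_silence(in_path, out_path, top_db=35.0):
    """Remove leading/trailing silence and rewrite the file."""
    audio, sr = librosa.load(in_path, sr=None)  # keep the native sample rate
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    sf.write(out_path, trimmed, sr)
```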

hadaev8 commented 5 years ago

I listened to the files; there is no (or very little) silence at the beginning or end in the dataset. I will keep in mind that it's better to remove silence, but it's probably not the best idea to change the dataset during training (or is it OK? A bit scary to ruin everything, because I'm using a Tesla K80 and it all takes a lot of time). So how long does it take for LJ Speech? My val loss decreases by 0.2 per 1k steps, so it trains, but slower than I expected from the 500 iterations in the default hyperparams file.

Also, do I need to train Tacotron before WaveGlow? It seems strange to me that I can run WaveGlow training with only audio files, without spectrograms from Tacotron.

rafaelvalle commented 5 years ago

How big is your dataset?

hadaev8 commented 5 years ago

I'm using the first male Russian speaker from here: http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/ (20 hours total).

rafaelvalle commented 5 years ago

It is OK to train Tacotron before WaveGlow. Can you share your current TensorBoard with alignments? If the model doesn't learn the alignments, you can't perform inference.

hadaev8 commented 5 years ago

I mean, can I train Tacotron and WaveGlow at the same time? From the README, it seems WaveGlow doesn't need a Tacotron model for training, which seems a bit strange to me.

My logs were lost when the Google Colab VM crashed, but I ran Tacotron from the checkpoint against the test set; it seems hopeless. https://colab.research.google.com/drive/1eGEXdrgioRs-7R0WOEx2_fHmDC-zDmzk#scrollTo=5nJcq9nHijEH&line=7&uniqifier=1

hadaev8 commented 5 years ago

I can confirm that silence removal helps a lot in reducing the loss.

hadaev8 commented 5 years ago

Also, here are the logs: https://drive.google.com/open?id=1ugef68POLQXEO1ERSfW6qL_T0YEVbOIE I'm not sure how to interpret the distributions and histograms.

rafaelvalle commented 5 years ago

Just share an image from TensorBoard that shows the alignments and mels.

hadaev8 commented 5 years ago

Val loss: http://puu.sh/CiAio/87de1bd8c1.png
Alignment: http://puu.sh/CiAkW/eb0d821d1e.png
Mels: http://puu.sh/CiAlV/ad0f4f39af.png

hadaev8 commented 5 years ago

I probably should have thought about this before, but do I need to put Russian symbols into text/symbols.py?

rafaelvalle commented 5 years ago

Yes! Get the set of characters in the alphabet and modify symbols.py in the text/ folder.
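
A sketch of what that modification might look like for Cyrillic (the structure mirrors the stock text/symbols.py; the exact character set is the part you'd adapt):

```python
# text/symbols.py, adapted for a Cyrillic alphabet
_pad = '_'
_punctuation = '!\'(),.:;? '
_special = '-'
_letters = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'

# Export all symbols; the ARPAbet/CMUDict entries from the stock file
# are dropped here, since they only apply to English.
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters)
```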

hadaev8 commented 5 years ago

As I understand it, there is not much difference, because one of the cleaner functions transliterates characters to Latin. Now I'm trying Cyrillic characters in the dictionary with basic_cleaners.

I still don't understand how many iterations have to pass before I can be sure this doesn't work. When training on the LJSpeech dataset, did you reduce the learning rate during training?

rafaelvalle commented 5 years ago

Train until you see a yellow-ish diagonal line on the attention plots.

hadaev8 commented 5 years ago

But I'm tormented by doubts that my dataset needs different hyperparameters. How long does it take to be sure the model cannot converge? If the val loss starts to increase, does that mean everything broke down?

hadaev8 commented 5 years ago

It's me again. Step 4700. It seems to overfit: http://puu.sh/Cl8ls/e5bb692522.png Attention doesn't work: http://puu.sh/Cl8m4/101a086539.jpg Mels are bad too: http://puu.sh/Cl8n1/07e406388b.jpg

Any tips?

Yeongtae commented 5 years ago

@hadaev8 Could you share an audio file and the mel spectrogram of that audio? I would like to check your preprocessing.

Preprocessing is very important for learning a text-to-mel-spectrogram model, e.g. trimming silence from the audio. Silent stretches in the training dataset hinder model convergence.

On the other hand, when the model is trained on trimmed audio, it may fail to predict the audio length.

To prevent this problem, add a small amount of silence (such as 3 or 4 times hop_length) at the end of each audio file.
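
A sketch of that suggestion, again assuming librosa/soundfile and using the repo's default hop_length of 256 (top_db remains a tunable assumption):

```python
import numpy as np
import librosa
import soundfile as sf

def trim_and_pad(in_path, out_path, hop_length=256, n_tail_hops=4, top_db=35.0):
    """Trim leading/trailing silence, then append a short silent tail
    (a few hop lengths) so the model can still learn when to stop."""
    audio, sr = librosa.load(in_path, sr=None)
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    tail = np.zeros(n_tail_hops * hop_length, dtype=trimmed.dtype)
    sf.write(out_path, np.concatenate([trimmed, tail]), sr)
```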

hadaev8 commented 5 years ago

This is my default dataset: https://drive.google.com/open?id=1FCwMelicl_dwE8Y5MwuGw7Pd4zDY-OVp This is the dataset with trimmed silence: https://drive.google.com/open?id=13Mkp_6Sm_oj8jQoUvnYpxxZw6yXvcJpk

The code I used for preprocessing: https://pastebin.com/4MzD1x6C Spectrogram, predicted wavs, and wavs from the dataset: https://drive.google.com/open?id=1e7Plu2x2WS8_k2yWa_ZJ_jWXGUU_oz01

I see the predicted wav has 2 extra seconds; does that mean I have this problem?

Yeongtae commented 5 years ago

@hadaev8 Maybe, yes, but it looks like your model fails to learn the attention values.

Check my preprocessing code and my 80k-iteration model. My dataset has 13k Korean samples. https://github.com/Yeongtae/tacotron2/blob/master/preprocess_audio.py

hadaev8 commented 5 years ago

@Yeongtae At what iteration did its attention start looking OK? What loss did you have at the end? What learning rate did you use?

Yeongtae commented 5 years ago

Maybe around epoch 30.

hadaev8 commented 5 years ago

Uh, it's better with your preprocessing (by test loss), but the attention is still bad. Do I need uppercase symbols in the dictionary? I see text is automatically lowercased in basic_cleaners. Or does the dictionary length not matter?

Yeongtae commented 5 years ago

@hadaev8 You must construct a char-embedding function for your own language. Here is an example: https://github.com/Yeongtae/tacotron2/blob/master/text/__init__.py

In my opinion, it's better to debug your char-embedding function.
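
One minimal way to do that debugging, assuming the stock text/ module layout of this repo (the Cyrillic pangram is just a test string):

```python
from text import text_to_sequence, sequence_to_text
from text.symbols import symbols

sentence = "съешь же ещё этих мягких французских булок"
seq = text_to_sequence(sentence, ['basic_cleaners'])

print(len(symbols))            # total vocabulary size the model will see
print(seq)                     # check that no character was silently dropped
print(sequence_to_text(seq))   # should round-trip to the cleaned input
```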

hadaev8 commented 5 years ago

Russian is very similar to the Latin-alphabet languages, so I'm just using basic_cleaners with the new alphabet symbols. Everything seems OK: http://puu.sh/CtisY/131115d283.png

Also, here is my current attention at epoch 400; I can conclude it doesn't work, right? http://puu.sh/CtiIa/59a17c577c.jpg

Also, I find that the CMUDict symbols greatly increase the number of symbols in the hyperparams file. I'm not sure whether this parameter affects learning, or what CMUDict is, but I will try to start training without it now.

pravn commented 5 years ago

What's the size of your dataset? Alignment won't be learnt unless the corpus is large (~LJSpeech).

hadaev8 commented 5 years ago

20 hours; it should be OK, I think.

Yeongtae commented 5 years ago

Could you share your char-embedding dimension?

hadaev8 commented 5 years ago

This, right? http://puu.sh/CtkEZ/867200fd25.png Without CMUDict it is 45: https://github.com/hadaev8/tacotron2/blob/master/text/symbols.py

Yeongtae commented 5 years ago

It may be helpful to reduce these parameters, e.g. by half, because your char-embedding dimension is so small. In my case, it's 80.

Yeongtae commented 5 years ago

In addition, why do you use sr=16000?

hadaev8 commented 5 years ago

I will try to reduce it. Yes, it is 16000; that's the sample rate of the dataset. Is it important? I can resample it, of course.

hadaev8 commented 5 years ago

Attention after 200 epochs; it doesn't work, right? http://puu.sh/CtxAV/3f558e9b8f.png

Yeongtae commented 5 years ago

Yes. I recommend you set the encoder embedding dim to 128 and keep the other params at their defaults.

hadaev8 commented 5 years ago

Should I reduce decoder_rnn_dim=1024?

Yeongtae commented 5 years ago

Not yet
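
For reference, a sketch of trying these overrides through the repo's create_hparams() (the parameter names match the stock hparams.py; 128 is the experimental value being suggested, 512 the default):

```python
from hparams import create_hparams

# Override the embedding dims via the comma-separated name=value string
# that create_hparams() / train.py's --hparams flag accepts.
hparams = create_hparams("symbols_embedding_dim=128,encoder_embedding_dim=128")

print(hparams.encoder_embedding_dim)  # 128 instead of the default 512
print(hparams.decoder_rnn_dim)        # stays at its default, 1024
```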

hadaev8 commented 5 years ago

Encoder and decoder dims at 128 and 128, batch size 32, the rest default. It still seems broken: http://puu.sh/CtIm8/5bbe8d1c05.png

Yeongtae commented 5 years ago

The decoder dim affects the quality of the output mel spectrogram. In my opinion, it's better to use the default params for the decoder dim.

Anyway, you need to train more; be patient.

hadaev8 commented 5 years ago

I wrote that wrong; I meant the embedding and encoder dims. The decoder is at its default. I'm also now trying a 1e-4 learning rate. I have no more ideas (except that I need 10k+ iterations).

hadaev8 commented 5 years ago

@Yeongtae could you post your TensorBoard logs? I want to see how things look when they work.

Yeongtae commented 5 years ago

@hadaev8 Did you get better results?

hadaev8 commented 5 years ago

Mm, no. Now I'm trying with the default sample rate, but it doesn't seem to converge, at least at epoch 200: http://puu.sh/Cv4LF/18cc92c80b.png

Yeongtae commented 5 years ago

If you don't see diagonal attention around epoch 30, stop training.

hadaev8 commented 5 years ago

I'm probably an idiot. Only now, after all these days, did I come up with the idea of checking the file lists, and I found I had mixed up the file names: 9k samples were in the validation file instead of the train file. That should explain everything.

Yeongtae commented 5 years ago

In studies related to Tacotron, phoneme embeddings make the model easier to converge. I recommend them to you. Using a lexicon model, make your own phoneme-embedding function.
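
A toy sketch of the lexicon idea (the LEXICON entries and phone strings below are invented for illustration; a real setup would use a full pronunciation dictionary or a G2P model):

```python
# Map words to phoneme lists via a lexicon, falling back to raw
# characters for out-of-vocabulary words.
LEXICON = {
    "привет": ["p'", "r'", "i", "v'", "e", "t"],
}

def words_to_phonemes(text):
    phones = []
    for word in text.lower().split():
        phones.extend(LEXICON.get(word, list(word)))
    return phones
```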

hadaev8 commented 5 years ago

I can report that it doesn't work with either 128 or the default 512 embedding/encoder dims. Here is how it looks now at epoch 30 for 512: http://puu.sh/Cz6hJ/f6a4ef05f8.png

Do I need phoneme data for a lexicon model?

dnnnew commented 5 years ago

@hadaev8 Any success?

hadaev8 commented 5 years ago

No, attention looks like this after 50 iterations: http://puu.sh/CExPx/83a39b7259.png

Maybe I need more, I don't know.

Yeongtae commented 5 years ago

@hadaev8 Normalizing the mel-spectrogram values can help train the model, because the original values are skewed in the negative direction.

Here is my implementation of the normalization: https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/layers.py https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/audio_processing.py https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/inference.py I referred to the code from https://github.com/Rayhane-mamah/Tacotron-2
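
A sketch of the kind of symmetric normalization used in Rayhane-mamah's Tacotron-2 (min_level_db=-100 and max_abs_value=4 are that repo's defaults, used here as assumptions):

```python
import numpy as np

MIN_LEVEL_DB = -100.0
MAX_ABS_VALUE = 4.0

def normalize_mel(mel_db):
    """Rescale dB mels from [MIN_LEVEL_DB, 0] to [-MAX_ABS_VALUE, MAX_ABS_VALUE]."""
    scaled = 2 * MAX_ABS_VALUE * ((mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB) - MAX_ABS_VALUE
    return np.clip(scaled, -MAX_ABS_VALUE, MAX_ABS_VALUE)

def denormalize_mel(mel_norm):
    """Invert normalize_mel back to dB, for inference/visualization."""
    clipped = np.clip(mel_norm, -MAX_ABS_VALUE, MAX_ABS_VALUE)
    return ((clipped + MAX_ABS_VALUE) * -MIN_LEVEL_DB / (2 * MAX_ABS_VALUE)) + MIN_LEVEL_DB
```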

hadaev8 commented 5 years ago

@Yeongtae Have you tried to reproduce the original results? I'm 33 iterations in with the default dataset and code, and attention still doesn't work.

hadaev8 commented 5 years ago

It seems to work after epoch 45.