Closed — hadaev8 closed this issue 5 years ago
Make sure the model has learned attention during training. We have found that it is easier for the model to learn attention if silence is trimmed from the beginning and end of audio files.
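Tools like librosa.effects.trim are commonly used for this; as an illustration of the idea, here is a minimal numpy-only sketch (the threshold and frame_length values are arbitrary illustrative choices, not values from this repo):

```python
import numpy as np

def trim_silence(wav, threshold=0.01, frame_length=512):
    """Trim leading/trailing frames whose peak amplitude is below threshold."""
    if len(wav) < frame_length:
        return wav
    # Split the waveform into fixed-size frames and mark "loud" ones.
    n_frames = len(wav) // frame_length
    frames = wav[:n_frames * frame_length].reshape(n_frames, frame_length)
    loud = np.abs(frames).max(axis=1) > threshold
    if not loud.any():
        return wav
    # Keep everything between the first and last loud frame.
    first = np.argmax(loud)
    last = n_frames - np.argmax(loud[::-1])
    return wav[first * frame_length:last * frame_length]
```

In practice you would run this (or librosa's band-aware equivalent) over every file in the dataset once, before training starts.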
I listened to the files, and there is no (or very little) silence at the beginning or end in the dataset. I will keep in mind that it is better to remove silence, but it's probably not the best idea to change the dataset during training (or is it ok? A bit scary to ruin everything, since I'm using a Tesla K80 and everything takes a lot of time). So how long does it take on LJ Speech? My val loss decreases by 0.2 per 1k steps, so it is training, just slower than I expected from the 500 iterations in the default hyperparameter file.
Also, do I need to train Tacotron before WaveGlow? It seems strange to me that I can start WaveGlow training with only audio files, without spectrograms from Tacotron.
How big is your dataset?
I'm using the first male Russian speaker from here: http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/ — 20 hours total.
It is ok to train Tacotron before glow. Can you share your current tensorboard with alignments? If the model doesn't learn the alignments you can't perform inference.
I mean, can I train Tacotron and WaveGlow at the same time? According to the readme, WaveGlow doesn't seem to need a Tacotron model for training, which seems a bit strange.
My logs were lost when the Google Colab VM crashed, but I ran Tacotron from the checkpoint against the test set, and it seems hopeless. https://colab.research.google.com/drive/1eGEXdrgioRs-7R0WOEx2_fHmDC-zDmzk#scrollTo=5nJcq9nHijEH&line=7&uniqifier=1
I can confirm that silence removal helps a lot to reduce loss.
Also, here are the logs: https://drive.google.com/open?id=1ugef68POLQXEO1ERSfW6qL_T0YEVbOIE — not sure how to interpret the distributions and histograms.
Just share an image from tensorboard that shows the alignments and mels
Val loss http://puu.sh/CiAio/87de1bd8c1.png Alignment http://puu.sh/CiAkW/eb0d821d1e.png Mels http://puu.sh/CiAlV/ad0f4f39af.png
I probably should have thought about this earlier, but do I need to put the Russian symbols into text/symbols.py?
Yes! Get the set of characters in the alphabet and modify symbols.py in the text/ folder.
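In this repo, text/symbols.py defines the character set the embedding layer is built over. A minimal Cyrillic variant might look like the following sketch (the exact pad and punctuation conventions should follow the original file; this is an illustration, not the repo's actual content):

```python
# text/symbols.py -- sketch of a Cyrillic symbol set
_pad = '_'
_punctuation = '!\'(),.:;? '
_cyrillic = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'

# The set of symbols used for the character embedding.
symbols = [_pad] + list(_punctuation) + list(_cyrillic)
```

The size of this list determines the input dimension of the embedding table, so it must match between training and inference.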
As I understand it, there is not much difference, because one of the cleaner functions transliterates characters to Latin. For now I'm trying Cyrillic characters in the dictionary with the basic_cleaners parameter.
I still don't understand how many iterations have to pass before I can be sure this isn't working. When training on the LJSpeech dataset, did you reduce the learning rate during training?
Train until you see a yellow-ish diagonal line on the attention plots.
But I am tormented by doubts that my dataset needs different hyperparameters. How long does it take to be sure that the model cannot converge? If the val loss starts to increase, does that mean everything broke down?
It's me again. Step 4700. It seems to overfit: http://puu.sh/Cl8ls/e5bb692522.png Attention doesn't work: http://puu.sh/Cl8m4/101a086539.jpg Mels are bad too: http://puu.sh/Cl8n1/07e406388b.jpg
Any tips?
@hadaev8 Could you share your audio data and a melspectrogram of it? I would like to check your preprocessing.
Preprocessing is very important when training a text-to-melspectrogram model, e.g. trimming silence from the audio. Silent segments in the training dataset hinder convergence.
However, when the model is trained on trimmed audio, it may fail to predict the audio length at inference time.
To prevent this problem, add a small amount of silence (for example 3 or 4 times hop_length) at the end of each audio file.
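The padding step described above is a one-liner; a sketch, assuming hop_length=256 as in this repo's default hparams:

```python
import numpy as np

def pad_end_silence(wav, hop_length=256, n_hops=4):
    """Append a short stretch of silence so the model can learn when to stop."""
    return np.concatenate([wav, np.zeros(n_hops * hop_length, dtype=wav.dtype)])
```

This is applied after silence trimming, so every training clip ends with the same small, predictable amount of silence.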
This is my default dataset https://drive.google.com/open?id=1FCwMelicl_dwE8Y5MwuGw7Pd4zDY-OVp This is dataset with trimmed silence https://drive.google.com/open?id=13Mkp_6Sm_oj8jQoUvnYpxxZw6yXvcJpk
The code I used for preprocessing: https://pastebin.com/4MzD1x6C Spectrogram, predicted wavs, and wavs from the dataset: https://drive.google.com/open?id=1e7Plu2x2WS8_k2yWa_ZJ_jWXGUU_oz01
I see the predicted wav has 2 extra seconds; does that mean I have this problem?
@hadaev8 Maybe, yes, but it looks like your model fails to learn the attention values.
Check my preprocessing code and the 80k-iteration model. My dataset has 13k Korean samples. https://github.com/Yeongtae/tacotron2/blob/master/preprocess_audio.py
@Yeongtae At what iteration does the attention start looking ok? What loss do you have at the end? What learning rate do you use?
Maybe around 30 epoch.
Huh, it's better with your preprocessing (by test loss), but the attention layer is still bad. Do I need upper-case symbols in the dict? I see it auto-lowercases in basic_cleaners. Or does the dict length not matter?
@hadaev8 You must construct a char-embedding function for your own language. Here is an example: https://github.com/Yeongtae/tacotron2/blob/master/text/__init__.py
In my opinion, it's better to debug your char-embedding function.
Russian is very similar to Latin-script languages, so I'm just using basic_cleaners with the new alphabet symbols. Everything seems ok: http://puu.sh/CtisY/131115d283.png
Also, here is my current attention at 400 epochs; I take it it doesn't work, right? http://puu.sh/CtiIa/59a17c577c.jpg
Also, I find the CMUDict symbols greatly increase the number of symbols in the hyperparameter file. I'm not sure whether this parameter affects learning, or what CMUDict is, but I will try training without it now.
What's the size of your dataset? Alignment won't be learnt unless the corpus is large (~LJSpeech).
20 hours; should be ok, I think.
Could you share the dimension of your char embedding?
This, right? http://puu.sh/CtkEZ/867200fd25.png Without cmudict it is 45 https://github.com/hadaev8/tacotron2/blob/master/text/symbols.py
It may be helpful to reduce these parameters, e.g. by half, because your char embedding dimension is so small. In my case it is 80.
In addition, why do you use sr=16000?
I will try reducing it. Yes, it is 16000; that's the sample rate of the dataset. Is it important? I can resample it, of course.
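Resampling is normally done with librosa or sox; purely as an illustration of what resampling to a new rate (e.g. the 22050 Hz used by LJSpeech) means, here is a naive linear-interpolation sketch with numpy (a proper resampler uses band-limited filtering to avoid aliasing, so don't use this for real preprocessing):

```python
import numpy as np

def resample_linear(wav, orig_sr, target_sr):
    """Naive linear-interpolation resampling; prefer librosa/sox in practice."""
    duration = len(wav) / orig_sr
    n_out = int(round(duration * target_sr))
    # Sample positions in the original signal for each output sample.
    t_out = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(t_out, np.arange(len(wav)), wav)
```

Whatever rate you choose, the hparams (sampling_rate, filter/hop lengths, mel frequency range) must be consistent with it.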
Attention after 200 epochs; doesn't work, right? http://puu.sh/CtxAV/3f558e9b8f.png
Yes. I recommend you set the encoder embedding dim to 128 and keep the other params at their defaults.
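The suggestion above amounts to editing two fields in hparams.py; a sketch (field names as in the NVIDIA tacotron2 hparams, where both default to 512 — the 128 values here are the suggestion, not the defaults):

```python
# hparams.py -- the fields in question (repo defaults are 512)
symbols_embedding_dim = 128   # size of each character embedding vector
encoder_embedding_dim = 128   # output dim of the encoder
```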
Should I reduce decoder_rnn_dim=1024?
Not yet
128 and 128 for encoder and decoder dim, batch size 32, rest default. Still seems broken: http://puu.sh/CtIm8/5bbe8d1c05.png
The decoder dim affects the quality of the output melspectrogram. In my opinion, it's better to use the default params for the decoder dim.
In any case, you need to train more; keep going.
I wrote it wrong — I meant embedding and encoder dim; the decoder is default. Also now trying with lr 1e-4. I have no more ideas (except that I need 10k+ iterations).
@Yeongtae could you post your tensorboard logs? I want to see what it looks like when things work.
@hadaev8 Could you get better results?
Mm, no. Now I'm trying with the default sample rate, but it doesn't seem to converge, at least at 200 epochs: http://puu.sh/Cv4LF/18cc92c80b.png
If you don't see diagonal attention by around epoch 30, stop training.
I'm probably being dense. Only now, after all these days, did I think to check the file lists, and I found I had mixed up the file names: 9k samples were in the validation file instead of the train file. That should explain everything.
In the studies related to Tacotron, phoneme embeddings make the model easier to converge. I recommend them to you. Using a lexicon model, build your own phoneme embedding function.
I can report it doesn't work with either 128 or the default 512 embedding/encoder dim. This is how it looks now at epoch 30 for 512: http://puu.sh/Cz6hJ/f6a4ef05f8.png
Do I need phoneme data for a lexicon model?
@hadaev8 Any success?
No. Attention looks like this after 50 iterations: http://puu.sh/CExPx/83a39b7259.png
Maybe I need more, idk.
@hadaev8 Normalizing the values of the melspectrogram can help the model train, because the original values are skewed in the negative direction.
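The normalization referred to here (following the style of Rayhane-mamah's Tacotron-2) maps the log-mel range, roughly [min_level_db, 0], into a symmetric interval. A numpy sketch — the constants min_level_db=-100 and max_abs=4 are common choices in that codebase, assumed here rather than verified against the linked files:

```python
import numpy as np

MIN_LEVEL_DB = -100.0  # assumed lower bound of the log-mel values
MAX_ABS = 4.0          # assumed half-width of the normalized range

def normalize_mel(mel_db):
    """Map dB values in [MIN_LEVEL_DB, 0] to [-MAX_ABS, MAX_ABS]."""
    x = np.clip((mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0.0, 1.0)
    return 2 * MAX_ABS * x - MAX_ABS

def denormalize_mel(mel_norm):
    """Invert normalize_mel so the vocoder sees the original dB scale."""
    x = (np.clip(mel_norm, -MAX_ABS, MAX_ABS) + MAX_ABS) / (2 * MAX_ABS)
    return x * -MIN_LEVEL_DB + MIN_LEVEL_DB
```

If you train the acoustic model on normalized mels, remember to denormalize (or train the vocoder on the same normalized scale) at inference time.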
Here is my implementation of the normalization: https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/layers.py https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/audio_processing.py https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/inference.py I referred to the code of https://github.com/Rayhane-mamah/Tacotron-2
@Yeongtae Have you tried to reproduce the original results? 33 iterations with the default dataset and code, and attention still doesn't work.
It seems to work after 45 epochs.
Training on the Russian dataset. The output says I have less than 0.2 loss, and the default is 500 epochs. Now I'm at epoch 1333 and still get "Warning! Reached max decoder steps". Should I keep going, or is it screwed up? http://puu.sh/CgXpt/41886048cd.jpg