Closed NileZhou closed 5 years ago
I don't know your batch size. My suggestion is to run at least 100 epochs. Training for 100 epochs lets the model generate reasonable speech, but for the 512-channel version you need more than 500 epochs to get high-quality audio.
I fine-tuned the released LJ speech model on a Mandarin dataset. I used a Titan V with fp16 and a batch size of 8. It only took a few hours for the model to produce good quality speech in Mandarin. The loss was around -4.5~-5.2 when it converged.
@lingjzhu can you share a few samples?
@rafaelvalle Sure. I do not have access to those audio files right now, but I will share them next week.
@rafaelvalle
Hi. Here are some samples. They sound really good. Thanks for your great work!
Trained for one day. 46k.zip
Trained for three days. 150k.zip
With tacotron2. tacotron2_and_waveglow.zip
These sound pretty good! Do you have a Tacotron2 implementation for Mandarin that you can share with our community?
Yes. Actually, I made use of your Tacotron2 repo in implementing the Mandarin model (thanks again!), but it is still at an experimental stage. I will share the code and all the pretrained models in a few months.
Closing due to inactivity.
@lingjzhu Hi, I trained on a single-speaker open-source Mandarin speech synthesis dataset from scratch; the total duration is about 12 hours. The data config is as follows: data_config: "segment_length": 16000, "sampling_rate": 16000, "filter_length": 743, "hop_length": 185, "win_length": 743, "mel_fmin": 0.0, "mel_fmax": 8000.0
and the model config is as follows: "n_mel_channels": 80, "n_flows": 12, "n_group": 8, "n_early_every": 4, "n_early_size": 2, "WN_config": "n_layers": 8, "n_channels": 256, "kernel_size": 3
The loss curve looks pretty good, but when I run inference (calling inference.py), even when I feed in speech from the training data, the generated audio is pretty bad and sounds like white noise. SIGMA is set to 1 for training, and I do not change it at inference.
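For what it's worth, WaveGlow generates audio from a latent z drawn from a zero-mean Gaussian, and while training assumes sigma = 1.0, the repo's inference examples typically use a smaller sigma (around 0.6) to reduce noise in the output. A minimal, framework-free sketch of that sampling step (the function name here is my own, not from the repo):

```python
import random

def sample_latent(n, sigma=0.6):
    """Draw the latent z ~ N(0, sigma^2 I) that WaveGlow maps to audio.

    Training assumes sigma = 1.0, but inference usually uses a smaller
    value (e.g. 0.6) to trade sample diversity for less audible noise.
    """
    return [random.gauss(0.0, sigma) for _ in range(n)]
```

If the output is pure white noise even on training-set mels, the more likely culprit is a mismatch between the mel-extraction parameters used for training and inference, but lowering sigma is a cheap first check.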
I've solved it ^_^
@shawnthu how did you solve it?
@shawnthu how did you solve your issue?
@lingjzhu For these examples, did you train your Mandarin model with Tacotron2 from scratch (using NVIDIA's tacotron2), or did you fine-tune the pretrained LJSpeech Taco2 model? If from scratch, what were some of the training specs (e.g. the number of iterations before it converged)?
@ricktjwong I trained the Mandarin model with the pretrained LJSpeech Taco2 model, which greatly speeds up the convergence. Those samples were produced by a model after a few thousand iterations. The attention plot became diagonal within a thousand iterations.
I tried to train from scratch but the model did not converge well even after 40k iterations.
You can find training details, code, training data, pre-trained models and demos in this repo: https://github.com/lingjzhu/probing-TTS-models
In fact, I recently finished two projects about Mandarin TTS:
@lingjzhu Thanks, I appreciate the reply! @shawnthu Those samples sound good. What was your procedure for creating the two voices in the link? Do you have details on the voice cloning too? Thanks!
I trained WaveGlow from scratch on a Hindi dataset for 30K iterations, and the loss seemed to converge at -5.5~-6.0. At inference I can make out the words and everything, but it's just not the voice of my speaker. Inference sigma = 0.6. https://drive.google.com/file/d/1DfUyef6XH8HEF-bJPddoLBs2vZnPZ3FA/view?usp=sharing
Please let me know whether it is premature stopping or the sigma value that is causing these outputs.
@ricktjwong Sorry, I can't share the details because it's commercial.
I am just asking you to listen to my sample and tell me whether I need to train more or change sigma value
@AnkurDebnath35 I'm sorry, I cannot open your link because it's blocked. @lingjzhu In fact, training Tacotron from scratch also works, and it did not seem to take too much training time for me. Of course, a warm start can significantly speed up training.
Let me know if you can access it now https://soundcloud.com/ankur-debnath-9482155/md01-002wav_synthesis-wav
@AnkurDebnath35 Yes, I heard it. It's quite bad. Is the input a ground-truth mel spectrogram, or the output of Tacotron? I think you should not change the default parameters and should try training again. For example, the default sampling rate, hop length, and window length are important; you'd better not change them. Good luck!
Everything is at default and consistent with my audio. Yes, the inputs are mel spectrograms of my test set generated with mel2samp.py, not Tacotron outputs; I am training WaveGlow first. Do you think it is under-trained, or could it be a normalization issue?
Would you please show your loss curve here?
@cjerry1243 @deepseek Have you ever tried the default config?
Sorry, I only have the curve up to 18K iterations.
And to my surprise: the last sample I posted came from a model trained from scratch, and even at 30K iterations the result was poor. But last night I warm-started the model; just listen to the same sample at only 2K iterations:
https://soundcloud.com/ankur-debnath-9482155/md01-002wav-synthesis
That just shows the pretrained model is quite helpful. After all, it was trained on a large dataset, so it generalizes well.
In your opinion, should I stop or train further? It has trained up to 4K iterations by now, and the loss is hovering near -6.7. The audio at 2K iterations was close to the ground truth, though there is some noise in the background, that's it.
I think that if fine-tuning the model reaches your goal, there is no need to train from scratch. So I suggest you only fine-tune the model with a small learning rate for a few thousand iterations.
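To fine-tune with this repo, the usual approach is to point the training config at the pretrained weights and lower the learning rate. A sketch of the relevant train_config fragment, assuming the key names from the repo's config.json (the checkpoint filename and output directory here are just examples):

```json
"train_config": {
    "output_directory": "checkpoints_finetune",
    "learning_rate": 1e-5,
    "sigma": 1.0,
    "iters_per_checkpoint": 1000,
    "batch_size": 8,
    "checkpoint_path": "waveglow_256channels.pt"
}
```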
Exactly! It is already training now; I will wait up to maybe 10K iterations, or stop whenever the results seem satisfactory. Thanks a lot @shawnthu
My pleasure
Here is the training loss at around 9K iterations. There is not much improvement in audio quality compared to 2K iterations, though. Should I stop? Everyone here seems to train up to 50-60K iterations or even more.
Too many iterations may hurt the model, because it may overfit. I think you should evaluate on a dev set rather than only watch the training loss.
@AnkurDebnath35
Yeah, but all the samples are unseen examples for the model, so it has not overfitted yet, though it surely can. I do have a validation set, but I can't find any code in this repo for validation. Can you point me to it?
A simple way: split the raw dataset into train and dev sets, then compute the loss on dev after every epoch.
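Once you log a per-epoch dev loss (run the same criterion over the dev split with the model in eval mode and gradients disabled), a simple stopping rule answers the "should I train further?" question. A framework-free sketch, assuming lower (more negative) loss is better, as with WaveGlow's negative log-likelihood; the function name is my own:

```python
def should_stop(dev_losses, patience=3):
    """Early stopping on a list of per-epoch dev losses (lower is better).

    Returns True once the dev loss has failed to improve on the best
    previously seen value for `patience` consecutive epochs.
    """
    if len(dev_losses) <= patience:
        return False
    best_before = min(dev_losses[:-patience])
    return min(dev_losses[-patience:]) >= best_before
```

For example, a run whose dev loss is still falling keeps going, while one that has plateaued for `patience` epochs stops.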
That I already have; it's just a question of where to modify train.py to accommodate val_loss.
Can someone help me? I want to cite this repository in my paper, but I don't know what to cite it against.
@AnkurDebnath35 Just cite our paper. And please, next time, do not ask questions that are not related to the thread.
@inproceedings{prenger2019waveglow,
  title={Waveglow: A flow-based generative network for speech synthesis},
  author={Prenger, Ryan and Valle, Rafael and Catanzaro, Bryan},
  booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={3617--3621},
  year={2019},
  organization={IEEE}
}
train.py prints:
46739: -4.754265308
46740: -5.550816059
46741: -4.253830433
46742: -5.338192463
46743: -4.700691700
46744: -5.625311375
46745: -5.753829479
46746: -5.032420158
......
At what value of the second number (the loss) is the checkpoint usable?
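There is no single loss value at which a checkpoint becomes usable; as the comments above show, people report usable audio anywhere from roughly -5 to -6.7 depending on data and warm starting, and the per-iteration numbers bounce around a lot. It helps to look at a smoothed curve (and to actually listen to generated samples). A minimal trailing-moving-average sketch over the printed per-iteration losses:

```python
def smooth(losses, window=100):
    """Trailing moving average of per-iteration training losses.

    Early entries average over fewer points until the window fills up,
    so the output has the same length as the input.
    """
    out = []
    for i in range(len(losses)):
        lo = max(0, i - window + 1)
        out.append(sum(losses[lo:i + 1]) / (i + 1 - lo))
    return out
```

Watching the smoothed curve flatten (and comparing samples at successive checkpoints) is a more reliable stopping signal than any fixed threshold.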