Closed: EuphoriaCelestial closed this issue 4 years ago
Yes, you can run inference in FP16 regardless of how you trained. FP16 should improve inference speed as long as your GPU has tensor cores.
What about parameter tuning? That is my main concern. Since I am already using fp16, I can re-train the model if needed; all I need is speed.
Hi, I have trained a tacotron2 model on an 8 kHz dataset and the inference speed is amazing: for a 10-second synthesis, the average inference time is about 0.5 seconds on a Tesla P100. What I wonder is whether fp16_run is really effective. I have run tests on this parameter and inference speed does not seem to be affected. Thanks.
Well, as far as I've experienced, it's mainly because your sampling rate is only 8k. I trained on a 22050 Hz dataset, fp16, default params, and an 11-second synthesis (the max output audio length with the default decode length) took more than 2 seconds on an RTX 2080 Ti.
@fatihkiralioglu can you kindly send me your hparams.py file? I will downsample my dataset to train on 8 kHz too. All I need is speed.
I have added hparams.zip as an attachment.
Thank you! Your audio files are 16-bit PCM, right? Is there any noise or echo in the inference results?
Yes, the audio files are 16-bit PCM.
What about the other configurations you mentioned here? What was the result? I thought more params than just mel_fmax need to be tuned to work with a different sampling rate: https://github.com/NVIDIA/tacotron2/issues/359#event-3392003764
Yes, I used exactly those parameters; it is the same config as in the attachment. Modifying win_length and hop_length may give better results; I may also try them in the future.
@EuphoriaCelestial by the way, just to make sure: in the tacotron2 project, there is no support for multi-GPU inference, right? There is support for multi-GPU training, but I could not find anything indicating multi-GPU inference. Thanks.
Yeah, but you can do multi-GPU inference with some simple threading code. Or you can just load one model on each GPU; actually, depending on your model size and the GPU's VRAM, you can load more than one. In my case I can load 2 models on each RTX 2080 Ti.
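A minimal sketch of that threading approach. Everything here is illustrative, not from the tacotron2 repo: `Worker`, `synthesize`, and the device strings are placeholders, and the actual model loading is left as a comment.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

class Worker:
    """One synthesis worker per GPU (or per loaded model copy)."""
    def __init__(self, device):
        self.device = device
        # In real code you would load a model copy onto this device, e.g.:
        # self.model = load_tacotron2_checkpoint().to(device)

    def synthesize(self, text):
        # Placeholder for actual model inference on self.device.
        return f"[{self.device}] {text}"

# Two GPUs -> two workers; load more workers per GPU if VRAM allows.
workers = [Worker(f"cuda:{i}") for i in range(2)]

def dispatch(texts):
    # Pair each request with a worker round-robin, then run them in parallel.
    jobs = list(zip(texts, itertools.cycle(workers)))
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        return list(pool.map(lambda job: job[1].synthesize(job[0]), jobs))
```

Since `pool.map` preserves input order, the results line up with the request list even though the workers run concurrently.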
@fatihkiralioglu I have trained a new model on an 8k dataset using your params; the result after 80k steps is kind of strange, hear this: 80000.zip
Note that the voice in my dataset is female and the pitch is much higher than in this file; this is a sample from the training dataset: train_sample.zip
Do I need to train an 8k waveglow model as well? Or does this tacotron model just need more training steps?
@fatihkiralioglu how did you come up with those numbers in the hparams file? I want to know how to calculate them so I can train with a different sampling rate as well, because 8000 Hz sounds like a bad radio speaker; I would like something around 16k. How are your results?
@EuphoriaCelestial, in the notebook, did you set the sampling rate correctly to 8 kHz? Note that you should also train a corresponding waveglow model for 8 kHz.
For the configuration parameters, we need to select 50 ms frame sizes with 12.5 ms hop lengths. For a sampling rate of 8 kHz, 50 ms means 8000 * 0.05 = 400 samples. Therefore, the parameters should be:
filter_length=400 hop_length=100 win_length=400 n_mel_channels=80
I guess we need a similar config update for waveglow training too.
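The arithmetic above can be sketched as a small helper (the function name is hypothetical; the 50 ms frame and 12.5 ms hop durations are the ones stated in the comment above):

```python
def stft_params(sampling_rate, frame_ms=50.0, hop_ms=12.5):
    """Derive STFT sizes (in samples) from frame/hop durations in milliseconds."""
    win_length = int(sampling_rate * frame_ms / 1000)   # 8000 * 0.05 = 400
    hop_length = int(sampling_rate * hop_ms / 1000)     # 8000 * 0.0125 = 100
    filter_length = win_length  # often rounded up to a power of two instead
    return filter_length, hop_length, win_length

print(stft_params(8000))   # -> (400, 100, 400)
print(stft_params(16000))  # -> (800, 200, 800)
```

Note that the repo's default 22050 Hz config uses 1024/256/1024, i.e. the FFT size is rounded up to a power of two rather than taken exactly as the window length.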
> In the notebook, did you set the sampling rate correctly to 8 kHz?
Yes, I did.
> Note that you should also train a corresponding waveglow model for 8 kHz.
I am training an 8k waveglow model using this config (same as the tacotron params):
filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80
I thought you said earlier that you only changed mel_fmax for a different sample rate? Have you tried those params for tacotron?
> Therefore, the parameters should be: filter_length=400 hop_length=100 win_length=400 n_mel_channels=80
Yes, I trained an 8 kHz model by just modifying the sample rate, but the resulting model was too noisy. Currently I'm training new tacotron2 and waveglow models for 8 kHz. Hopefully I can share initial results by Monday.
Do you encounter gradient overflow like I do in this issue? https://github.com/NVIDIA/waveglow/issues/205#
No, I always use a batch size of 12; in fact, I'm not sure a batch size of 1 will work.
I am training on a 2080 with a batch size of 24 now, but gradient overflow still happens, although it keeps training and the result is still getting better.
@fatihkiralioglu can you please send me the config.json file you used for the 8k waveglow model? My training crashed again.
@EuphoriaCelestial, I have trained a waveglow model for a sample rate of 8 kHz with params: filter_length=400 hop_length=100 win_length=400 n_mel_channels=40
But the results were unintelligible and training failed. Therefore, I tried the parameters: filter_length=1024 hop_length=256 win_length=1024 n_mel_channels=40
But the checkpoint results still seemed very bad. However, these parameters produce a good waveglow vocoder model for 8 kHz:
filter_length=1024 hop_length=256 win_length=1024 n_mel_channels=80
Thank you! I will try those configs. How long does it take to tell whether a model will produce good results or not? How many epochs?
@EuphoriaCelestial it took a GTX 1080 GPU about 2 days to get intelligible results.
@fatihkiralioglu what about these params? Leave them as default? "segment_length", "mel_fmin", "mel_fmax"
I changed segment_length to 8000 and mel_fmax to 4000 for 8 kHz training.
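Putting the values from this thread together, the `data_config` section of waveglow's config.json would look roughly like this. This is a sketch, not a verified working config: the field names follow NVIDIA/waveglow's config.json, and mel_fmax=4000 is simply the Nyquist frequency of 8 kHz audio.

```json
"data_config": {
    "segment_length": 8000,
    "sampling_rate": 8000,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 4000.0
}
```

The mel channel count (n_mel_channels=80 in the configs discussed above) lives in a separate section of the file, and must match the tacotron2 hparams so the vocoder consumes the spectrograms tacotron2 produces.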
@fatihkiralioglu @EuphoriaCelestial could you please share your generated audio samples?
Sorry for the late response, I was really busy recently. Here are my samples for 16k: sample16k.zip
@EuphoriaCelestial, the synthesis is quite good, but I guess you did not train a speaker-specific waveglow model here. If you use a custom-trained waveglow model, the synthesis quality will be higher.
I used the same data to train both waveglow and tacotron, so I think it is speaker-specific, right?
@EuphoriaCelestial, yes, that's fine then. What is your iteration count for the waveglow model? By the way, is there any background noise in your training audio data? It strongly affects model quality.
@fatihkiralioglu I wrote it in the file names; all 4 samples were synthesized with the same tacotron checkpoint, just with different waveglow checkpoints and denoiser strengths.
No, there is no background noise, since it was recorded in an isolated room. There may be some noise from the recording device, but I can barely hear it.
The results are quite good; did you train tacotron2 and waveglow from scratch? It sounds like the generated audio still has a little background noise. I also got a little background noise when fine-tuning tacotron2 from the pre-trained model on 7 hours of data.
I have some background noise in the original voices; please check the results here.
I don't think your synthesis sounds robotic, or maybe it's because I am not a native English speaker so I can't recognize the difference. Have you tried increasing the denoiser strength or increasing sigma a little? By the way, isn't a 7-hour dataset too small?
> The results are quite good; did you train tacotron2 and waveglow from scratch?
Yes, I trained both the tacotron and waveglow models from scratch, with more than 40 hours of data: trimming the silence at the beginning and end, downsampling, and so on.
> It sounds like the generated audio still has a little background noise.
I think if I continue training waveglow, it will eliminate the background noise; for now, I choose to just increase the denoiser strength.
> Have you tried increasing the denoiser strength or increasing sigma a little? By the way, isn't a 7-hour dataset too small?
Yes, I used different values for the denoiser but saw no improvement. I trained on 7 hours of data using the pre-trained tacotron2 model; I haven't trained from scratch.
How much data do I need for more natural voices?
@EuphoriaCelestial could you please share your mail ID? I have some questions about waveglow and would like to connect.
> How much data do I need for more natural voices?
I used 19 hours for my first tacotron model; it gave very good results after 1000 epochs.
> Could you please share your mail ID? I have some questions about waveglow.
My email: nthanhha26@gmail.com. I am no expert in this, but I will try my best to help.
@EuphoriaCelestial Hello again, thanks for the mail ID.
Which method did you use to downsample the audio files? I'm using this code to downsample; is that correct?
I think it is one of many correct ways to downsample. I used sox, which also reduced the audio quality of the dataset a bit, but it's not noticeable for normal users.
@EuphoriaCelestial If I want the final model to synthesize audio at an 8k sampling rate, and I change sr from 22050 to 8000, what other parameters need to change for both tacotron2 and waveglow? Could you please share your hparams.py for tacotron2 and the config for waveglow?
And I have only 8 hours of data; can I retrain tacotron2 and waveglow with the new sr, or do I need to train from scratch?
> If I want the final model to synthesize audio at an 8k sampling rate, and I change sr from 22050 to 8000, what other parameters need to change for both tacotron2 and waveglow?
You can find all the configurations I used above; just scroll up a bit.
> And I have only 8 hours of data; can I retrain tacotron2 and waveglow with the new sr, or do I need to train from scratch?
I don't understand this part. You will need to train both the tacotron and waveglow models from scratch for a new sample rate.
@EuphoriaCelestial What I mean is: I have only 8 hours of data. If I change the sampling rate, do I need to train from scratch, or can I still do transfer learning from the pre-trained tacotron2 model available in the readme?
If I need to train from scratch, I don't think I have enough data.
I am not sure, but I think you will need to train from scratch, at least for the waveglow model; the tacotron model may still work with transfer learning, but will give worse results.
I think working with an 8 kHz sampling rate is slightly analogous to quantization (e.g. lowering fp32 to int8), even with a lot of data. It's a research problem. Having said that, it would be interesting to see what the best achievable quality would be with an unlimited amount of data.
I wonder what the current wisdom on this is.
I tried an 8 kHz sampling rate before, with a dataset of more than 40 hours, trained from scratch. Both the tacotron and waveglow models aligned very quickly, and I got good results around 300,000 steps, but there was too much noise and the voice sounded too unnatural, so I decided not to use it.
@fatihkiralioglu did you get good-quality natural voices with sr=8000? Could you please share your samples.
Hello. Unfortunately, I could not get a decent, noise-free voice with 8 kHz data. I guess win_length should be 400 and hop_length 100, while filter_length should stay at 1024. But I trained with win_length 1024, so the results were too noisy. @EuphoriaCelestial, could you share your trained waveglow model if possible, so that we can check it on other languages and speaker databases, or use it as a warm start for training speaker-specific vocoders?
https://github.com/NVIDIA/waveglow/issues/54 In this issue they talked about lowering some parameters to maximize inference speed, but I don't know how to do it properly: what can be reduced and what needs to stay. Has anyone done this before? Please send me your hparams configuration.
If I trained my model using fp32, can it run inference in fp16, and vice versa? In that case, will it improve inference speed? I am using an RTX 2080 Ti; my model runs 7 times faster than real time, and I am pretty sure it can be improved.
And one more thing: is there any benefit to running inference on multiple GPUs?