NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License
5.12k stars 1.39k forks

Optimize model for inference speed #348

Closed EuphoriaCelestial closed 4 years ago

EuphoriaCelestial commented 4 years ago

https://github.com/NVIDIA/waveglow/issues/54 In this issue, they discuss lowering some parameters to maximize inference speed, but I don't know how to do it properly: what can be reduced and what needs to remain? Has anyone done this before? Please send me your hparams configuration.

If I trained my model in FP32, can it run inference in FP16, and vice versa? If so, will it improve inference speed? I am using an RTX 2080 Ti; my model runs 7 times faster than real time, and I am pretty sure it can be improved.

And one more thing: is there any benefit to running inference on multiple GPUs?

rafaelvalle commented 4 years ago

Yes, you can run inference in FP16 regardless of how you trained. FP16 should improve inference speed as long as your GPU has tensor cores.

EuphoriaCelestial commented 4 years ago

How about the parameter tuning? This is my main concern, since I am already using FP16. I can re-train the model if needed; all I need is speed.

fatihkiralioglu commented 4 years ago

Hi, I have trained a Tacotron 2 model on an 8 kHz dataset and the inference speed is amazing: for a 10-second synthesis, the average inference duration is about 0.5 seconds on a Tesla P100. What I wonder is whether fp16_run is really effective; I have run tests on this parameter and inference speed seems unaffected. Thanks.

EuphoriaCelestial commented 4 years ago

> Hi, I have trained a Tacotron 2 model on an 8 kHz dataset and the inference speed is amazing: for a 10-second synthesis, the average inference duration is about 0.5 seconds on a Tesla P100. What I wonder is whether fp16_run is really effective; I have run tests on this parameter and inference speed seems unaffected. Thanks.

Well, as far as I can tell, it's mainly because your sampling rate is only 8 kHz. I trained on a 22050 Hz dataset, FP16, default params, and an 11-second synthesis (the maximum output audio length with the default decoder length) took more than 2 seconds on an RTX 2080 Ti.

EuphoriaCelestial commented 4 years ago

@fatihkiralioglu can you kindly send me your hparams.py file? I will downsample my dataset to train at 8 kHz too. All I need is speed.

fatihkiralioglu commented 4 years ago

I have added hparams.zip as an attachment.

EuphoriaCelestial commented 4 years ago

> I have added hparams.zip as an attachment.

Thank you! Your audio files are 16-bit PCM, right? Is there any noise or echo in the inference results?

fatihkiralioglu commented 4 years ago

Yes, the audio files are 16-bit PCM.

EuphoriaCelestial commented 4 years ago

How about the other configurations you mentioned in https://github.com/NVIDIA/tacotron2/issues/359#event-3392003764? What was the result? I thought it needs more params tuned than just mel_fmax to work with a different sampling rate.

fatihkiralioglu commented 4 years ago

Yes, I used exactly those parameters; it is the same config as the attachment. Modifying win_length and hop_length may give better results; I may also try them in the future.

fatihkiralioglu commented 4 years ago

@EuphoriaCelestial by the way, just to make sure: in the tacotron2 project there is no support for multi-GPU inference, right? There is support for multi-GPU training, but I could not find anything indicating multi-GPU inference. Thanks.

EuphoriaCelestial commented 4 years ago

> @EuphoriaCelestial by the way, just to make sure: in the tacotron2 project there is no support for multi-GPU inference, right? There is support for multi-GPU training, but I could not find anything indicating multi-GPU inference. Thanks.

Yeah, but you can do multi-GPU inference with some simple threading code. Or you can just load one model per GPU; actually, depending on your model size and your GPU's VRAM, you can load more than one. In my case I can load two models on each RTX 2080 Ti.
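A minimal sketch of that threading approach. The `load_model` and `synthesize` stubs below are hypothetical placeholders; in a real setup they would load Tacotron 2 + WaveGlow onto the given device and run text-to-audio on it:

```python
from concurrent.futures import ThreadPoolExecutor

def load_model(device):
    # Placeholder: really you would build the model and move it to `device`.
    return {"device": device}

def synthesize(model, text):
    # Placeholder: really you would run text -> mel -> audio on the model's GPU.
    return (model["device"], text)

devices = ["cuda:0", "cuda:1"]             # one model per GPU (or several)
models = [load_model(d) for d in devices]
texts = ["sentence 1", "sentence 2", "sentence 3", "sentence 4"]

# Round-robin the requests across the models; one thread drives each
# request, so both GPUs stay busy at the same time.
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    jobs = [pool.submit(synthesize, models[i % len(models)], t)
            for i, t in enumerate(texts)]
    results = [j.result() for j in jobs]

print(results)
# [('cuda:0', 'sentence 1'), ('cuda:1', 'sentence 2'),
#  ('cuda:0', 'sentence 3'), ('cuda:1', 'sentence 4')]
```

Threads (rather than processes) are enough here because the heavy work happens inside CUDA kernels, which release the GIL.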

EuphoriaCelestial commented 4 years ago

@fatihkiralioglu I have trained a new model on an 8 kHz dataset using your params; the result after 80k steps is kind of strange, hear this: 80000.zip

Note that the voice in my dataset is female and the pitch is much higher than in this file; this is a sample from the training dataset: train_sample.zip

Do I need to train an 8 kHz WaveGlow model as well, or does this Tacotron model just need more training steps?

EuphoriaCelestial commented 4 years ago

@fatihkiralioglu how did you come up with those numbers in the hparams file? I want to know how to calculate them so I can train with a different sampling rate as well, because 8000 Hz sounds like a bad radio speaker and I would like something around 16 kHz. How are your results?

fatihkiralioglu commented 4 years ago

@EuphoriaCelestial, in the notebook, did you set the sampling rate correctly to 8 kHz? Note that you should also train a corresponding WaveGlow model for 8 kHz.

For the configuration parameters, we need to select 50 ms frame sizes with 12.5 ms hop lengths. For a sampling rate of 8 kHz, 50 ms means 8000 * 0.05 = 400. Therefore, the parameters should be:

filter_length=400, hop_length=100, win_length=400, n_mel_channels=80

I guess we need a similar config update for waveglow training too.
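That arithmetic generalizes to any sample rate; a small helper (hypothetical, just restating the 50 ms window / 12.5 ms hop rule above) makes the calculation explicit:

```python
def stft_params(sample_rate, frame_ms=50.0, hop_ms=12.5):
    """Derive STFT sizes from the sample rate, using 50 ms windows
    and 12.5 ms hops as described above."""
    win_length = int(sample_rate * frame_ms / 1000)  # samples per window
    hop_length = int(sample_rate * hop_ms / 1000)    # samples per hop
    # The FFT size (filter_length) is taken equal to win_length here; note
    # that the repo's 22050 Hz defaults round it to a power of two (1024).
    return {"filter_length": win_length,
            "win_length": win_length,
            "hop_length": hop_length}

print(stft_params(8000))
# {'filter_length': 400, 'win_length': 400, 'hop_length': 100}
```

For 22050 Hz this rule gives 1102/275, which the stock hparams round to the nicer FFT sizes 1024/256, so treat the helper as a starting point rather than an exact recipe.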

EuphoriaCelestial commented 4 years ago

> In the notebook, did you set the sampling rate correctly to 8 kHz?

Yes, I did.

> Note that you should also train a corresponding WaveGlow model for 8 kHz.

I am training an 8 kHz WaveGlow using this config (the same as the Tacotron params):

filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80

> Therefore, the parameters should be: filter_length=400, hop_length=100, win_length=400, n_mel_channels=80

I thought you said earlier that you only changed mel_fmax for a different sample rate? Have you tried those params for Tacotron?

fatihkiralioglu commented 4 years ago

Yes, I trained an 8 kHz model by just modifying the sample rate, but the resulting model was too noisy. Currently I'm trying to train new Tacotron 2 and WaveGlow models for 8 kHz. Hopefully I can share the initial results by Monday.

EuphoriaCelestial commented 4 years ago

> Yes, I trained an 8 kHz model by just modifying the sample rate, but the resulting model was too noisy. Currently I'm trying to train new Tacotron 2 and WaveGlow models for 8 kHz. Hopefully I can share the initial results by Monday.

Do you encounter gradient overflow like I do in this issue? https://github.com/NVIDIA/waveglow/issues/205#

fatihkiralioglu commented 4 years ago

No, I always use a batch size of 12; in fact, I'm not sure a batch size of 1 will work.

EuphoriaCelestial commented 4 years ago

> No, I always use a batch size of 12; in fact, I'm not sure a batch size of 1 will work.

I am training on a 2080 with a batch size of 24 now, but gradient overflow still happens, although training continues and the result keeps getting better.

EuphoriaCelestial commented 4 years ago

@fatihkiralioglu can you please give me the config.json file you used for the 8 kHz WaveGlow model? My training crashed again.

fatihkiralioglu commented 4 years ago

@EuphoriaCelestial, I have trained a WaveGlow model for a sample rate of 8 kHz with the params: filter_length=400, hop_length=100, win_length=400, n_mel_channels=40

But the results were unintelligible and training failed. Therefore, I'm now trying the parameters: filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=40

But the checkpoint results still seem very bad. However, these parameters produce a good WaveGlow vocoder model for 8 kHz:

filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80

EuphoriaCelestial commented 4 years ago

> @EuphoriaCelestial, I have trained a WaveGlow model for a sample rate of 8 kHz with the params: filter_length=400, hop_length=100, win_length=400, n_mel_channels=40

> But the results were unintelligible and training failed. Therefore, I'm now trying the parameters: filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=40

> But the checkpoint results still seem very bad. However, these parameters produce a good WaveGlow vocoder model for 8 kHz:

> filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80

Thank you! I will try those configs. How long does it take to confirm whether a model will produce good results? How many epochs?

fatihkiralioglu commented 4 years ago

@EuphoriaCelestial it took about 2 days on a GTX 1080 GPU to get intelligible results.

EuphoriaCelestial commented 4 years ago

@fatihkiralioglu how about these params? Should I leave them as default?

"segment_length", "mel_fmin", "mel_fmax"

fatihkiralioglu commented 4 years ago

I changed segment_length to 8000 and mel_fmax to 4000 for the 8 kHz training.
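Pulling together the values from this thread, the data_config section of the WaveGlow config.json for 8 kHz would look roughly like this (a sketch, not a verified config; training_files is left out, and mel_fmax=4000 matches the 4 kHz Nyquist limit of 8 kHz audio):

```json
"data_config": {
    "segment_length": 8000,
    "sampling_rate": 8000,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 4000.0
}
```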

aishweta commented 4 years ago

@fatihkiralioglu @EuphoriaCelestial could you please share your generated audio samples?

EuphoriaCelestial commented 4 years ago

> @fatihkiralioglu @EuphoriaCelestial could you please share your generated audio samples?

Sorry for the late response, I was really busy recently. These are my samples for 16 kHz: sample16k.zip

fatihkiralioglu commented 4 years ago

@EuphoriaCelestial, the synthesis is quite good, but I guess you did not train a speaker-specific WaveGlow model here. If you use a custom-trained WaveGlow model, the synthesis quality will be higher.

EuphoriaCelestial commented 4 years ago

> @EuphoriaCelestial, the synthesis is quite good, but I guess you did not train a speaker-specific WaveGlow model here. If you use a custom-trained WaveGlow model, the synthesis quality will be higher.

I used the same data to train both WaveGlow and Tacotron, so I think it is speaker-specific, right?

fatihkiralioglu commented 4 years ago

@EuphoriaCelestial, yes, it is fine then. What is your iteration count for the WaveGlow model? By the way, is there any background noise in your training audio data? It strongly affects model quality.

EuphoriaCelestial commented 4 years ago

@fatihkiralioglu I wrote it in the file names; all 4 samples were synthesized with the same Tacotron checkpoint, just different WaveGlow checkpoints and denoiser strengths.

EuphoriaCelestial commented 4 years ago

> @EuphoriaCelestial, yes, it is fine then. What is your iteration count for the WaveGlow model? By the way, is there any background noise in your training audio data? It strongly affects model quality.

No, there is no background noise, since it was recorded in an isolated room. I think there is some noise from the recording device, but I can barely hear it.

aishweta commented 4 years ago

> Sorry for the late response, I was really busy recently. These are my samples for 16 kHz: sample16k.zip

The results are quite good; did you train Tacotron 2 and WaveGlow from scratch? It looks like the generated audio still has a little background noise. I also got a little background noise when fine-tuning Tacotron 2 from the pre-trained model on 7 hours of data.

aishweta commented 4 years ago

I have some background noise in the original voices; please check the results here.

EuphoriaCelestial commented 4 years ago

> I have some background noise in the original voices; please check the results here.

I don't think your synthesis sounds robotic, or maybe it's because I am not a native English speaker, so I can't recognize the difference. Have you tried increasing the denoiser strength or increasing sigma a little? By the way, isn't a 7-hour dataset too small?

EuphoriaCelestial commented 4 years ago

> The results are quite good; did you train Tacotron 2 and WaveGlow from scratch?

Yes, I trained both the Tacotron and WaveGlow models from scratch, with more than 40 hours of data, trimming the silence at the beginning and the end, downsampling, etc.

> It looks like the generated audio still has a little background noise.

I think if I continue training WaveGlow it will eliminate the background noise; for now, I choose to just increase the denoiser strength.

aishweta commented 4 years ago

> I have some background noise in the original voices; please check the results here.

> I don't think your synthesis sounds robotic, or maybe it's because I am not a native English speaker, so I can't recognize the difference. Have you tried increasing the denoiser strength or increasing sigma a little? By the way, isn't a 7-hour dataset too small?

Yes, I used different values for the denoiser, but no improvement. I trained on 7 hours of data using the pre-trained Tacotron 2 model; I haven't trained from scratch.

How much data do I need for more natural voices?

aishweta commented 4 years ago

@EuphoriaCelestial could you please share your mail ID? I have some questions about WaveGlow and would like to connect.

EuphoriaCelestial commented 4 years ago

> How much data do I need for more natural voices?

I used 19 hours for my first Tacotron model; it gave very good results after 1000 epochs.

> @EuphoriaCelestial could you please share your mail ID? I have some questions about WaveGlow and would like to connect.

My email: nthanhha26@gmail.com. I am no expert in this, but I will try my best to help.

aishweta commented 4 years ago

@EuphoriaCelestial Hello again, thanks for the mail ID.

Which method did you use to downsample the audio files?

I'm using this code to downsample; is that correct?

EuphoriaCelestial commented 4 years ago

> Which method did you use to downsample the audio files?

> I'm using this code to downsample; is that correct?

I think it is one of many correct ways to downsample. I used sox, which also reduced the audio quality of the dataset, but it's not noticeable to normal users.
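For reference, a typical sox invocation for this kind of downsampling (the file names are placeholders; sox applies its own anti-aliasing filter when changing the rate):

```shell
# Convert to 8 kHz, 16-bit, mono; input/output names are placeholders.
sox input_22k.wav -r 8000 -b 16 -c 1 output_8k.wav
```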

aishweta commented 4 years ago

@EuphoriaCelestial If I want the final model to synthesize audio at an 8 kHz sampling rate, and I change sr from 22050 to 8000, what other parameters need to change for both Tacotron 2 and WaveGlow? Could you please share hparams.py for Tacotron 2 and the config for WaveGlow?

Also, I have only 8 hours of data; can I retrain Tacotron 2 and WaveGlow with the new sr, or do I need to train from scratch?

EuphoriaCelestial commented 4 years ago

> @EuphoriaCelestial If I want the final model to synthesize audio at an 8 kHz sampling rate, and I change sr from 22050 to 8000, what other parameters need to change for both Tacotron 2 and WaveGlow? Could you please share hparams.py for Tacotron 2 and the config for WaveGlow?

You can find all the configurations I used above; just scroll up a bit.

> Also, I have only 8 hours of data; can I retrain Tacotron 2 and WaveGlow with the new sr, or do I need to train from scratch?

I don't understand this part. You will need to train both the Tacotron and WaveGlow models from scratch for a new sample rate.

aishweta commented 4 years ago

@EuphoriaCelestial What I mean is: I have just 8 hours of data, and if I change the sampling rate, do I need to train from scratch, or can I still do transfer learning using the pre-trained Tacotron 2 model available in the README?

If I need to train from scratch, I don't think I have enough data.

EuphoriaCelestial commented 4 years ago

> @EuphoriaCelestial What I mean is: I have just 8 hours of data, and if I change the sampling rate, do I need to train from scratch, or can I still do transfer learning using the pre-trained Tacotron 2 model available in the README?

> If I need to train from scratch, I don't think I have enough data.

I am not sure, but I think you will need to train from scratch, at least for the WaveGlow model; the Tacotron model may still work with transfer learning, but it will give worse results.

pravn commented 4 years ago

I think working with an 8 kHz sampling rate is somewhat analogous to quantization (e.g. lowering 32-bit FP to INT8), even with a lot of data. It's a research problem. Having said that, it would be interesting to see what the best quality sound would be if we did use unlimited amounts of data.

I wonder what the current wisdom on this is.


EuphoriaCelestial commented 4 years ago

I tried an 8 kHz sampling rate before, with a dataset of more than 40 hours, training from scratch. Both the Tacotron and WaveGlow models aligned very quickly, and I got good results around 300,000 steps, but there was too much noise and the voice was too unnatural, so I decided not to use it.

aishweta commented 4 years ago

@fatihkiralioglu did you get good-quality natural voices with sr=8000? Could you please share your samples?

fatihkiralioglu commented 4 years ago

Hello. Unfortunately, I could not get a decent, noise-free voice with 8 kHz data. I guess win_length should be 400 and hop_length 100, while the filter_length of 1024 should stay constant; but I trained with win_length 1024, therefore the results were too noisy. @EuphoriaCelestial, could you share your trained WaveGlow model if possible, so that we can check it with other languages and speaker databases, or use it as a warm start for training speaker-specific vocoders?