aishweta opened this issue 4 years ago
Your dataset passed the proof-of-concept stage; now train from scratch (both Tacotron and WaveGlow) to get better results.
@ksaidin I have a very small dataset, only 7 hours. As I understand it, I need more than 20 hours of data to train from scratch. Is that right?
There is no strict 20-hour rule. You can give it a try; there are various methods to augment your dataset or penalize the model during training if needed.
Will training WaveGlow improve my results?
The samples you shared sound really good! You can use iZotope RX to remove the background noise. I assume this background noise is in your training data otherwise it wouldn't be present during inference. Hence, remove the background noise from all samples and train again.
You can change the denoising strength of WaveGlow's Denoiser to get rid of the high, whistle-like sound. Also make sure you're using the latest WaveGlow checkpoint, waveglow_256channels_universal_v5.
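For reference, the Denoiser shipped with the NVIDIA tacotron2 repo takes the denoising strength at call time, separately from WaveGlow's sigma. A minimal sketch, assuming the repo is on your path and that `waveglow` and `mel_outputs_postnet` are already loaded (those two names are placeholders for your own checkpoint and Tacotron2 output, not code from this thread):

```python
import torch
from denoiser import Denoiser  # ships with the NVIDIA tacotron2 repo

# waveglow: a loaded waveglow_256channels_universal_v5 model (placeholder)
# mel_outputs_postnet: the mel spectrogram from Tacotron2 (placeholder)
denoiser = Denoiser(waveglow)

with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)

# Raise strength gradually (e.g. 0.01 -> 0.1) until the whistle disappears;
# too much strength starts to dull the speech itself.
audio_denoised = denoiser(audio, strength=0.01)[:, 0]
```

Note that sigma controls WaveGlow's sampling temperature, while strength controls how much of the estimated bias spectrum the Denoiser subtracts, so tuning sigma alone won't remove the whistle.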
@rafaelvalle Thanks for the reply.
Yes, I have some background noise in my data. I tried removing the noise with Audacity, and I'll hopefully train on the new data next week. Thanks for suggesting the approach to eliminate the noise, I'll try it.
I used WaveGlow's denoiser and tried increasing and decreasing the sigma value, but I haven't seen much difference; the audio still sounds robotic.
I tried to use the latest pre-trained model, waveglow_256channels_universal_v5, but I got an error: "'WN' object has no attribute 'cond_layer'"
@ksaidin I tried this and changed the code in glow.py and convert_model.py, but I got NaN values out of WaveGlow. I do have values in mel_postnets, though.
@shwetagargade216 Instead of changing the code in glow.py, just update the whole "tacotron2/waveglow" directory from https://github.com/NVIDIA/waveglow/
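If NaNs still show up after updating, it helps to confirm which stage first produces them. A hedged debugging sketch (the commented-out tensor names at the bottom are placeholders for the Tacotron2 and WaveGlow outputs, assuming PyTorch):

```python
import torch

def report_nans(name, t):
    """Print whether a tensor contains NaNs and, if so, where they start."""
    mask = torch.isnan(t)
    if mask.any():
        first = mask.flatten().nonzero()[0].item()
        print(f"{name}: NaNs present, first at flat index {first}")
    else:
        print(f"{name}: no NaNs (min={t.min():.3f}, max={t.max():.3f})")

# report_nans("mel_postnet", mel_outputs_postnet)  # placeholder tensor
# report_nans("waveglow audio", audio)             # placeholder tensor
```

If the mel spectrogram is clean but the audio is NaN, the problem is in the WaveGlow checkpoint/code mismatch rather than in Tacotron2.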
@shwetagargade216 Hi, did you resolve the problem of the robotic voice? I'm facing the same problem...
@luvwinnie Yes, my problem is resolved. I changed the sampling rate from 22050 Hz to 8000 Hz using ffmpeg, adjusted the dependent parameters accordingly, and trained both Tacotron2 and WaveGlow. I got better results than before.
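For anyone else retraining at a lower rate: the "dependent parameters" are mainly the STFT settings in hparams.py, which should scale with the sampling rate, and mel_fmax, which must stay at or below Nyquist (half the new rate). A rough pure-Python sketch of that scaling; the 22050 Hz defaults are the NVIDIA repo's, but the 8000 Hz values it produces are my own back-of-the-envelope choice, not necessarily the exact settings used above:

```python
import math

# Defaults from NVIDIA tacotron2 hparams.py at 22050 Hz
old_rate, new_rate = 22050, 8000
defaults = {"filter_length": 1024, "hop_length": 256,
            "win_length": 1024, "mel_fmax": 8000.0}

def scale_stft_params(params, old_rate, new_rate):
    """Scale window/hop to keep roughly the same frame duration in ms,
    rounding to the nearest power of two, and cap mel_fmax at Nyquist."""
    ratio = new_rate / old_rate
    pow2 = lambda n: 2 ** round(math.log2(n))
    return {
        "filter_length": pow2(params["filter_length"] * ratio),
        "win_length": pow2(params["win_length"] * ratio),
        "hop_length": pow2(params["hop_length"] * ratio),
        "mel_fmax": min(params["mel_fmax"], new_rate / 2),
    }

print(scale_stft_params(defaults, old_rate, new_rate))
```

At 8000 Hz this yields a 512-sample window with a 128-sample hop and mel_fmax of 4000 Hz; leaving mel_fmax at 8000 Hz after resampling to 8 kHz would put mel bands above Nyquist, where there is no signal.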
@shwetagargade216 Thank you so much! So from your experience, does that mean we should train the model on data at a lower sampling rate, rather than 22050 Hz or even 44.1 kHz?
@rafaelvalle @pravn @ksaidin @CookiePPP
I have a 7.4-hour dataset of female American English-accented audio. I've removed the leading and trailing silence using @Yeongtae's preprocessing.
I trained with transfer learning from the NVIDIA Tacotron2 pre-trained model and got a good alignment plot at 30100 steps. For audio synthesis I used the pre-trained waveglow_256channels.pt.
Questions:
Any comments?
Please check the audio samples.