NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Robotic voice using transfer learning, How to improve it? #373

Open aishweta opened 4 years ago

aishweta commented 4 years ago

@rafaelvalle @pravn @ksaidin @CookiePPP

I have a 7.4-hour audio dataset of a female speaker with an American English accent. I've removed the leading and trailing silence using @Yeongtae's preprocessing.

I've trained with transfer learning from the NVIDIA Tacotron 2 pre-trained model and got a good alignment plot at 30100 steps. For audio synthesis I used the pre-trained waveglow_256channels.pt.

Questions:

  1. The audio sounds robotic; how can I improve the voice quality? (I want a more natural sound.)
  2. How can I eliminate background noise without affecting the voice quality?

Any comments?

Please check the Audio samples.
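
For reference, the synthesis step described above roughly follows the repo's inference.ipynb. A minimal sketch of that pipeline, assuming a checkpoint saved by this repo's train.py (the paths and the input sentence are placeholders):

```python
import sys
import numpy as np
import torch

sys.path.append('waveglow/')                 # so the pickled WaveGlow class can be found
from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()

# fine-tuned Tacotron 2 checkpoint (placeholder path)
model = load_model(hparams)
model.load_state_dict(torch.load("outdir/checkpoint_30100")["state_dict"])
model.cuda().eval()

# pre-trained WaveGlow vocoder
waveglow = torch.load("waveglow_256channels.pt")["model"]
waveglow.cuda().eval()

# text -> mel spectrogram -> waveform
sequence = np.array(text_to_sequence("Hello world.", ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()
with torch.no_grad():
    mel, mel_postnet, _, alignment = model.inference(sequence)
    audio = waveglow.infer(mel_postnet, sigma=0.666)
```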

ksaidin commented 4 years ago

Your dataset has passed the proof-of-concept stage; now train from scratch (both Tacotron 2 and WaveGlow) to get better results.

aishweta commented 4 years ago

@ksaidin I have a very small dataset, only 7 hours. As per my understanding, I need more than 20 hours of data for training from scratch. Is that right?

ksaidin commented 4 years ago

There is no strict 20-hour rule. You can give it a try; there are various methods to augment your dataset or penalize the model during training if needed.

aishweta commented 4 years ago

Will training WaveGlow improve my results?

rafaelvalle commented 4 years ago

The samples you shared sound really good! You can use iZotope RX to remove the background noise. I assume this background noise is in your training data otherwise it wouldn't be present during inference. Hence, remove the background noise from all samples and train again.
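
iZotope RX is a GUI tool; if you prefer to batch-clean the whole dataset from a script, a spectral-gating library such as noisereduce can do a rough first pass. This is an illustrative alternative, not something suggested in this thread, and the directory layout is made up:

```python
# batch denoising sketch; the noisereduce package and the wavs/ layout are
# illustrative assumptions, not part of this repo or this thread
import glob
import os

import noisereduce as nr
import soundfile as sf

os.makedirs("wavs_clean", exist_ok=True)
for path in glob.glob("wavs/*.wav"):
    audio, sr = sf.read(path)
    cleaned = nr.reduce_noise(y=audio, sr=sr)    # spectral gating
    sf.write(os.path.join("wavs_clean", os.path.basename(path)), cleaned, sr)
```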

You can change the denoising strength of WaveGlow's Denoiser to get rid of the high, whistle-like sound. Also make sure you're using the latest WaveGlow model, waveglow_256channels_universal_v5.
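
A minimal sketch of those two suggestions, following the repo's inference notebook (the strength value here is just a starting point to tune, not a recommendation from this thread):

```python
import sys
import torch

sys.path.append('waveglow/')    # the waveglow submodule must be recent enough for the v5 checkpoint
from denoiser import Denoiser

waveglow = torch.load("waveglow_256channels_universal_v5.pt")["model"]
waveglow.cuda().eval()
denoiser = Denoiser(waveglow).cuda()

def vocode(mel_postnet, strength=0.01):
    """mel_postnet: (1, n_mel_channels, T) tensor from Tacotron 2's inference()."""
    with torch.no_grad():
        audio = waveglow.infer(mel_postnet, sigma=0.666)
        # raise strength to suppress more of the whistle-like bias noise;
        # too high a value starts to dull the voice itself
        return denoiser(audio, strength=strength)[:, 0]
```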

aishweta commented 4 years ago

@rafaelvalle Thanks for the reply.

Yes, I have some background noise in my data. I tried removing the noise using Audacity; hopefully I'll train on the new data next week. By the way, thanks for suggesting the approach to eliminate the noise, I'll try it.

I used WaveGlow's denoiser and tried increasing and decreasing the sigma value, but I haven't seen much difference; the audio still sounds robotic.

ksaidin commented 4 years ago

> I tried to use the latest pre-trained WaveGlow model, waveglow_256channels_universal_v5, but I was getting the error: "'WN' object has no attribute 'cond_layer'"

Update the Tacotron2/waveglow submodule.

aishweta commented 4 years ago

@ksaidin I tried this and changed the code in glow.py and convert_model.py, but I got NaN values after WaveGlow.

But I do have values in mel_postnets.

ksaidin commented 4 years ago

@shwetagargade216 Instead of changing the code of glow.py, just update the whole "tacotron2/waveglow" submodule from https://github.com/NVIDIA/waveglow/

luvwinnie commented 3 years ago

@shwetagargade216 Hi, did you resolve the problem of the robotic voice? I'm facing the same problem...

aishweta commented 3 years ago

@luvwinnie Yes, my problem is resolved. I changed the sampling rate from 22050 Hz to 8000 Hz using ffmpeg, adjusted the dependent parameters along with it, and trained both Tacotron 2 and WaveGlow. I got better results than before.
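
For anyone trying the same change: the sampling-rate-dependent settings live in hparams.py (and in WaveGlow's config.json), and both models need to be retrained with matching audio settings. A rough sketch of the kind of adjustments involved, with illustrative 8 kHz values that are assumptions rather than the exact ones used above:

```python
from hparams import create_hparams

hparams = create_hparams()
hparams.sampling_rate = 8000   # was 22050; must match the wavs resampled with ffmpeg
hparams.mel_fmax = 4000.0      # was 8000.0; keep at or below Nyquist (sampling_rate / 2)
# filter_length, hop_length and win_length may also be scaled down for the lower rate;
# WaveGlow's config.json needs the same audio settings before retraining the vocoder
```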

luvwinnie commented 3 years ago

@shwetagargade216 Thank you so much! So from your experience, does that mean we should use a lower sampling rate to train the model, instead of 22050 Hz or even 44.1 kHz?