bshall / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"
https://bshall.github.io/UniversalVocoding/
MIT License

About Speaker Voice #17

Closed shoegazerstella closed 4 years ago

shoegazerstella commented 4 years ago

I was playing with the preprocessing parameters and was able to change the sound of the synthesized voice a bit. I was wondering if there is a clever way to do this in terms of pitch, energy, style, timbre, etc. Thanks!

bshall commented 4 years ago

Hi @shoegazerstella,

It's fun to mess with the inputs but I think changing the speech characteristics in any systematic way is pretty difficult. I remember the issue in #3 was that changing num_fft resulted in a pitch shift. I think a more principled method would be vocal tract length perturbation (see "Vocal tract length perturbation (VTLP) improves speech recognition" for details). It's relatively easy to mess with the mel filters in librosa so that'd be a simple place to start.
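As a rough illustration of the VTLP idea, here is a minimal sketch of the piecewise-linear frequency warp as I understand it from the paper cited above. It is not code from this repo: the function name `vtlp_warp`, the 16 kHz sample rate, and the 4800 Hz boundary `f_hi` are assumptions chosen for the example.

```python
def vtlp_warp(freq_hz, alpha, sample_rate=16000, f_hi=4800.0):
    """Warp a frequency in Hz by factor alpha (hypothetical helper, a sketch).

    Below a boundary frequency the warp is a plain scaling by alpha; above it,
    a second linear segment brings the curve back so that Nyquist maps to Nyquist.
    Assumes f_hi is below the Nyquist frequency and alpha is close to 1.
    """
    nyquist = sample_rate / 2.0
    # Point where the first segment ends (chosen so the warp stays within [0, nyquist]).
    boundary = f_hi * min(alpha, 1.0) / alpha
    if freq_hz <= boundary:
        return freq_hz * alpha
    # Straight line from (boundary, boundary * alpha) to (nyquist, nyquist).
    slope = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    return nyquist - slope * (nyquist - freq_hz)


# Example: warp a few filter center frequencies with alpha slightly above 1.
centers = [250.0, 1000.0, 4000.0, 8000.0]
warped = [vtlp_warp(f, alpha=1.1) for f in centers]
```

The warped values could then serve as the center frequencies of a custom mel filterbank (`librosa.filters.mel` builds the standard, unwarped one), with alpha typically sampled from a narrow range around 1.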

Otherwise, if you're interested in changing the speaker entirely I've done some work on voice conversion here. There are also a bunch of papers/repos that convert the spectrogram directly and then synthesize with a vocoder (happy to suggest some if you're interested).

shoegazerstella commented 4 years ago

> if you're interested in changing the speaker entirely I've done some work on voice conversion here. There are also a bunch of papers/repos that convert the spectrogram directly and then synthesize with a vocoder (happy to suggest some if you're interested).

Exactly, my aim is to change the speaker entirely.

I was reading more on voice cloning and I did find these two works:

But if I understand correctly, your approach to voice conversion is a little bit different. I'll look into it more! It would be awesome if you could suggest other approaches too! Thanks a lot!

bshall commented 4 years ago

No problem!

Well, there are two options:

  1. Voice cloning (as you mentioned) - where you synthesize speech in a specific voice from text.
  2. Voice conversion - where you take audio from one speaker and directly convert it to a target speaker.

I think Real-Time-Voice-Cloning is the best available open-source project for voice cloning. For voice conversion, there are https://github.com/liusongxiang/StarGAN-Voice-Conversion and https://github.com/auspicious3000/autovc, for example.

Hope that helps!

shoegazerstella commented 4 years ago

So yes, there are indeed two approaches. For the TTS part I was using an implementation of FastSpeech2 and, to be honest, I didn't want to change that because it's super fast on CPU. So I might try both approaches and decide based on both quality of results and speed. Again, thanks a lot! :)