NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

[Question] fmin and fmax for female #420

Open george-roussos opened 4 years ago

george-roussos commented 4 years ago

Hi, I have seen this mentioned a couple times, but not really very touched upon. I notice that the TTS I train has trouble with the F0 of my female speaker (it glottalizes too much to the point where it just sounds too hoarse) and in higher frequencies it also sounds a bit funny. Are there special fmin and fmax values for female voices, or is this just a limitation of the TTS? I have always gone with fmin=0 and fmax=8000, but maybe that is wrong. What is the intuition behind using higher fmin values?
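For context, fmin and fmax only set the frequency range that the mel filterbank covers (in this repo they are the `mel_fmin`/`mel_fmax` hparams, passed to librosa's mel filter builder). A minimal sketch of how they determine the filter band edges, using the HTK mel formula as an assumption (librosa's default is the Slaney variant, so exact edge values will differ):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale -- an assumption; librosa defaults to the
    # Slaney variant, so the exact numbers differ slightly.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_mels, fmin, fmax):
    # n_mels triangular filters need n_mels + 2 equally spaced mel points
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    return mel_to_hz(mels)

edges_0  = mel_band_edges(80, 0.0, 8000.0)   # Tacotron 2 defaults
edges_95 = mel_band_edges(80, 95.0, 8000.0)  # the suggested female fmin
print(edges_0[:3])   # lowest band edges start at 0 Hz
print(edges_95[:3])  # lowest band edges start at 95 Hz
```

Raising fmin simply discards energy below that frequency and redistributes the 80 filters over a narrower range; it does not otherwise change how the model sees pitch.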

Welsun commented 4 years ago

It says here that 95 is better for female voices.

EuphoriaCelestial commented 4 years ago

> It says here that 95 is better for female voices.

Can I change this value at inference, or does it need to be set before training the model?

Welsun commented 4 years ago

> > It says here that 95 is better for female voices.
>
> Can I change this value at inference, or does it need to be set before training the model?

I think you need to set them before training.

QUTGXX commented 4 years ago

> > > It says here that 95 is better for female voices.
> >
> > Can I change this value at inference, or does it need to be set before training the model?
>
> I think you need to set them before training.

Hi, I want to know how to test the pitch information on my own dataset. Could you please tell me how to address this?
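One way to check the pitch range of a dataset is to run an F0 estimator over the files and look at the statistics; in practice a tool like `librosa.pyin` or a dedicated pitch tracker is the better choice. Below is a toy autocorrelation-based sketch on a synthetic tone, purely illustrative and not part of this repo:

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=60.0, fmax=500.0):
    # Toy autocorrelation pitch estimator -- a sketch only; for real
    # speech prefer librosa.pyin or a dedicated pitch tracker.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag search range
    lag = lo + np.argmax(ac[lo:hi])           # strongest periodicity
    return sr / lag

sr = 22050
t = np.arange(int(0.1 * sr)) / sr
tone = np.sin(2 * np.pi * 220.0 * t)  # synthetic 220 Hz "voice"
print(estimate_f0_autocorr(tone, sr))
```

Running this per frame over a whole dataset and plotting a histogram of the estimates gives a quick view of where the speaker's F0 actually sits relative to your chosen fmin.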

CookiePPP commented 4 years ago

I've tested n_mel values of 80, 160 and 320; fmin values of 0, 100 and 20; and a full 2400-channel STFT with an ISO 226-weighted loss. I can say right now that the min frequency you pick will make almost zero difference.

QUTGXX commented 4 years ago

> I've tested n_mel values of 80, 160 and 320; fmin values of 0, 100 and 20; and a full 2400-channel STFT with an ISO 226-weighted loss. I can say right now that the min frequency you pick will make almost zero difference.

Do you mean that the fmin parameter would not influence the mel features?

CookiePPP commented 4 years ago

@QUTGXX If you change fmin, you will have to retrain the vocoder to match: https://github.com/NVIDIA/waveglow/blob/master/config.json#L21. I've already done this enough times, with enough input feature types, to confirm that the difference is small and not worth the effort.
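In other words, the mel extraction settings used to train Tacotron 2 and the `data_config` of the WaveGlow it is paired with have to agree. A small illustrative check; the dicts below are stand-ins for values you would read out of `hparams.py` and `config.json`, not the real files:

```python
# Sketch: sanity-check that acoustic-model and vocoder mel settings match.
# The values below are illustrative placeholders, not read from the repo.
taco = {"n_mel_channels": 80, "mel_fmin": 95.0, "mel_fmax": 8000.0,
        "sampling_rate": 22050, "filter_length": 1024,
        "hop_length": 256, "win_length": 1024}

waveglow_data_config = dict(taco, mel_fmin=0.0)  # stale vocoder value

mismatches = {k: (taco[k], waveglow_data_config[k])
              for k in taco if taco[k] != waveglow_data_config[k]}
print(mismatches)  # -> {'mel_fmin': (95.0, 0.0)}
```

Any key that shows up in `mismatches` means the vocoder was trained on differently scaled mel features and needs retraining before the pair will sound right.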

george-roussos commented 4 years ago

Thanks a lot for trying it; I suspected it wouldn't make any difference. The problem is that my speaker does a lot of vocal fry, and Taco2 then gets biased towards it whenever I add punctuation (both commas and full stops). It also looks like Taco2 is not very good at modelling glottalisation; it sounds noisy.

CookiePPP commented 4 years ago

@george-roussos How do the original audio files sound? Have you trained a vocoder on your dataset or just using pretrained models?

george-roussos commented 4 years ago

> @george-roussos How do the original audio files sound? Have you trained a vocoder on your dataset or just using pretrained models?

The original files are completely clean and consistent. There is no background noise.

I have tried training all the GAN vocoders (PWGAN, MB-MelGAN, FB-MelGAN) as well as WaveGrad. Now I am trying HiFi-GAN, and it seems to be more lenient with vocal fry, but I think it is a Taco2 issue.