Open george-roussos opened 4 years ago
It says here that 95 is better for female voices.
Can I change this value at inference time, or does it need to be set before training the model?
I think you need to set them before training.
Hi, I want to know how to test the pitch information on my own dataset. Could you please tell me how to approach this?
I've tested n_mel of 80, 160 and 320, fmin of 0, 100, 20 and a full 2400-channel STFT with iso226 weighted loss. I can say right now, the min frequency you pick will make almost zero difference.
Do you mean that the fmin parameter would not influence the mel features?
@QUTGXX If you change fmin, you will have to retrain the vocoder to match. https://github.com/NVIDIA/waveglow/blob/master/config.json#L21 I've already done this enough times with enough input feature types to confirm that the difference is small and not worth the effort.
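A minimal sketch of the kind of check that catches this mismatch before synthesis, assuming both models expose their audio settings as plain dictionaries. The key names `mel_fmin` and `mel_fmax` are assumed to match the linked WaveGlow config; the helper `mel_param_mismatches` and all values below are hypothetical and only for illustration.

```python
def mel_param_mismatches(acoustic_cfg, vocoder_cfg, keys):
    """Return {key: (acoustic_value, vocoder_value)} for every key that differs."""
    return {
        k: (acoustic_cfg.get(k), vocoder_cfg.get(k))
        for k in keys
        if acoustic_cfg.get(k) != vocoder_cfg.get(k)
    }


# Hypothetical settings: the acoustic model was retrained with fmin=95,
# while the vocoder still uses fmin=0.
ACOUSTIC_CFG = {"sampling_rate": 22050, "mel_fmin": 95.0, "mel_fmax": 8000.0}
VOCODER_CFG = {"sampling_rate": 22050, "mel_fmin": 0.0, "mel_fmax": 8000.0}
MEL_KEYS = ("sampling_rate", "mel_fmin", "mel_fmax")

if __name__ == "__main__":
    for key, (a, v) in mel_param_mismatches(ACOUSTIC_CFG, VOCODER_CFG, MEL_KEYS).items():
        print(f"{key}: acoustic model uses {a}, vocoder uses {v}")
```

Any key reported here means the vocoder would receive mel frames laid out differently from the ones it was trained on.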
Thanks a lot for trying it, I suspected it wouldn't make any difference. The problem is that my speaker does a lot of vocal fry, and Taco2 then gets biased towards it whenever I add punctuation (either a comma or a full stop). It also looks like Taco2 is not very good at modelling glottalisation; it sounds noisy.
@george-roussos How do the original audio files sound? Have you trained a vocoder on your dataset or just using pretrained models?
The original files are completely clean and consistent. There is no background noise.
I have tried to train all GANs (PWGAN, MBMelgan, FBMelgan) and WaveGrad. Now I am trying HiFiGAN and it seems to be more lenient with vocal fry, but I think it is a Taco2 issue.
Hi, I have seen this mentioned a couple of times, but never really discussed in depth. I notice that the TTS I train has trouble with the F0 of my female speaker (it glottalizes so much that it just sounds hoarse), and at higher frequencies it also sounds a bit funny. Are there special fmin and fmax values for female voices, or is this just a limitation of the TTS? I have always gone with fmin=0 and fmax=8000, but maybe that is wrong. What is the intuition behind using higher fmin values?
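One way to get a feel for what a higher fmin changes is to look at where the mel filter centers land. The sketch below assumes the HTK-style mel formula (libraries may default to a different variant), 80 mel bands, and 95 Hz as the candidate fmin; the function names are made up for this example. It counts how many band centers fall below a given frequency for each fmin setting.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_mels, fmin, fmax):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * k) for k in range(1, n_mels + 1)]


def bands_below(centers, f_hz):
    """Number of filters whose center lies below f_hz."""
    return sum(1 for c in centers if c < f_hz)


if __name__ == "__main__":
    for fmin in (0.0, 95.0):
        centers = mel_band_centers(80, fmin, 8000.0)
        print(f"fmin={fmin}: first center {centers[0]:.1f} Hz, "
              f"{bands_below(centers, 95.0)} band centers below 95 Hz")
```

Raising fmin moves the few lowest bands up into the range where the voice actually has energy; the remaining bands shift only slightly.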