auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

How to use the new hifi-gan model #96

Closed antovespoli3 closed 2 years ago

antovespoli3 commented 2 years ago

I have tried to use the new hifi-gan model that you recently added to the repository, but I can't tell whether I am doing it right because I can't produce any meaningful sound. What I am doing is the following:

  1. I reshape the generated mel-spectrogram to (1, num_mels, frames) and save it as a .npy file in the appropriate directory
  2. I execute the command python inference_e2e.py --checkpoint_file ./g_03295000, where the model is the one downloaded from your Google Drive link
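The reshaping in step 1 can be sketched as follows. This is a hypothetical illustration with placeholder data, assuming the converted spectrogram comes out of AutoVC as (frames, num_mels) and that inference_e2e.py reads .npy files from its input-mels directory (named test_mel_files in the public HiFi-GAN repository):

```python
import os
import numpy as np

# Placeholder for a converted AutoVC spectrogram, shaped (frames, num_mels).
mel = np.random.rand(128, 80).astype(np.float32)

# Step 1: transpose to (num_mels, frames) and add a batch axis,
# giving (1, num_mels, frames).
mel_for_hifigan = mel.T[np.newaxis, :, :]
assert mel_for_hifigan.shape == (1, 80, 128)

# Step 2: save as .npy where inference_e2e.py will look for inputs
# (directory name is an assumption based on the public HiFi-GAN repo).
os.makedirs("test_mel_files", exist_ok=True)
np.save(os.path.join("test_mel_files", "sample.npy"), mel_for_hifigan)
```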

By doing so, I get a very high-pitched sound.

I also tried reshaping it to (frames, num_mels, 1) instead of (1, num_mels, frames), and I get a silent file.

Am I doing something wrong?

P.S. I am using config_v1.json as the configuration file for the model.

auspicious3000 commented 2 years ago

Looks like the checkpoint file is problematic. I will update the checkpoint file in a few days. Sorry for the inconvenience.

antovespoli3 commented 2 years ago

Sounds great, thanks.

auspicious3000 commented 2 years ago

Checkpoint updated.

Irislucent commented 2 years ago

Hi author, using the provided HiFi-GAN checkpoint to run inference on mel-spectrograms extracted with AutoVC's make_spect.py, I get very quiet audio. What I'm not sure about is what the config.json for that checkpoint looks like. I noticed some small differences in how the mel-spectrograms are calculated that could be causing the issue: AutoVC introduced fmax and fmin (as high as 90 Hz) into the mel filterbanks, while the original HiFi-GAN didn't use these parameters. So I wonder what config.json was used to train the vocoder checkpoint. Thanks!
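The fmin/fmax mismatch can be seen directly in the filterbank band edges. A minimal sketch, assuming AutoVC's make_spect.py settings (fmin=90, fmax=7600, 80 mels) versus the stock HiFi-GAN config_v1.json (fmin=0, fmax=8000), and using the HTK mel formula purely for illustration (librosa's default is the Slaney scale, so exact values differ):

```python
import numpy as np

def mel_band_edges(n_mels, fmin, fmax):
    """Edge frequencies (Hz) of a triangular mel filterbank, HTK mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels triangles need n_mels + 2 edge points, equally spaced in mel.
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    return mel_to_hz(mels)

# AutoVC-style filterbank (fmin=90 Hz) vs. stock HiFi-GAN (fmin=0 Hz).
autovc_edges = mel_band_edges(80, 90.0, 7600.0)
hifigan_edges = mel_band_edges(80, 0.0, 8000.0)

# The lowest bands cover different frequency ranges, so a vocoder trained
# on one filterbank will misinterpret mels computed with the other.
print(autovc_edges[:3])
print(hifigan_edges[:3])
```

Because every mel bin ends up centered on a different frequency, feeding AutoVC-style mels to a vocoder trained on the stock filterbank can plausibly produce pitch or loudness artifacts like the ones described in this thread.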