Wendison / VQMIVC

Official implementation of VQMIVC: One-shot (any-to-any) Voice Conversion @ Interspeech 2021 + Online playing demo!

Vocoder issue in the inference process #3

Closed TaoTaoFu closed 3 years ago

TaoTaoFu commented 3 years ago

Hi,

Thank you for sharing this work. I have run into the following issue during inference:

Traceback (most recent call last):
  File "convert.py", line 201, in <module>
    convert(config)
  File "convert.py", line 194, in convert
    subprocess.call(cmd)
  File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'parallel-wavegan-decode'

What can I do to solve this problem? I have already put the pre-trained vocoder in the vocoder directory:

(tts) [xxxx@training VQMIVC]$ ll vocoder/
total 4
lrwxrwxrwx 1 xxxx xxxx 53 Jun 25 10:50 checkpoint-3000000steps.pkl -> ../pretrain_model/vocoder/checkpoint-3000000steps.pkl
lrwxrwxrwx 1 xxxx xxxx 36 Jun 25 10:50 config.yml -> ../pretrain_model/vocoder/config.yml
-rw-r--r-- 1 xxxx xxxx 39 Jun 24 17:53 README.md
lrwxrwxrwx 1 xxxx xxxx 34 Jun 25 10:50 stats.h5 -> ../pretrain_model/vocoder/stats.h5

Wendison commented 3 years ago

Hi, did you install ParallelWaveGAN? Installing it should fix your issue; I will add this point to the README.
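
For reference, 'parallel-wavegan-decode' is a console script installed along with the ParallelWaveGAN package, so the PermissionError above typically means the executable is not available on PATH. A minimal sketch for checking this from Python; the pip package name is an assumption to verify against the ParallelWaveGAN README:

# Check whether the ParallelWaveGAN command-line tool is installed and on PATH.
# If this prints None, installing the package should fix the PermissionError
# (e.g. `pip install parallel_wavegan`; confirm the name in the ParallelWaveGAN README).
import shutil

print(shutil.which("parallel-wavegan-decode"))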

TaoTaoFu commented 3 years ago

Hi, did you install ParallelWaveGAN? Installing it should fix your issue; I will add this point to the README.

OK, thanks, I will try it.

TaoTaoFu commented 3 years ago

As you said, I can run it now and the voice conversion works.

But there is another problem with the synthesis results:

The vocoder resynthesizes the source wav (p225_022.wav) into src_gen (p225_038_ref_gen.wav), and the result sounds bad.

The other wavs resynthesized by the vocoder model sound as bad as p225_038_ref_gen.wav.

I wonder whether some step in my procedure is wrong.

Have you encountered the same problem?

[Spectrogram screenshots attached for: p225_022_gen.wav, p225_022.wav, p225_022_src_gen.wav, p225_022_ref_gen.wav]

Wendison commented 3 years ago

Do you mean you feed the original mel-spectrograms into ParallelWaveGAN to generate the waveform? Have you normalized the mel-spectrograms (mean-variance normalization) before feeding them?
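
For readers following the thread: mean-variance normalization here means standardizing each mel bin with the statistics the vocoder was trained on. A minimal sketch of the idea, with illustrative names not taken from the repo:

import numpy as np

def normalize_logmel(logmel: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # logmel: (frames, n_mels) log-mel spectrogram; mean/std: per-mel-bin
    # statistics that must match the ones the vocoder was trained with,
    # otherwise synthesis quality degrades badly (as seen in this thread).
    return (logmel - mean) / (std + 1e-8)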

TaoTaoFu commented 3 years ago

Yes. I just ran your convert.py to get all the results. I see the normalization process in the extract_logmel method. Didn't you encounter the same problem when running convert.py?

Wendison commented 3 years ago

I haven't encountered this issue before. Would you mind uploading some audio samples for listening?

TaoTaoFu commented 3 years ago

Of course.
demo.zip

TaoTaoFu commented 3 years ago

I think this issue is interesting. Can you upload some audio samples produced with the released convert.py and pre-trained model? I would love to know how our results differ.

Wendison commented 3 years ago

I'm afraid the problem lies in the mean/variance statistics used for the mel-spectrograms. I can reproduce your results by using inaccurate mean/variance values. I have uploaded my mean/variance in 'mel_stats', along with an example script for performing VC, i.e., convert_example.py; you can give it a try. My converted results are attached FYI. converted.zip

TaoTaoFu commented 3 years ago

Great discovery! Your converted results are good. I'll try converting with the 'mel_stats' you provided. One last question: how did you generate this 'mel_stats'? Is it the same file as the mel_stats.npy extracted by preprocess.py?

sauravpd29 commented 3 years ago

Hi, thank you for sharing this work. I tried your convert_example.py with your pre-trained model, and the audio inside the test folder gets converted. But when I pass a different wav file, I get the following error saying the buffer has the wrong number of dimensions. Please help.

PS C:\Users\Saurav\Desktop\cap\VQMIVC> python convert_example.py -s test_wavs/aayush.wav -r test_wavs/didi.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt
Traceback (most recent call last):
  File "convert_example.py", line 121, in <module>
    convert(args)
  File "convert_example.py", line 92, in convert
    ref_mel, _ = extract_logmel(ref_wav_path, mean, std)
  File "convert_example.py", line 49, in extract_logmel
    f0, timeaxis = pw.dio(wav.astype('float64'), fs, frame_period=frame_period)
  File "pyworld/pyworld.pyx", line 93, in pyworld.pyworld.dio
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Wendison commented 3 years ago

Great discovery! Your converted results are good. I'll try converting with the 'mel_stats' you provided. One last question: how did you generate this 'mel_stats'? Is it the same file as the mel_stats.npy extracted by preprocess.py?

No, the 'mel_stats' is the one used to train Parallel WaveGAN; it is the same as the 'stats.h5' inside the 'vocoder' directory of the pre-trained models. Besides, I experimented several times using the 'mel_stats' produced by preprocess.py to generate wavs, and it also worked well for me. Maybe you can compare your 'mel_stats' with the provided 'mel_stats' to see whether your statistics are correct.
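
One way to run that comparison is sketched below. It assumes the ParallelWaveGAN stats.h5 stores 'mean' and 'scale' datasets and that the .npy file stacks [mean, std]; verify both assumptions against your own files:

import h5py
import numpy as np

# Vocoder-side statistics (assumed ParallelWaveGAN convention: 'mean'/'scale').
with h5py.File("vocoder/stats.h5", "r") as f:
    voc_mean = f["mean"][()]
    voc_scale = f["scale"][()]

# Preprocessing-side statistics (assumed layout: row 0 = mean, row 1 = std).
mel_stats = np.load("mel_stats.npy")  # hypothetical path
my_mean, my_std = mel_stats[0], mel_stats[1]

# Large differences here would explain the distorted vocoder output.
print("max |mean difference|:", np.max(np.abs(voc_mean - my_mean)))
print("max |std difference|:", np.max(np.abs(voc_scale - my_std)))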

Wendison commented 3 years ago

Hi, thank you for sharing this work. I tried your convert_example.py with your pre-trained model, and the audio inside the test folder gets converted. But when I pass a different wav file, I get the following error saying the buffer has the wrong number of dimensions. Please help.

PS C:\Users\Saurav\Desktop\cap\VQMIVC> python convert_example.py -s test_wavs/aayush.wav -r test_wavs/didi.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt
Traceback (most recent call last):
  File "convert_example.py", line 121, in <module>
    convert(args)
  File "convert_example.py", line 92, in convert
    ref_mel, _ = extract_logmel(ref_wav_path, mean, std)
  File "convert_example.py", line 49, in extract_logmel
    f0, timeaxis = pw.dio(wav.astype('float64'), fs, frame_period=frame_period)
  File "pyworld/pyworld.pyx", line 93, in pyworld.pyworld.dio
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

It seems your 'wav' has two channels. 'pw.dio' can only process single-channel data, so using single-channel data as the input of 'pw.dio' should fix your issue.
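
A minimal sketch of that fix, assuming soundfile is used for loading and channel averaging is an acceptable downmix; the frame_period value is illustrative and should match the repo's setting:

import numpy as np
import pyworld as pw
import soundfile as sf

wav, fs = sf.read("test_wavs/aayush.wav")
if wav.ndim > 1:
    # Stereo files load as (samples, channels); average the channels into a
    # single mono track, since pw.dio expects a 1-D array.
    wav = wav.mean(axis=1)

# pw.dio requires float64 input; frame_period is in milliseconds.
f0, timeaxis = pw.dio(wav.astype("float64"), fs, frame_period=10)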

TaoTaoFu commented 3 years ago

Great discovery! Your converted results are good. I'll try converting with the 'mel_stats' you provided. One last question: how did you generate this 'mel_stats'? Is it the same file as the mel_stats.npy extracted by preprocess.py?

No, the 'mel_stats' is the one used to train Parallel WaveGAN; it is the same as the 'stats.h5' inside the 'vocoder' directory of the pre-trained models. Besides, I experimented several times using the 'mel_stats' produced by preprocess.py to generate wavs, and it also worked well for me. Maybe you can compare your 'mel_stats' with the provided 'mel_stats' to see whether your statistics are correct.

Excellent!!! You are right. The key to the problem is the mean and scale. The reason for the difference between our results is that the mel_stats.npy I used is inconsistent with the one used to train your vocoder: I used a mel_stats.npy generated from the VCTK data I processed myself. Thank you very much for your guidance!