Open eschmidbauer opened 6 months ago
You are right, and I'm very sorry for the incomplete code. I have been very busy recently; I will complete the code as soon as possible. To generate the .npy files, I refer to https://github.com/PlayVoice/lora-svc to generate the Whisper encoder output. In lora-svc, the relevant step is:
3. Use 16 kHz audio to extract the PPG:
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
I was able to extract the PPG using that script, but now I am getting this error:
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/SpectralOps.cpp:879.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "X-E-Speech-code/inference-cross-lingual-emotional-VC.py", line 130, in <module>
tts_en(text, spk)
File "X-E-Speech-code/inference-cross-lingual-emotional-VC.py", line 96, in tts_en
audio, *_ = net_g.voice_conversion_new(x_tst, x_tst_lengths, mel=ref_mel, lang=torch.LongTensor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/models_whisper_hier_multi_pure.py", line 791, in voice_conversion_new
z_weo, m_q_weo, logs_q_weo, y_mask_weo = self.enc_whisper(weo, weo_lengths, g=lang)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/models_whisper_hier_multi_pure.py", line 436, in forward
x = self.pre(x) * x_mask
^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 310, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Given groups=1, weight of size [192, 1280, 1], expected input[1, 1024, 429] to have 1280 channels, but got 1024 channels instead
That's because the Whisper model used in lora-svc is the medium version, while in my research I use the large-v2 version of Whisper. The encoder channel dimensions of medium and large-v2 are different (1024 vs. 1280), which is exactly the mismatch the error reports.
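The shape mismatch in the traceback can be reproduced in isolation. This is a minimal sketch (the layer sizes are taken from the error message, not from the actual X-E-Speech code): the model's pre-net convolution expects 1280-channel large-v2 features, so 1024-channel medium features fail.

```python
import torch

# The pre-net conv from the error: weight of size [192, 1280, 1]
pre = torch.nn.Conv1d(in_channels=1280, out_channels=192, kernel_size=1)

# PPG extracted with Whisper *medium* has 1024 channels -> mismatch
x_medium = torch.randn(1, 1024, 429)
try:
    pre(x_medium)
    mismatch = False
except RuntimeError:
    # "expected input[1, 1024, 429] to have 1280 channels, but got 1024"
    mismatch = True

# PPG extracted with Whisper *large-v2* has 1280 channels -> works
x_large = torch.randn(1, 1280, 429)
out = pre(x_large)
print(out.shape)  # torch.Size([1, 192, 429])
```

So the fix is to re-extract the .npy features with large-v2 rather than medium.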
That makes sense, thank you!
You can refer to this to download v2: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/__init__.py#L27
Hello, thank you for sharing this code. I'm trying to figure out how to run VC, but there appears to be a reference to .npy files, and there is no code in the project to generate the .npy files. Can you share that step?