Open eschmidbauer opened 8 months ago
You are right, and I'm sorry about the incomplete code. I have been very busy recently and will complete the code as soon as possible. To generate the npy files, I follow https://github.com/PlayVoice/lora-svc to generate the Whisper encoder output. In lora-svc, step 3 is to use 16 kHz audio to extract the PPG:
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
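For reference, a minimal sketch of what a script like lora-svc's prepare/preprocess_ppg.py roughly does: run a 16 kHz wav through the Whisper encoder and save the hidden states as an npy file. The function name `extract_ppg` and the exact pre/post-processing steps are assumptions here, not the actual lora-svc code; it requires the openai-whisper package.

```python
import numpy as np
import torch


def extract_ppg(wav_path, out_path, model_name="medium"):
    """Sketch: save the Whisper encoder output ("PPG") for one 16 kHz wav.

    Assumes openai-whisper is installed (pip install openai-whisper).
    lora-svc uses the medium checkpoint; the encoder width depends on
    which checkpoint you load.
    """
    import whisper  # imported lazily so the sketch is cheap to define

    model = whisper.load_model(model_name)
    audio = whisper.load_audio(wav_path)       # decoded/resampled to 16 kHz
    audio = whisper.pad_or_trim(audio)         # pad/cut to the 30 s window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    with torch.no_grad():
        # encoder output: [1, n_frames, n_audio_state]
        ppg = model.encoder(mel.unsqueeze(0))
    np.save(out_path, ppg.squeeze(0).cpu().numpy())
```

The real preprocessing script iterates over every wav under `data_svc/waves-16k/` and writes the corresponding npy under `data_svc/whisper/`.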
I was able to extract the PPG using that script, but now I am getting this error:
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/SpectralOps.cpp:879.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "X-E-Speech-code/inference-cross-lingual-emotional-VC.py", line 130, in <module>
tts_en(text, spk)
File "X-E-Speech-code/inference-cross-lingual-emotional-VC.py", line 96, in tts_en
audio, *_ = net_g.voice_conversion_new(x_tst, x_tst_lengths, mel=ref_mel, lang=torch.LongTensor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/models_whisper_hier_multi_pure.py", line 791, in voice_conversion_new
z_weo, m_q_weo, logs_q_weo, y_mask_weo = self.enc_whisper(weo, weo_lengths, g=lang)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/models_whisper_hier_multi_pure.py", line 436, in forward
x = self.pre(x) * x_mask
^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 310, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Given groups=1, weight of size [192, 1280, 1], expected input[1, 1024, 429] to have 1280 channels, but got 1024 channels instead
That is because the Whisper model in lora-svc is the medium version, but in my research I use the large-v2 version of Whisper. The channel dimension of medium (1024) and large-v2 (1280) is different.
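The mismatch can be reproduced in isolation: the model's input projection is built for large-v2-width features, so medium-width PPGs trigger exactly this RuntimeError. The widths below are the published Whisper encoder sizes; the Conv1d shape (1280 in, 192 out) matches the `[192, 1280, 1]` weight in the traceback, though the layer here is only a stand-in for the real `self.pre`.

```python
import torch
import torch.nn as nn

# Whisper encoder width (n_audio_state) per checkpoint size:
WHISPER_WIDTH = {"small": 768, "medium": 1024, "large-v2": 1280}

# A model trained on large-v2 PPGs has an input layer like Conv1d(1280, 192, 1):
pre = nn.Conv1d(WHISPER_WIDTH["large-v2"], 192, kernel_size=1)

# Feeding medium-extracted features (1024 channels) reproduces the error:
medium_ppg = torch.randn(1, WHISPER_WIDTH["medium"], 429)
try:
    pre(medium_ppg)
except RuntimeError as e:
    print("mismatch:", e)

# Features extracted with large-v2 go through cleanly:
large_ppg = torch.randn(1, WHISPER_WIDTH["large-v2"], 429)
print(pre(large_ppg).shape)  # torch.Size([1, 192, 429])
```

So the fix is to re-extract the PPGs with the same Whisper checkpoint the model was trained with (large-v2 here), not to change the model.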
That makes sense, thank you !
You can refer to this to download v2: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/__init__.py#L27
Hello, I want to ask: how many steps do you usually train before you can hear an intelligible voice?
Hello, thank you for sharing this code. I'm trying to figure out how to run VC, but there appears to be a reference to .npy files, and there is no code in the project to generate the .npy files. Can you share that step?