X-E-Speech / X-E-Speech-code

X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion
MIT License

VC code incomplete #2

Open eschmidbauer opened 6 months ago

eschmidbauer commented 6 months ago

Hello, thank you for sharing this code. I'm trying to figure out how to run VC, but the code references .npy files and nothing in the project generates them. Can you share that step?

X-E-Speech commented 6 months ago

You are right, and I'm very sorry for the incomplete code. I have been very busy recently, and I will complete the code as soon as possible. To generate the .npy files, I followed https://github.com/PlayVoice/lora-svc, which produces the Whisper encoder output. In lora-svc, the step is:

3, use 16K audio to extract ppg

python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

eschmidbauer commented 6 months ago

I was able to extract PPG using that script, but now I am getting this error:

Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/SpectralOps.cpp:879.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "X-E-Speech-code/inference-cross-lingual-emotional-VC.py", line 130, in <module>
    tts_en(text, spk)
  File "X-E-Speech-code/inference-cross-lingual-emotional-VC.py", line 96, in tts_en
    audio, *_ = net_g.voice_conversion_new(x_tst, x_tst_lengths, mel=ref_mel, lang=torch.LongTensor(
  File "X-E-Speech-code/models_whisper_hier_multi_pure.py", line 791, in voice_conversion_new
    z_weo, m_q_weo, logs_q_weo, y_mask_weo = self.enc_whisper(weo, weo_lengths, g=lang)
  File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "X-E-Speech-code/models_whisper_hier_multi_pure.py", line 436, in forward
    x = self.pre(x) * x_mask
  File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "X-E-Speech-code/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 1280, 1], expected input[1, 1024, 429] to have 1280 channels, but got 1024 channels instead

X-E-Speech commented 6 months ago

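That is because the Whisper model in lora-svc is the medium version, while in my research I use the large-v2 version. The two have different channel counts: the medium encoder outputs 1024 channels and large-v2 outputs 1280, which matches the 1024 vs. 1280 in the error message.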
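
A mismatch like this can be caught before running inference by comparing the channel count of the saved features with the encoder width of the intended Whisper size. The following is a minimal sketch, not part of X-E-Speech-code: the helper name `check_feature_width` and the lookup table are illustrative, and it assumes you pass in the channel dimension of the loaded .npy array yourself (for example the last axis of its shape).

```python
# Encoder hidden size (feature channels) of the released Whisper model sizes.
WHISPER_ENCODER_WIDTH = {
    "tiny": 384,
    "base": 512,
    "small": 768,
    "medium": 1024,
    "large": 1280,  # large-v2 uses the same width
}


def check_feature_width(channels, expected_model="large"):
    """Return (ok, message) for the channel count of a saved feature array.

    Hypothetical helper: it only compares integers, so it can be used on
    the shape of a loaded .npy file without loading any model.
    """
    expected = WHISPER_ENCODER_WIDTH[expected_model]
    if channels == expected:
        return True, f"ok: {channels} channels matches whisper {expected_model}"
    # Look up which model size, if any, produces this channel count.
    source = next(
        (name for name, width in WHISPER_ENCODER_WIDTH.items() if width == channels),
        "unknown",
    )
    return False, (
        f"expected {expected} channels (whisper {expected_model}), "
        f"got {channels} (looks like whisper {source})"
    )


if __name__ == "__main__":
    # Channel count reported in the traceback above.
    print(check_feature_width(1024))
```

With the value from the traceback, the check fails and names the medium model as the likely source of the features.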

eschmidbauer commented 6 months ago

That makes sense, thank you !

X-E-Speech commented 6 months ago

You can refer to this to download large-v2: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/__init__.py#L27
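
For reference, a rough sketch of extracting large-v2 encoder output with the stock `openai-whisper` package instead of the lora-svc script. This is an assumption-laden illustration, not the preprocessing used in X-E-Speech: the function names are made up, the (frames, 1280) layout of the saved array is a guess that may need transposing to match what the inference script loads, and `whisper.load_audio` needs ffmpeg installed.

```python
SAMPLES_PER_FRAME = 320   # 16 kHz audio, one encoder frame per 20 ms
MAX_FRAMES = 1500         # the encoder covers a fixed 30 s window


def num_encoder_frames(num_samples):
    """Encoder frames covering `num_samples` of 16 kHz audio (capped at 30 s)."""
    return min(num_samples // SAMPLES_PER_FRAME, MAX_FRAMES)


def extract_whisper_features(wav_path, out_path, device="cpu"):
    """Save large-v2 encoder output for one file as a (frames, 1280) .npy array.

    Hypothetical helper; audio beyond 30 s is dropped by pad_or_trim.
    """
    # Imported here so num_encoder_frames can be used without these packages.
    import numpy as np
    import torch
    import whisper  # pip install openai-whisper

    model = whisper.load_model("large-v2", device=device)
    audio = whisper.load_audio(wav_path)  # float32 mono, resampled to 16 kHz
    n_frames = num_encoder_frames(len(audio))

    # The stock encoder expects exactly 30 s of input: pad, encode, then
    # keep only the frames that correspond to real audio.
    mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio)).to(device)
    with torch.no_grad():
        features = model.encoder(mel.unsqueeze(0))  # (1, 1500, 1280)

    features = features[0, :n_frames].cpu().numpy()
    np.save(out_path, features)
    return features.shape
```

A call such as `extract_whisper_features("sample.wav", "sample.npy")` (both file names are placeholders) should return a shape whose last dimension is 1280, which is the channel count the model in the traceback expects.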