OlaWod / PitchVC

PitchVC: Pitch Conditioned Any-to-Many Voice Conversion

Is there a paper for this repo? #1

Open · leminhnguyen opened this issue 4 months ago

OlaWod commented 4 months ago

No.

OlaWod commented 4 months ago

> @OlaWod Did you compare the results with other models like FreeVC or KNN-VC?

  1. This one is any-to-many and supports converting to seen speakers.
  2. The similarity of this one is influenced by the accuracy of the pitch adjustment.
  3. The seen-speaker objective similarity (by resemblyzer; a sketch of this measurement follows after this list):

     | model \ src | vctk | libritts | esd (emotive, en) | esd (neutral, zh) |
     | --- | --- | --- | --- | --- |
     | knnvc (5 min matching set) | 82.50% | 86.53% | 82.95% | 82.95% |
     | freevc | 83.89% | 87.34% | 83.14% | 83.77% |
     | pitchvc | 84.96% | 87.96% | 83.14% | 83.29% |

     (Source wavs were sampled from the datasets in the header row, 500 wavs each, and converted to 12 seen speakers.)

  4. As for subjective similarity, I personally think this one is better in most cases.
  5. The naturalness is apparently better than freevc and knnvc.
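For illustration, a minimal sketch of how such a resemblyzer similarity score can be computed; the file paths are placeholders, not the actual evaluation script:

```python
# Minimal sketch: objective speaker similarity with resemblyzer.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def similarity(converted_path: str, reference_path: str) -> float:
    # Embed both utterances as d-vectors; resemblyzer embeddings are
    # L2-normalized, so the dot product is the cosine similarity.
    conv = encoder.embed_utterance(preprocess_wav(converted_path))
    ref = encoder.embed_utterance(preprocess_wav(reference_path))
    return float(np.dot(conv, ref))

# Placeholder pair; the table above averages this over 500 source wavs
# per dataset, converted to 12 seen target speakers.
print(f"{similarity('converted.wav', 'target_ref.wav'):.2%}")
```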
leminhnguyen commented 4 months ago

@OlaWod I see this model is better than KNN-VC, as well as Phoneme Hallucinator, in terms of intelligibility (WER). That is amazing!!!

  1. How did you do that? From my understanding, you built this model on top of FreeVC; can you describe the changes in detail?
  2. What do you think about the results for cross-lingual conversion?
OlaWod commented 4 months ago

> @OlaWod I see this model is better than KNN-VC, as well as Phoneme Hallucinator, in terms of intelligibility (WER). That is amazing!!!
>
>   1. How did you do that? From my understanding, you built this model on top of FreeVC; can you describe the changes in detail?
>   2. What do you think about the results for cross-lingual conversion?
  1. Model description: https://github.com/OlaWod/PitchVC/blob/main/Description.md
     1.1. The improvement mostly comes from replacing the naive HiFi-GAN-like vocoder with HiFTNet, a neural source-filter-augmented vocoder that takes pitch as an additional input.
     1.2. The VITS-like structure in FreeVC is not so good:
          1.2.1. Sampling the latent representation from a distribution harms audio quality a bit, so I changed the overall structure to a simple feed-forward one.
          1.2.2. With the same number of layers and layer types, a flow-based structure has better similarity than a simple feed-forward structure, so I just made the decoder bigger, with 20 layers.
     1.3. During inference, the pitch of the source wav needs to be adjusted to fit that of the target speaker (see the sketch below).
  2. I tested on Chinese source wavs with the model trained on VCTK (English); the naturalness, intelligibility (CER), and F0 contour consistency (F0PCC) are apparently good. As for similarity, I cannot tell whether it is good or not. I think it is acceptable, but my friends (native Chinese speakers) say the pair (a converted wav speaking Chinese and a reference wav of the target speaker speaking English) doesn't sound like the same person. I think maybe the accent affects people's judgement.
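A minimal sketch of the pitch adjustment idea in 1.3, matching the source F0 contour to the target speaker's log-F0 statistics; this illustrates the general mean/variance-matching approach, not the exact code in this repo, and `f0_src` plus the target statistics are assumed inputs:

```python
# Hedged sketch: shift a source F0 contour toward a target speaker's
# log-F0 statistics (mean/variance matching in log space).
import numpy as np

def adjust_f0(f0_src: np.ndarray, tgt_mean: float, tgt_std: float) -> np.ndarray:
    voiced = f0_src > 0                        # unvoiced frames stay at 0 Hz
    log_f0 = np.log(f0_src[voiced])
    src_mean, src_std = log_f0.mean(), log_f0.std()
    # Normalize to zero mean / unit variance, then rescale to the target.
    log_f0 = (log_f0 - src_mean) / (src_std + 1e-8) * tgt_std + tgt_mean
    out = np.zeros_like(f0_src)
    out[voiced] = np.exp(log_f0)
    return out
```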
leminhnguyen commented 4 months ago

Thank you, I'll try this model!!!

leminhnguyen commented 3 months ago

When training the model, I encountered this error:

Traceback (most recent call last):
  File "/home/lmnguyen/PitchVC/train.py", line 325, in <module>
    main()
  File "/home/lmnguyen/PitchVC/train.py", line 321, in main
    train(0, a, h)
  File "/home/lmnguyen/PitchVC/train.py", line 154, in train
    spec, phase = generator(x, mel, spk_emb, spk_id)
  File "/home/lmnguyen/miniconda3/envs/voice-conversion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lmnguyen/miniconda3/envs/voice-conversion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lmnguyen/PitchVC/models.py", line 484, in forward
    g = self.embed_spk(spk_id).transpose(1, 2)
  File "/home/lmnguyen/miniconda3/envs/voice-conversion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lmnguyen/miniconda3/envs/voice-conversion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lmnguyen/miniconda3/envs/voice-conversion/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/home/lmnguyen/miniconda3/envs/voice-conversion/lib/python3.9/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Any suggestions, @OlaWod?

leminhnguyen commented 3 months ago

Solved the problem!!! @OlaWod Could you share the pretrained do_* model to finetune?

OlaWod commented 3 months ago

> Solved the problem!!! @OlaWod Could you share the pretrained do_* model to finetune?

I am away these days; I will do it when I get back home.

OlaWod commented 3 months ago

> Solved the problem!!! @OlaWod Could you share the pretrained do_* model to finetune?

The corresponding do_0070000 in the exp/default dir was deleted. I put another checkpoint in the exp/test dir. They are not much different, just trained on different machines.

https://onedrive.live.com/?authkey=%21ABsAJ%2DBEtNBkves&id=537643E55991EE7B%21406447&cid=537643E55991EE7B
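For anyone resuming from it, a hedged sketch of loading such a checkpoint: in HiFi-GAN-style training scripts the g_* file holds the generator weights and the do_* file holds the discriminators plus both optimizer states, but the key names below ("optim_g", "optim_d", "steps", "epoch") are assumptions from that convention, not verified against this repo's train.py:

```python
# Inspect a do_* checkpoint before wiring it into the training loop.
import torch

state_do = torch.load("exp/test/do_0070000", map_location="cpu")
print(list(state_do.keys()))  # see what the checkpoint actually stores

# Assuming the models and optimizers are already constructed
# (key names are the usual HiFi-GAN convention, an assumption here):
# optim_g.load_state_dict(state_do["optim_g"])
# optim_d.load_state_dict(state_do["optim_d"])
# steps, epoch = state_do["steps"], state_do["epoch"]
```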

OlaWod commented 3 months ago

> When training the model, I encountered this error:
>
> `RuntimeError: CUDA error: device-side assert triggered`
>
> Any suggestions, @OlaWod?

I suppose it is because I hardcoded the number of speakers in nn.Embedding to 108 here; do you have more than 108 speakers in your data? (A minimal repro of this failure mode is sketched below.)
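A minimal sketch of that failure mode; the embedding sizes and the speaker id below are illustrative, not taken from the repo:

```python
# nn.Embedding asserts when it receives an index >= num_embeddings:
# an IndexError on CPU, but on CUDA it surfaces as the opaque
# "device-side assert triggered" seen in the traceback above.
import torch
import torch.nn as nn

embed_spk = nn.Embedding(num_embeddings=108, embedding_dim=256)
spk_id = torch.tensor([120])      # hypothetical id from a >108-speaker dataset
# embed_spk(spk_id)               # -> IndexError / device-side assert

embed_spk = nn.Embedding(num_embeddings=200, embedding_dim=256)  # resized
print(embed_spk(spk_id).shape)    # torch.Size([1, 256])
```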

leminhnguyen commented 3 months ago

@OlaWod You're correct, I have more speakers, so I changed the 108 to my number of speakers.