Open Daoud-babaammi opened 1 month ago
no.
spk = self.embed_spk(spk_id) + spk_emb
Is it possible to improve the code to support inference on an unseen target wav later? Since FreeVC can run inference on an unseen target, I hope this version can too.
To train a model that supports an unseen target wav, just remove the `self.embed_spk(spk_id)` related parts (`spk_id` in the dataloader, `self.embed_spk` in the model, etc.), and only use the `spk_emb` extracted by the speaker encoder to represent a speaker. During inference, extract `spk_emb` from the unseen target wav with the speaker encoder, and calculate the mean pitch value of the unseen target wav as `f0_mean_tgt`.
But I'm too lazy to train a new model.
There's one way to mimic unseen target wav inference with the provided checkpoint:
- extract `spk_emb` from the unseen target wav with the speaker encoder
- calculate the cosine distances between this unseen `spk_emb` and all `spk_emb`s in the dataset
- choose the `spk_emb` with the smallest cosine distance, and take the corresponding `spk_id`
- calculate the mean pitch value of the unseen target wav as `f0_mean_tgt`
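The nearest-speaker workaround can be sketched roughly as follows. This is a minimal illustration in plain Python, not code from the repository: `cosine_distance`, `nearest_spk_id`, `mean_voiced_f0`, and the `dataset_embs` dictionary (mapping each training `spk_id` to its `spk_emb`) are hypothetical names, the embeddings are assumed to be non-zero vectors, and taking the mean over voiced frames only (with unvoiced frames stored as 0) is an assumption about how "mean pitch value" is computed.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_spk_id(unseen_emb, dataset_embs):
    """Pick the training speaker whose spk_emb is closest to the unseen one.

    dataset_embs is a hypothetical dict: spk_id -> spk_emb (list of floats).
    """
    return min(
        dataset_embs,
        key=lambda spk_id: cosine_distance(unseen_emb, dataset_embs[spk_id]),
    )


def mean_voiced_f0(f0):
    """Mean pitch over voiced frames; assumes unvoiced frames are stored as 0."""
    voiced = [v for v in f0 if v > 0]
    return sum(voiced) / len(voiced) if voiced else 0.0


# Toy data standing in for real speaker-encoder outputs and an F0 track.
dataset_embs = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.6, 0.8, 0.0]}
unseen_emb = [0.1, 0.9, 0.1]

spk_id = nearest_spk_id(unseen_emb, dataset_embs)
f0_mean_tgt = mean_voiced_f0([0.0, 200.0, 220.0, 0.0, 210.0])
```

With real data, `unseen_emb` would come from the speaker encoder and the F0 track from whatever pitch extractor the pipeline already uses; the selected `spk_id` and `f0_mean_tgt` are then passed to the provided checkpoint in place of a training speaker's values.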
Thank you, I will try the 2nd option first. Later if I have time, I will do the retraining. By the way, I am working on cross-lingual voice conversion, and your FreeVC and PitchVC work very well. Thank you!
@zhenhaoge What language did you train on?
@leminhnguyen In my case, the source speaker speaks Ukrainian and the target speaker speaks English, i.e., enabling the English speaker to speak Ukrainian. So if I need to retrain, I need to train on Ukrainian speech.
When using the pretrained model cross-lingually, with an English speaker as the target and a Vietnamese speaker as the source, the results are good. But when I train from scratch or fine-tune the pretrained model on Vietnamese, and then run voice conversion with both source and target in Vietnamese, the output audio sounds strange and is not as good as the English results. It sounds like a different speaker, and I think the problem is with the pitch.
- @zhenhaoge can you train for Ukrainian and confirm that? If you get good results, let me know.
- @OlaWod do you have any suggestions for training on other languages?
@leminhnguyen Sure, I will first try the method using the current model, as @OlaWod mentioned; later I may train for another language. I will keep you posted.
I think this should be language-agnostic.
Can you check whether manually adjusting f0 here at inference time makes it better?
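The thread does not spell out what "manually adjust f0" means in the conversion script. One experiment that fits the suggestion is to rescale a pitch contour so that its voiced mean matches a chosen target mean, then listen for whether the output improves. The sketch below is an assumption-laden illustration, not the repository's method: `scale_f0_to_mean` is a hypothetical helper, and unvoiced frames are assumed to be stored as 0.

```python
def scale_f0_to_mean(f0, target_mean):
    """Rescale voiced frames so their mean equals target_mean.

    Hypothetical helper: unvoiced frames (assumed to be 0) are left at 0.
    Returns an unchanged copy if there are no voiced frames or the target
    mean is not positive.
    """
    voiced = [v for v in f0 if v > 0]
    if not voiced or target_mean <= 0:
        return list(f0)
    ratio = target_mean / (sum(voiced) / len(voiced))
    return [v * ratio if v > 0 else 0.0 for v in f0]


# Toy contour with a voiced mean of 110 Hz, moved to a 220 Hz target mean.
f0_src = [0.0, 100.0, 120.0, 110.0, 0.0]
f0_adjusted = scale_f0_to_mean(f0_src, 220.0)
```

Trying a few target means around the measured `f0_mean_tgt` is a quick way to tell whether the Vietnamese-to-Vietnamese artifacts are really pitch-related.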
Hello, have you encountered this problem during inference?
Traceback (most recent call last):
File "/root/autodl-tmp/PitchVC/convert_sp.py", line 168, in
Please, I have a question: can we run inference on unseen target audio? I see that you have to convert it to npy, but is that all?