OlaWod / PitchVC

PitchVC: Pitch Conditioned Any-to-Many Voice Conversion
MIT License

Can we run inference on an unseen target wav? #2

Open Daoud-babaammi opened 1 month ago

Daoud-babaammi commented 1 month ago

Please, I have a question: can we run inference on unseen target audios? I see that you have to convert them into npy, but is that all?

OlaWod commented 1 month ago

no.

spk = self.embed_spk(spk_id) + spk_emb
zhenhaoge commented 1 month ago

Is it possible to improve the code so it can run inference on an unseen target wav later? Since FreeVC can run inference on an unseen target, I hope this version also can.

OlaWod commented 1 month ago

> Is it possible to improve the code so it can run inference on an unseen target wav later? Since FreeVC can run inference on an unseen target, I hope this version also can.

To train a model that supports unseen target wavs, just remove the self.embed_spk(spk_id)-related parts (spk_id in the dataloader, self.embed_spk in the model, etc.) and only use the spk_emb extracted by the speaker encoder to represent a speaker. During inference, extract spk_emb from the unseen target wav with the speaker encoder, and calculate the mean pitch value of the unseen target wav as f0_mean_tgt.
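The conditioning change described above can be sketched roughly as follows. This is a hypothetical illustration, not the repo's actual module: `embed_spk` stands in for the learned speaker-ID lookup table, and the functions mimic the quoted line `spk = self.embed_spk(spk_id) + spk_emb`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical seen-speaker table learned during training: one row per spk_id.
embed_spk = rng.normal(size=(4, 8))  # 4 seen speakers, 8-dim embeddings


def speaker_cond_seen(spk_id: int, spk_emb: np.ndarray) -> np.ndarray:
    """Original conditioning: sums a learned ID embedding with the
    speaker-encoder embedding, so spk_id must be a seen speaker."""
    return embed_spk[spk_id] + spk_emb


def speaker_cond_any(spk_emb: np.ndarray) -> np.ndarray:
    """Modified conditioning: the ID lookup is removed, so any
    speaker-encoder embedding (including an unseen speaker's) works."""
    return spk_emb
```

With the modified version, the dataloader no longer needs to yield spk_id at all, which is why retraining is required: the released checkpoint was trained with the lookup table in place.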

But I'm too lazy to train a new model.

There's one way to mimic unseen target wav inference with the provided checkpoint:

  1. extract spk_emb from unseen target wav with speaker encoder
  2. calculate cosine distances between this unseen spk_emb and all spk_embs in the dataset
  3. choose the spk_emb with the smallest cosine distance, and choose the corresponding spk_id
  4. calculate mean pitch value from unseen target wav as f0_mean_tgt
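The four steps above can be sketched as follows. This is a hedged illustration: `spk_embs` stands in for the precomputed per-speaker embeddings (the .npy files), and the speaker-encoder call itself is assumed to happen elsewhere.

```python
import numpy as np


def nearest_speaker(tgt_emb: np.ndarray, spk_embs: dict) -> str:
    """Steps 2-3: return the seen spk_id whose embedding has the
    smallest cosine distance to the unseen target's embedding."""
    best_id, best_dist = None, np.inf
    for spk_id, emb in spk_embs.items():
        cos = np.dot(tgt_emb, emb) / (np.linalg.norm(tgt_emb) * np.linalg.norm(emb))
        dist = 1.0 - cos
        if dist < best_dist:
            best_id, best_dist = spk_id, dist
    return best_id


def mean_f0(f0: np.ndarray) -> float:
    """Step 4: mean pitch over voiced frames only (f0 > 0),
    to use as f0_mean_tgt."""
    voiced = f0[f0 > 0]
    return float(voiced.mean()) if voiced.size else 0.0
```

The chosen spk_id and spk_emb then go into the normal conversion path, with f0_mean_tgt taken from the unseen wav rather than from the matched seen speaker.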
zhenhaoge commented 1 month ago

> Is it possible to improve the code so it can run inference on an unseen target wav later? Since FreeVC can run inference on an unseen target, I hope this version also can.

> To train a model that supports unseen target wavs, just remove the self.embed_spk(spk_id)-related parts (spk_id in the dataloader, self.embed_spk in the model, etc.) and only use the spk_emb extracted by the speaker encoder to represent a speaker. During inference, extract spk_emb from the unseen target wav with the speaker encoder, and calculate the mean pitch value of the unseen target wav as f0_mean_tgt.
>
> But I'm too lazy to train a new model.
>
> There's one way to mimic unseen target wav inference with the provided checkpoint:
>
>   1. extract spk_emb from unseen target wav with speaker encoder
>   2. calculate cosine distances between this unseen spk_emb and all spk_embs in the dataset
>   3. choose the spk_emb with the smallest cosine distance, and choose the corresponding spk_id
>   4. calculate mean pitch value from unseen target wav as f0_mean_tgt

Thank you, I will try the 2nd option first. Later if I have time, I will do the retraining. By the way, I am working on cross-lingual voice conversion, and your FreeVC and PitchVC work very well. Thank you!

leminhnguyen commented 1 month ago

@zhenhaoge What language did you train on?

zhenhaoge commented 1 month ago

> @zhenhaoge What language did you train on?

@leminhnguyen In my case, the source speaker speaks Ukrainian, and the target speaker speaks English, i.e., enabling the English speaker to speak Ukrainian. So if I need to retrain, I need to train it on Ukrainian speech.

leminhnguyen commented 1 month ago

When using the pretrained model cross-lingually, with an English speaker as the target and a Vietnamese speaker as the source, the results are good. But when I train from scratch or finetune on Vietnamese from the pretrained model, then try voice conversion with both source and target in Vietnamese, the output audio is strange and not as good as the results for English. It sounds like a different speaker, and I think there is a problem with the pitch.

  1. @zhenhaoge can you train for Ukrainian and confirm that? If you have good results, let me know.
  2. @OlaWod do you have any suggestions for training on other languages?
zhenhaoge commented 1 month ago

@leminhnguyen Sure, I will first try the method using the current model as @OlaWod mentioned, later I may train for another language. I will keep you posted.

OlaWod commented 1 month ago

> When using the pretrained model cross-lingually, with an English speaker as the target and a Vietnamese speaker as the source, the results are good. But when I train from scratch or finetune on Vietnamese from the pretrained model, then try voice conversion with both source and target in Vietnamese, the output audio is strange and not as good as the results for English. It sounds like a different speaker, and I think there is a problem with the pitch.
>
>   1. @zhenhaoge can you train for Ukrainian and confirm that? If you have good results, let me know.
>   2. @OlaWod do you have any suggestions for training on other languages?

I think this should be language agnostic.

Check whether manually adjusting f0 here during inference makes it better?
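One common way to try such a manual f0 adjustment is to shift the source pitch toward the target's mean in the log domain. This is a generic heuristic sketch, not necessarily what the repo's inference script does; `shift_f0` is a hypothetical helper.

```python
import numpy as np


def shift_f0(f0_src: np.ndarray, f0_mean_tgt: float) -> np.ndarray:
    """Shift voiced source f0 in the log domain so its geometric mean
    matches f0_mean_tgt; unvoiced frames (f0 == 0) are left untouched."""
    f0 = f0_src.copy()
    voiced = f0 > 0
    if voiced.any() and f0_mean_tgt > 0:
        log_shift = np.log(f0_mean_tgt) - np.log(f0[voiced]).mean()
        f0[voiced] = np.exp(np.log(f0[voiced]) + log_shift)
    return f0
```

Scaling in the log domain preserves the relative pitch contour (intonation) of the source while moving its overall register to the target speaker's range, which is usually what sounds wrong when the pitch is mismatched.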

lareina-a commented 1 week ago

Hello, did you encounter this issue during inference?

```
Traceback (most recent call last):
  File "/root/autodl-tmp/PitchVC/convert_sp.py", line 168, in <module>
    wav, sr = librosa.load(src_wav, sr=16000)
  File "/root/miniconda3/envs/pitchvc/lib/python3.9/site-packages/librosa/core/audio.py", line 184, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "/root/miniconda3/envs/pitchvc/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/root/miniconda3/envs/pitchvc/lib/python3.9/site-packages/librosa/util/decorators.py", line 59, in __wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/pitchvc/lib/python3.9/site-packages/librosa/core/audio.py", line 240, in __audioread_load
    reader = audioread.audio_open(path)
  File "/root/miniconda3/envs/pitchvc/lib/python3.9/site-packages/audioread/__init__.py", line 127, in audio_open
    return BackendClass(path)
  File "/root/miniconda3/envs/pitchvc/lib/python3.9/site-packages/audioread/rawread.py", line 59, in __init__
    self._fh = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'src1.wav'
```