X-LANCE / UniCATS-CTX-vec2wav

[AAAI 2024] Code for CTX-vec2wav in UniCATS
https://cpdu.github.io/unicats/

Use vec2wav for Speech to Speech Voice conversion #9

Open rishikksh20 opened 9 months ago

rishikksh20 commented 9 months ago

Hi @cantabile-kwok,

I am curious whether you have tried this model for zero-shot voice conversion. The idea is very simple:

Source voice speech -> semantic token -> vec2wav (with target voice prompt) -> Target voice speech

We can easily compute semantic tokens from a pretrained HuBERT, VQ-wav2vec, or similar model.
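To make the pipeline above concrete, here is a minimal sketch, assuming the HuBERT-style recipe in which semantic tokens are the indices of the nearest k-means centroids to each frame of model features. The names `nearest_centroid_tokens`, `convert_voice`, `extract_features`, and `vocoder` are hypothetical and are not part of this repository's API; the feature extractor and the prompt-conditioned vocoder are passed in as plain callables so that the sketch does not assume any particular model interface.

```python
from typing import Any, Callable, List, Sequence

Frame = Sequence[float]


def nearest_centroid_tokens(
    features: Sequence[Frame], centroids: Sequence[Frame]
) -> List[int]:
    """Map each feature frame to the index of its closest centroid (squared L2)."""
    tokens = []
    for frame in features:
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


def convert_voice(
    source_wav: Any,
    prompt_wav: Any,
    extract_features: Callable[[Any], Sequence[Frame]],  # assumed: e.g. a pretrained HuBERT wrapper
    centroids: Sequence[Frame],                          # assumed: k-means codebook fit on such features
    vocoder: Callable[[List[int], Any], Any],            # assumed: a prompt-conditioned vocoder wrapper
) -> Any:
    """Source speech -> semantic tokens -> vocoder conditioned on the target prompt."""
    features = extract_features(source_wav)
    tokens = nearest_centroid_tokens(features, centroids)
    return vocoder(tokens, prompt_wav)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model weights.
    toy_centroids = [[0.0, 0.0], [1.0, 1.0]]
    toy_features = [[0.1, 0.0], [0.9, 1.2], [0.4, 0.4]]
    result = convert_voice(
        "source.wav",
        "target_prompt.wav",
        extract_features=lambda wav: toy_features,
        centroids=toy_centroids,
        vocoder=lambda tokens, prompt: (tokens, prompt),
    )
    print(result)
```

The only speaker-dependent input is `prompt_wav`; the tokens are meant to carry the content of the source speech, which is why any speaker information left in them matters for conversion quality.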

cantabile-kwok commented 9 months ago

Yes, this is straightforward, and the results seem decent. With HuBERT tokens, about 2-3 seconds of prompt speech is enough to produce output that sounds quite similar to the prompt speaker. However, some issues remain before this method can be used in real-life cases. As far as I know, VQ-wav2vec tokens still carry some speaker information, and HuBERT reconstructs short speech segments badly.

rishikksh20 commented 8 months ago

@cantabile-kwok The SEF-VC architecture (https://arxiv.org/abs/2312.08676) is the same as CTX-vec2wav.