fishaudio / fish-diffusion

An easy to understand TTS / SVS / SVC framework
https://diff.fish.audio
MIT License

One Shot Conversion Possible? #80

Closed · chigkim closed this issue 1 year ago

chigkim commented 1 year ago

Is there a way to use fish-diffusion to train a model for one-shot, any-to-any SVC? You give one source sample and one target sample, the model extracts the content (pitch, contour, syllables) from the source, and it uses the voice timbre from the target sample to generate a new sample. Basically, the source singing voice needs to be converted as if it were sung by the target singer while keeping the content unchanged. It's like the Singing Voice Conversion Challenge 2023, but any-to-any and one-shot. http://www.vc-challenge.org/ If not, does anyone know of such a model?
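For concreteness, here is a minimal sketch of the pipeline being asked for: content and pitch come from the source clip, and a single reference clip of the target singer supplies the timbre. This is not fish-diffusion's API; every module here (ContentEncoder, SpeakerEncoder, Decoder) is a hypothetical placeholder with random weights, shown only to fix the data flow.

```python
# Hypothetical one-shot SVC pipeline sketch -- NOT the fish-diffusion API.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Extracts speaker-independent content features from the source."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)

    def forward(self, mel):            # mel: (batch, n_mels, frames)
        return self.net(mel)           # (batch, dim, frames)

class SpeakerEncoder(nn.Module):
    """Pools a reference clip into one fixed-size timbre embedding."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)

    def forward(self, ref_mel):        # ref_mel: (batch, n_mels, frames)
        return self.net(ref_mel).mean(dim=-1)   # time-pooled: (batch, dim)

class Decoder(nn.Module):
    """Re-synthesises a mel spectrogram from content + pitch + timbre."""
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.net = nn.Conv1d(dim + 1, n_mels, kernel_size=5, padding=2)

    def forward(self, content, f0, spk):
        # Broadcast the single timbre vector over every content frame.
        spk = spk.unsqueeze(-1).expand(-1, -1, content.shape[-1])
        x = torch.cat([content + spk, f0.unsqueeze(1)], dim=1)
        return self.net(x)

# One-shot conversion: neither voice was seen during training.
src_mel = torch.randn(1, 80, 400)   # source singing clip (mel spectrogram)
ref_mel = torch.randn(1, 80, 200)   # a few seconds of the target singer
f0 = torch.rand(1, 400)             # pitch track extracted from the source

content = ContentEncoder()(src_mel)
timbre = SpeakerEncoder()(ref_mel)
converted = Decoder()(content, f0, timbre)   # (1, 80, 400)
```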

LordElf commented 1 year ago

You mean cross-domain conversion? The current version only supports in-domain conversion. However, I think it's possible to build cross-domain conversion on top of fish-diffusion. We will add it to the todo list. Thanks!

chigkim commented 1 year ago

I meant one-shot conversion, where you just provide a source sample and a short target sample and convert without any training. There are models that can do one-shot speech conversion, so I was wondering if this could be applied to singing voice conversion as well. Having said that, cross-domain (speech and singing) would also be great!

LordElf commented 1 year ago

I think cross-domain conversion is exactly what you called one-shot speech conversion :) And yes, theoretically it's possible to do that based on fish. Will let you know when we add support for that.

chigkim commented 1 year ago

I'm not an expert, but I think one-shot voice conversion and cross-domain conversion are different things.

Basically, cross-domain conversion converts between different domains, like speech vs. singing, whereas one-shot conversion builds the model in such a way that it can handle voices (both source and target) that it has not seen during training.

"Recently, voice conversion (VC) without parallel data has been successfully adapted to multi-target scenario in which a single model is trained to convert the input voice to many different speakers. However, such model suffers from the limitation that it can only convert the voice to the speakers in the training data, which narrows down the applicable scenario of VC. In this paper, we proposed a novel one-shot VC approach which is able to perform VC by only an example utterance from source and target speaker respectively, and the source and target speaker do not even need to be seen during training." https://arxiv.org/abs/1904.05742

"One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that still remains challenging." https://arxiv.org/abs/2212.14227

leng-yue commented 1 year ago
[image: diffusion fine-tuning results on cross-domain conversion]

As the image above shows, diffusion fine-tuning doesn't work very well on cross-domain data.