Support for averaging speaker embeddings in the interface?

effusiveperiscope / so-vits-svc

Support for averaging speaker embeddings in the interface? #19

Closed by RAYTRAC3R 1 year ago

RAYTRAC3R commented 1 year ago

If you have a multispeaker model, it's possible to mix two different voices by averaging their speaker embeddings, using torch.lerp() on the variable g within models.infer. I've successfully done this, but it requires some fiddling with the code. Would it be possible to add this as an option to the interface somehow? I'd do it myself, but I'm not sure how to go about it. It could be useful if you split a speaker into multiple slots in a multispeaker model based on certain vocal traits (e.g. emotions, tones, or characters with the same VA) and want to mix them freely.
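For readers unfamiliar with the idea, here is a minimal sketch of interpolating two speaker embeddings with torch.lerp(). It is not the repository's code: the embedding table `emb_g`, its size, and the helper `mixed_speaker_embedding` are hypothetical stand-ins, and it assumes the conditioning vector g comes from an embedding lookup on the speaker id.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a multispeaker model's speaker embedding table:
# 4 speakers, 8-dimensional embeddings (real models use larger sizes).
emb_g = nn.Embedding(num_embeddings=4, embedding_dim=8)


def mixed_speaker_embedding(emb, sid_a, sid_b, weight):
    """Blend two speaker embeddings.

    weight=0.0 returns speaker sid_a, weight=1.0 returns speaker sid_b,
    and values in between give a linear mix of the two.
    """
    g_a = emb(torch.tensor([sid_a]))  # shape: (1, embedding_dim)
    g_b = emb(torch.tensor([sid_b]))
    # torch.lerp(start, end, weight) computes start + weight * (end - start)
    return torch.lerp(g_a, g_b, weight)


with torch.no_grad():
    # An even 50/50 mix of speakers 0 and 1.
    g = mixed_speaker_embedding(emb_g, sid_a=0, sid_b=1, weight=0.5)
```

In an actual model, the blended g would presumably replace the single-speaker g before it is passed on to the rest of inference; where exactly that happens depends on the model code.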

mya2152 commented 1 year ago

Isn't the current version doing this already, though? Couldn't you just load audio files from multiple different speakers into the dataset? The model would then be trained on all of them anyway, with the "average" sound being incorporated after multiple iterations.

effusiveperiscope commented 1 year ago

You could do this, but I can also see the utility in being able to do it with pre-existing models that may contain multiple speakers. Because lerp() can be used, you could also specify the degree to which the speakers are mixed.

cody151 commented 1 year ago

Oh I see, applying weightings for each speaker. Yeah, that would allow for more control, I suppose, but at some point this level of control would surely bring diminishing returns. As it stands, inflections and tonality are already pretty well done.
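The per-speaker weighting idea can be illustrated with plain arithmetic. The sketch below is only an illustration under stated assumptions: the function `mix_embeddings`, the speaker names, and the toy two-dimensional vectors are all hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
def mix_embeddings(embeddings, weights):
    """Return the weighted average of equal-length vectors.

    Weights are normalized so they sum to 1, which means only their
    ratios matter (weights of 2, 1, 1 mean 50%, 25%, 25%).
    """
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    return [
        sum((w / total) * e[i] for e, w in zip(embeddings, weights))
        for i in range(dim)
    ]


# Toy 2-dimensional "embeddings" for three hypothetical speaker slots.
neutral = [1.0, 0.0]
angry = [0.0, 1.0]
whisper = [1.0, 1.0]

mixed = mix_embeddings([neutral, angry, whisper], [2, 1, 1])
print(mixed)  # [0.75, 0.5]
```

With two speakers and weights (1 - t, t), this reduces to the same result as a linear interpolation with weight t.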

effusiveperiscope commented 1 year ago

Added in e8cef9d, but disabled by default and put behind a command-line switch, since it seems temperamental and I don't expect most users to need it. The switch is --custom_merge. Let me know if it looks like the right functionality.
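As a rough illustration of an opt-in switch that is off by default, here is a minimal argparse sketch. It is not the code from e8cef9d, whose contents are not shown in this thread: only the flag name --custom_merge comes from the comment above, and the parser, its description, and the help text are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the flag name comes from the thread.
    parser = argparse.ArgumentParser(description="Hypothetical inference interface")
    parser.add_argument(
        "--custom_merge",
        action="store_true",  # False unless the flag is passed explicitly
        help="enable experimental speaker-embedding mixing",
    )
    return parser


# Without the flag, the feature stays disabled.
args_default = build_parser().parse_args([])
# Passing the flag opts in.
args_enabled = build_parser().parse_args(["--custom_merge"])

print(args_default.custom_merge, args_enabled.custom_merge)  # False True
```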