Voice conversion for speech audio is becoming increasingly popular. I wonder whether zero-shot style transfer learning can be applied to voice conversion, for example converting from a source speaker's voice (sv) to a target speaker's voice (tv). The idea would be to extract the style (prosody, stress, accent, and so on) of sv and the content (timbre and characters) of tv, and then mix the style and content.
I am really looking forward to your reply, thank you.