Plachtaa / seed-vc

State-of-the-art zero-shot voice conversion & singing voice conversion with in-context learning
GNU General Public License v3.0

Streaming inference and speech token extraction #6

Open vitreo12 opened 2 months ago

vitreo12 commented 2 months ago

Hello!

First off, amazing project! I cloned it and got it up and running quite easily.

I have a question about the streaming inference mentioned in the README. Is it meant to allow inference on real-time audio, rather than one-shot conversion? I ran some benchmarks, and the bottleneck seems to be the CosyVoice speech token extraction. How could this work with real-time audio? The target voice's speech tokens can be extracted before inference time, but how would you approach extracting the tokens for the source voice? Do you plan to use a different architecture to make it work for real-time audio streams?

Thank you and have a great day!
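
For readers following the question, here is a minimal sketch of the split it describes: the target voice's features are computed once up front, while the source features have to be recomputed for every incoming chunk, which is where a slow extractor becomes the bottleneck. All names here (`stream_convert`, `toy_extract`, `toy_convert`, `context_len`) are hypothetical and are not part of seed-vc or CosyVoice. The sketch also assumes the converter returns one output sample per input sample, and it says nothing about how the project actually implements streaming.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Samples = List[float]


def stream_convert(
    chunks: Iterable[Sequence[float]],
    extract_source_features: Callable[[Sequence[float]], Sequence[float]],
    convert: Callable[[Sequence[float], float], Sequence[float]],
    target_features: float,
    context_len: int,
) -> Iterator[Samples]:
    """Convert audio chunk by chunk, reusing precomputed target features.

    Assumption: `convert` returns one output sample per input sample, so
    the tail of its output lines up with the newest chunk.
    """
    context: Samples = []  # most recent source samples, used as left context
    for chunk in chunks:
        new = list(chunk)
        window = context + new
        # Per-chunk cost: this step must finish faster than the chunk
        # duration for the stream to keep up with real time.
        feats = extract_source_features(window)
        out = list(convert(feats, target_features))
        # Emit only the samples that correspond to the new chunk.
        yield out[len(out) - len(new):]
        # Keep the last `context_len` samples for the next iteration.
        context = window[-context_len:] if context_len > 0 else []


# Toy stand-ins (hypothetical): a real system would run a speech-token
# extractor and a conversion model here.
def toy_extract(window: Sequence[float]) -> Samples:
    return list(window)


def toy_convert(feats: Sequence[float], target_gain: float) -> Samples:
    return [x * target_gain for x in feats]


if __name__ == "__main__":
    target = 2.0  # computed once, before any source audio arrives
    for piece in stream_convert([[1, 2], [3, 4], [5]], toy_extract, toy_convert, target, 2):
        print(piece)
```

The left-context buffer is one common way to give a per-chunk extractor some history; whether it is enough depends on how much context the real token extractor needs.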

Plachtaa commented 2 months ago

Thanks for your positive comments. Regarding your questions, we already have a solution for streaming inference, but we are still working on improving its stability and inference speed before releasing it.

vitreo12 commented 2 months ago

Nice! Looking forward to the implementation then :)

Plachtaa commented 3 weeks ago

The streaming inference GUI has been released.