Plachtaa / VALL-E-X

An open-source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io
MIT License
7.42k stars 747 forks

AR model streaming capabilities #135

Open shigabeev opened 8 months ago

shigabeev commented 8 months ago

Hi! Thank you for publishing your model weights and code. I'm wondering whether it's possible to run inference token by token. The paper describes an autoregressive model, but as far as I can see it is autoregressive over the output tokens and not over the text input. I still hope that I'm missing something. The use case is audio streaming for LLMs such as ChatGPT.

So, is there a way to run inference with only the first few words of a sentence, feeding in the rest of the text input as it arrives from the LLM? If so, could you show me how?
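
For illustration, here is a minimal sketch of the setup in question. It assumes a hypothetical `synthesize(text)` callable standing in for whatever inference call the model exposes; it is not this repository's API. Text pieces arrive from the LLM and are buffered until a sentence boundary, then each complete sentence is synthesized. This is only sentence-level chunking, not the token-by-token text streaming asked about above, and the boundary rule is a naive one that will also split on abbreviations such as "e.g.".

```python
import re
from typing import Callable, Iterable, Iterator, TypeVar

Audio = TypeVar("Audio")

# Naive boundary rule: sentence-ending punctuation followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_stream(pieces: Iterable[str]) -> Iterator[str]:
    """Group incoming text pieces (e.g. LLM output) into complete sentences."""
    buffer = ""
    for piece in pieces:
        buffer += piece
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be incomplete, so keep buffering it.
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


def stream_tts(
    pieces: Iterable[str],
    synthesize: Callable[[str], Audio],  # hypothetical TTS call, not a real API
) -> Iterator[Audio]:
    """Yield one synthesized result per complete sentence as text arrives."""
    for sentence in chunk_stream(pieces):
        yield synthesize(sentence)


if __name__ == "__main__":
    fake_llm_output = ["Hello", " world. How", " are you?", " Fine"]
    # A stand-in synthesizer that just echoes the text it would speak.
    for audio in stream_tts(fake_llm_output, lambda text: f"<audio for {text!r}>"):
        print(audio)
```

With this approach the latency before the first audio is bounded by the length of the first sentence rather than the whole response, which is the trade-off a true streaming-input model would avoid.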

alexivaner commented 5 months ago

I also have the same question.