Since the GPT part is an autoregressive (AR) model, it can run in streaming mode. The vocoder is a convolution-based model, so it can also be streamed, that is, run chunk by chunk with a sliding window. Can the diffusion model be run in chunks too? If so, the whole pipeline could run in streaming mode.
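To make the sliding-window idea for the conv-based part concrete, here is a minimal sketch in plain Python. It is a toy causal 1-D convolution, not the actual vocoder, and the names `causal_conv1d` and `stream_conv1d` are made up for illustration. The assumption is that each output sample depends only on the last `len(kernel) - 1` input samples, so carrying that much history between chunks should reproduce the full-sequence output.

```python
from typing import Iterable, Iterator, List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Full-sequence causal convolution with zero left padding.

    y[t] = sum_j kernel[j] * x[t - j]
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(len(x))
    ]


def stream_conv1d(
    chunks: Iterable[Sequence[float]], kernel: Sequence[float]
) -> Iterator[List[float]]:
    """Same convolution, computed chunk by chunk (hypothetical helper).

    Only the last k - 1 input samples are kept between chunks; this
    carried-over history plays the role of the sliding window.
    """
    k = len(kernel)
    history: List[float] = [0.0] * (k - 1)  # zero padding before the first chunk
    for chunk in chunks:
        window = history + list(chunk)
        yield [
            sum(kernel[j] * window[t + k - 1 - j] for j in range(k))
            for t in range(len(chunk))
        ]
        # Keep the trailing k - 1 samples as left context for the next chunk.
        history = window[len(window) - (k - 1):]


# Toy input: 10 samples split into chunks of 4, 4 and 2.
signal = [float(i) for i in range(1, 11)]
kernel = [0.5, 0.3, 0.2]
chunks = [signal[i:i + 4] for i in range(0, len(signal), 4)]

full_output = causal_conv1d(signal, kernel)
streamed_output = [y for out in stream_conv1d(chunks, kernel) for y in out]
```

In this toy case the streamed output matches the full-sequence output sample for sample. A real vocoder with non-causal or transposed convolutions would need lookahead or overlap on both sides, which adds latency, and this sketch says nothing about whether the diffusion model can be chunked the same way.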