Adds proper TTS streaming via HTTP by using coqui's inference_stream method and FastAPI's StreamingResponse. The client can consume new audio data as soon as it's ready. I found a chunk size of 100 (coqui tokens, I assume?) to provide a favorable latency/interruption rate on my MacBook running CPU inference.
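A minimal sketch of the streaming shape, using only the standard library (the real endpoint pulls chunks from coqui's inference_stream and wraps this generator in FastAPI's StreamingResponse; the silent-PCM stub and the 24 kHz/mono/16-bit parameters here are assumptions for illustration):

```python
import struct

def wav_header(sample_rate: int = 24000, channels: int = 1, bits: int = 16) -> bytes:
    """Build a RIFF/WAV header with an unknown (max) data length, so the
    stream can start before the full audio has been synthesized."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    # 0xFFFFFFFF chunk sizes: browsers accept WAV streams of unknown length.
    return (b"RIFF" + struct.pack("<I", 0xFFFFFFFF) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels,
                                    sample_rate, byte_rate, block_align, bits)
            + b"data" + struct.pack("<I", 0xFFFFFFFF))

def stream_tts(text: str):
    """Yield a WAV header first, then PCM chunks as they are produced.
    In the real endpoint the chunks come from coqui's inference_stream;
    here a silent stand-in is used so the sketch is self-contained."""
    yield wav_header()
    for _ in range(3):            # stand-in for the model's chunk loop
        yield b"\x00\x00" * 1024  # 1024 silent 16-bit samples per chunk
```

In the actual route, this generator would be returned as `StreamingResponse(stream_tts(text), media_type="audio/wav")`, so the first bytes reach the client before synthesis finishes.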
Implications
Works only with local models.
Uses HTTP GET instead of HTTP POST. Explanation below.
Initially, I wanted to stick to HTTP POST requests only and do audio playback using client-side JavaScript, but unfortunately MediaSource does not support working with WAV data. Adding intermediate compression would only increase latency and create more complexity. Using HTTP GET allows doing playback directly from HTML by setting the audio source to the API endpoint; the browser does all the buffering and decoding at no extra cost.
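To illustrate the GET-based playback, this helper builds the `<audio>` element pointing straight at the streaming endpoint. The endpoint path and the `text` query-parameter name are hypothetical placeholders, not the PR's actual route:

```python
from urllib.parse import quote

# Hypothetical base URL; the real route name comes from the server's code.
BASE = "http://localhost:8000/api/tts/stream"

def audio_tag(text: str) -> str:
    """Return an HTML <audio> element whose src is the streaming GET
    endpoint -- the browser buffers and decodes the WAV stream itself,
    with no client-side JavaScript required."""
    return f'<audio controls autoplay src="{BASE}?text={quote(text)}"></audio>'
```

Because the source is a plain URL, the same endpoint also works when opened directly in a browser tab or passed to any HTTP audio player.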
References

Related SillyTavern pull request: https://github.com/SillyTavern/SillyTavern/pull/1623