Overview
We aim to enhance user interaction with our large language model (LLM) by introducing an optional mode in which text-to-speech (TTS) begins as soon as the first sentence of a generation is complete. This feature will provide real-time auditory feedback while messages are still being generated, making the application feel more dynamic and interactive.
Current Implementation
Text-to-Speech (TTS) System: Our TTS system currently receives and vocalizes only complete text messages; it does not support streaming or partial text updates.
Message Synchronization: The client and server are synchronized via a subscription model, which updates the client's message view as the server processes and updates the message data. However, the client receives no indication that a message is being streamed or when streaming has completed.
Proposed Implementation
Server-Side
WebSocket Communication: Implement WebSocket messaging that specifically notifies the client when a message starts streaming and when it is completed.
Send initial WebSocket message when the first part of the message is ready to be vocalized.
Continue updating as more text becomes available until the message is complete.
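The notification flow above could be sketched as a small set of WebSocket event types. This is a minimal sketch; the event names, fields, and the choice to send the full text-so-far in each chunk are assumptions, not part of the existing protocol:

```typescript
// Sketch of the streaming-notification events (all names are assumptions).
type StreamEvent =
  | { kind: "stream_start"; messageId: string }              // first chunk is ready
  | { kind: "stream_chunk"; messageId: string; text: string } // full text so far
  | { kind: "stream_end"; messageId: string };               // generation finished

// The server would serialize each event as JSON before sending it over
// the WebSocket connection; the client decodes it on receipt.
function encodeEvent(event: StreamEvent): string {
  return JSON.stringify(event);
}

function decodeEvent(raw: string): StreamEvent {
  return JSON.parse(raw) as StreamEvent;
}
```

Sending the full accumulated text in each `stream_chunk` (rather than a delta) keeps the client stateless with respect to ordering, at the cost of slightly larger payloads.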
Client-Side
Streaming TTS Integration:
Modify the TTS system to handle partial text updates. Begin vocalizing text as soon as the first update is received.
Continuously update the spoken message as new text streams in, ensuring seamless vocalization.
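One way to begin vocalizing as soon as the first sentence is complete is to track how far into the accumulated text the TTS system has already spoken, and hand over only complete sentences beyond that point. The function below is a minimal sketch under that assumption; real sentence segmentation (abbreviations, decimals, quotes) needs more care:

```typescript
// Extract the complete sentences in the accumulated text that have not yet
// been spoken. `spokenUpTo` is the character index already vocalized;
// returns the next speakable chunk and the advanced index.
function nextSpeakableChunk(
  fullText: string,
  spokenUpTo: number
): { chunk: string; spokenUpTo: number } {
  const pending = fullText.slice(spokenUpTo);
  // Greedily match up to the last sentence-ending punctuation mark that is
  // followed by whitespace or the end of the pending text.
  const match = pending.match(/^[\s\S]*[.!?](?=\s|$)/);
  if (!match) return { chunk: "", spokenUpTo }; // no complete sentence yet
  const chunk = match[0];
  return { chunk: chunk.trim(), spokenUpTo: spokenUpTo + chunk.length };
}
```

The caller invokes this on every incoming update; an empty `chunk` simply means the current sentence is still in progress and nothing new is handed to TTS yet.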
Active Message Tracking:
Implement logic to track which messages are actively being updated and are relevant to the current user view.
Use the message ID provided by the WebSocket updates to match each update with the currently active message thread, and only vocalize messages that are actively displayed to the user.
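The tracking logic above could be as small as a set of visible message IDs consulted before forwarding any update to TTS. A minimal sketch (the class and method names are assumptions):

```typescript
// Records which message IDs are currently displayed so that only
// visible messages get vocalized.
class ActiveMessageTracker {
  private visible = new Set<string>();

  // Called when a message enters / leaves the user's current view.
  show(messageId: string): void { this.visible.add(messageId); }
  hide(messageId: string): void { this.visible.delete(messageId); }

  // Called for each incoming stream update; returns whether the update
  // should be forwarded to the TTS system.
  shouldVocalize(messageId: string): boolean {
    return this.visible.has(messageId);
  }
}
```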
Goals
Responsiveness: Enhance user experience by reducing the wait time between message generation and vocalization.
Accuracy: Ensure that the TTS system accurately reflects the ongoing message generation without duplicating or missing text as updates stream in.
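The no-duplication guarantee can be enforced with a guard that compares each new text snapshot against the previous one and yields only the appended portion. This is a sketch under the assumption that the server normally sends append-only snapshots; a `null` result flags a rewritten snapshot that the caller must handle explicitly (e.g. by restarting vocalization for that message):

```typescript
// Returns the text appended since the previous snapshot, or null if the
// new snapshot is not a pure extension of the old one.
function appendedText(previous: string, current: string): string | null {
  if (!current.startsWith(previous)) return null; // snapshot was rewritten
  return current.slice(previous.length);
}
```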
Additional Considerations
Concurrency: Handle scenarios where multiple messages are being generated and updated simultaneously, especially in multi-user environments.
Performance: Assess and optimize the impact of real-time text streaming and vocalization on both server and client performance.
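For the concurrency consideration, one option is to keep an independent vocalization cursor per message ID so that simultaneously streaming messages never interleave their text. A minimal sketch (names are assumptions):

```typescript
// Per-message cursors: messageId -> number of characters already handed
// to TTS for that message.
class StreamBuffers {
  private cursors = new Map<string, number>();

  // Given the latest full text for a message, return only the portion
  // not yet handed to TTS, and advance that message's cursor.
  advance(messageId: string, fullText: string): string {
    const spoken = this.cursors.get(messageId) ?? 0;
    const pending = fullText.slice(spoken);
    this.cursors.set(messageId, fullText.length);
    return pending;
  }

  // Drop the cursor once the message's stream has completed.
  finish(messageId: string): void { this.cursors.delete(messageId); }
}
```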
Next Steps
Prototype the WebSocket communication enhancements to handle streaming message notifications.
Develop and integrate the updated TTS system to support partial text updates.
Implement the necessary client-side logic to manage active message tracking and synchronization.
Conduct testing with real users to validate the implementation and gather feedback for further refinements.