OvidijusParsiunas / deep-chat

Fully customizable AI chatbot component for your website
https://deepchat.dev
MIT License
1.27k stars 175 forks

Text-to-Speech functionality does not work for html messages #121

Closed phatneglo closed 4 months ago

phatneglo commented 4 months ago

Huge shoutout to the folks behind this repo! 🙌 Seriously, you guys have saved me more times than I can count. Everything from the docs to the code is just top-notch and super easy to dive into. It's been a game-changer for my projects, and I've learned a ton along the way.

Big thanks for all the hard work you've put in. You've made something awesome that's not just helpful but also super inspiring.

Environment:

Issue Description: In our Vue 3 chat application, we use streaming to implement text-to-speech (TTS) functionality for chat messages. The TTS works as expected for initial messages. However, when we try to introduce follow-up questions by uncommenting the logic that adds extra HTML content for user selection, the TTS stops working. It seems the TTS functionality fails to read the reply once new HTML content is introduced into the chat.

Steps to Reproduce:

  1. Use TTS with streaming for incoming chat messages.
  2. Receive a chat message and observe TTS working as expected.
  3. Uncomment the logic that fetches and adds follow-up questions with HTML content to the chat.
  4. Observe that TTS does not work for the new content.

Expected Behavior: The text-to-speech should continue to work and read out the chat messages even after adding follow-up questions with HTML content.

Actual Behavior: After adding follow-up questions with HTML content to the chat, the text-to-speech functionality stops working. It appears that the TTS can only read the initial reply from the stream, and fails to process the newly added messages.

Troubleshooting Done: below is the request handler we currently use (the commented-out block inside the onerror callback is the follow-up-question logic that breaks TTS when enabled).

const requestHandler = {
  handler: async (body, signals) => {
    try {
      if (props.activeThreadId) {
        threadId.value = props.activeThreadId;
      }
      // Endpoint URL
      console.log(threadId.value);
      if (!threadId.value) {
        await startChat(authStore.username, "");
      }

      const endpoint = `${props.endpointUrl}/chat/submit_member_message/${threadId.value}`;

      // Check if the message includes files
      if (body instanceof FormData) {
        // For files, the body is already a FormData object
        // Directly send the FormData
        const response = await fetch(endpoint, {
          method: "POST",
          body: body,
          headers: {
            Authorization: `Bearer ${authStore.token}`,
          },
        });
        const responseData = await response.json();

        if (response.ok) {
          signals.onOpen(); // stops the loading bubble
          // Stream messages
          // Note: Replace with your streaming endpoint and logic
          const messageStream = new EventSource(
            `${props.endpointUrl}/chat/stream_message/${responseData.message_id}`
          );
          messageStream.onmessage = (event) => {
            // Parse the JSON data from the server
            const serverData = JSON.parse(event.data);

            // Check if serverData has the 'text' property and respond accordingly
            if (serverData && serverData.text) {
              signals.onResponse(serverData);
            } else if (serverData && serverData.html) {
              signals.onResponse(serverData);
            } else if (serverData && serverData.files) {
              signals.onResponse({ files: serverData.files });
            } else {
              signals.onResponse({
                error: "Invalid response format from server",
              });
            }
          };
          messageStream.onerror = async () => {
            messageStream.close();
            // if (messageStream.readyState === EventSource.CLOSED) {
            //   const nextQuestions = await getNextQuestion(
            //     responseData.message_id
            //   );
            //   deepChatRef.value.addMessage({ html: nextQuestions });
            // }

            signals.onClose();
          };
          // Handle stop click
          signals.stopClicked.listener = () => {
            messageStream.close();
          };
        } else {
          signals.onResponse({ error: "Error in connecting to chat service" });
        }
      } else {
        // For text-only messages, the body is a JSON object
        // Send a JSON request
        const response = await fetch(endpoint, {
          method: "POST",
          headers: {
            "Content-Type": "application/json",
            Authorization: `Bearer ${authStore.token}`,
          },
          body: JSON.stringify(body),
        });
        const responseData = await response.json();

        if (response.ok) {
          signals.onOpen(); // stops the loading bubble
          // Stream messages
          // Note: Replace with your streaming endpoint and logic
          const messageStream = new EventSource(
            `${props.endpointUrl}/chat/stream_message/${responseData.message_id}`
          );
          messageStream.onmessage = (event) => {
            // Parse the JSON data from the server
            const serverData = JSON.parse(event.data);

            // Check if serverData has the 'text' property and respond accordingly
            if (serverData && serverData.text) {
              signals.onResponse(serverData);
            } else if (serverData && serverData.html) {
              signals.onResponse(serverData);
            } else if (serverData && serverData.files) {
              signals.onResponse({ files: serverData.files });
            } else {
              signals.onResponse({
                error: "Invalid response format from server",
              });
            }
          };
          messageStream.onerror = async () => {
            messageStream.close();
            // if (messageStream.readyState === EventSource.CLOSED) {
            //   const nextQuestions = await getNextQuestion(
            //     responseData.message_id
            //   );
            //   deepChatRef.value.addMessage({ html: nextQuestions });
            // }

            signals.onClose();
          };
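          // Note (added for clarity): EventSource does not emit a 'close' event, and the
          // signals API exposes onClose (capital C) as a function to call rather than a
          // callback to assign, so the two handlers below likely never run.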
          messageStream.onclose = () => {
            // Add a message when the connection is closed
            deepChatRef.value.addMessage({ text: "AI message" });
            signals.onClose();
          };
          signals.onclose = () => {
            deepChatRef.value.addMessage({ text: "AI message" });
          };
          // Handle stop click
          signals.stopClicked.listener = () => {
            messageStream.close();
          };
        } else {
          signals.onResponse({ error: "Error in connecting to chat service" });
        }
      }
    } catch (e) {
      console.error("An error occurred while submitting the message:", e);

      // Determine the type of error and provide a more specific error message
      if (e instanceof TypeError) {
        signals.onResponse({
          error:
            "There was a network issue. Please check your internet connection.",
        });
      } else if (e instanceof SyntaxError) {
        signals.onResponse({
          error:
            "There was a problem parsing response data. Please try again later.",
        });
      } else if (e.name === "AbortError") {
        signals.onResponse({
          error:
            "The request was aborted. You may have navigated away from the page.",
        });
      } else {
        // For unexpected errors, provide a generic error message
        signals.onResponse({
          error: "An unexpected error occurred. Please try again.",
        });
      }
    }
  },
};
OvidijusParsiunas commented 4 months ago

Hi @phatneglo.

Thank you for the insightful description of your issue.

Unfortunately, textToSpeech does not work for html responses, as there is no standardized way to determine which part of the markup is the actual text that needs to be spoken. To get this working for your case you would have to fork/clone the project and tailor the codebase to your custom html responses. It is actually not that hard to do, and the instructions on how to get started are listed here.
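
If forking is not an option for you, one possible app-side workaround (just a sketch, not part of Deep Chat's built-in textToSpeech) is to strip the markup yourself and hand the plain text to the browser's speechSynthesis whenever an html chunk arrives:

// App-side sketch only: extract the visible text from an html response
// and speak it manually via the Web Speech API.
const speakHtml = (html) => {
  const container = document.createElement("div");
  container.innerHTML = html;                     // parse the markup
  const plainText = container.textContent || "";  // keep only the visible text
  if (plainText.trim()) {
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(plainText));
  }
};

// e.g. inside your onmessage handler:
// } else if (serverData && serverData.html) {
//   speakHtml(serverData.html); // hypothetical helper, app-side only
//   signals.onResponse(serverData);
// }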

Another small thing that I noticed is that you are using the addMessage method, which has been deprecated. I would instead advise you to set the websocket property to true, which will allow you to reply with multiple messages (you do not need to have an actual websocket connection). In addition to this, our recent dev version actually supports using websockets to act as streams, which introduces a stop word to the stream property to indicate that a message has finished streaming. This would work perfectly with your setup; just a few small things will need to be switched around for the websocket infrastructure (check the websocket example in the handler documentation). Find out how to use this here.
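
A rough sketch of what the websocket-style handler could look like for your setup is below. The exact property and signal names (websocket, newUserMessage, etc.) follow my reading of the handler documentation, so please double check them against the docs for your version:

// Hedged sketch of a handler used together with websocket: true. In this mode
// the handler runs once, user messages arrive through
// signals.newUserMessage.listener, and you can call signals.onResponse as many
// times as you like - e.g. the answer first, the follow-up question html after.
const connectConfig = {
  websocket: true,
  handler: (_, signals) => {
    signals.onOpen(); // marks the simulated connection as ready
    signals.newUserMessage.listener = async (body) => {
      // forward `body` to your backend / EventSource here, then respond:
      signals.onResponse({ text: "Main answer from the stream" });
      signals.onResponse({ html: "<button>Follow-up question</button>" });
    };
  },
};

You would then pass this object to the component in place of your current requestHandler.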

OvidijusParsiunas commented 4 months ago

Just wanted to add that I am very much on the fence about returning the addMessage method, given how many people seem to want to use it, so it may return in the near future. Currently you can actually still use it by simply calling the _addMessage method.
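
For example, with the deepChatRef from your snippet above, the call from the commented-out block would simply become:

deepChatRef.value._addMessage({ html: nextQuestions }); // same signature as addMessage, just underscore-prefixed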

phatneglo commented 4 months ago

@OvidijusParsiunas yeah, I noticed that too. When I updated the package I saw that addMessage was gone, and I found it as _addMessage.

Also, one more suggestion.

When responses are lengthy, the TTS voice can keep speaking for a long time. It would be great to be able to adjust this with "stop" or "mute" controls: while the TTS voice is active, being able to say "stop" or "mute" (or tap a control) to prevent it from continuing, or to pause it temporarily, would let the user proceed to the next interaction without the voice still talking. If there is already a way to do this, please let me know so I can make sure it is set up properly.

Again, thanks for your support! I wish I could help you back once I'm all done with my project.

OvidijusParsiunas commented 4 months ago

Hey, adding text-to-speech configuration for options such as start/stop/resume is a little tricky to do for a couple of reasons: the first is that this would be a significant UX change that requires a different message layout to create room for these options (this is one of the reasons why the ChatGPT app has an entirely different chat experience when using STT and TTS), and the second is that something like this would take a considerably long time to develop.

Therefore, because this functionality has a lower priority than our other upcoming work, I unfortunately will not be able to pursue it at the moment, but perhaps I can revisit it in the future.

Thank you for the suggestion!

phatneglo commented 4 months ago

No problem bro! Again, thanks for your help. I'll close this now!

phatneglo commented 4 months ago

Hey there!

I've been playing around with deep-chat for a bit and really love what you've built. It's been super helpful for adding chat functionalities to my Vue projects. 🚀

While integrating it, I thought it'd be awesome to have a bit more control over the speech synthesis, especially being able to mute or stop the assistant's speech on the fly. I figured this could really amp up the user experience, letting users quickly pause the assistant whenever needed.

So, I tinkered around and came up with a neat little enhancement that does just that. Here's the gist of what I did:

  1. Keeping an Eye on Speech: I used Vue's ref to set up a reactive property called isSpeaking. It keeps track of whether the speech synthesis is doing its thing.

  2. Getting in the Middle: I went ahead and tweaked the window.speechSynthesis.speak method a bit. This way, I could listen in on when speech starts and stops, updating isSpeaking to reflect the current state.

  3. Quick Mute Button: I popped in a Quasar Floating Action Button (FAB) that shows up only when the assistant is chatting away. A quick tap on this button, and voilà, the assistant takes a breather.

Here's how I pieced it together:

import { ref, onMounted } from "vue"; // composition API imports the snippet relies on

const isSpeaking = ref(false);

const cancelSpeech = () => {
  window.speechSynthesis.cancel();
  isSpeaking.value = false; // Let's keep things updated
};

onMounted(() => {
  const originalSpeak = window.speechSynthesis.speak.bind(window.speechSynthesis);

  window.speechSynthesis.speak = (utterance) => {
    utterance.addEventListener("start", () => {
      isSpeaking.value = true;
      console.log("We're talking!");
    });
    utterance.addEventListener("end", () => {
      isSpeaking.value = false;
      console.log("All done talking.");
    });
    originalSpeak(utterance);
  };
});

And in the template:

<q-btn v-if="isSpeaking" fab icon="fas fa-volume-mute" color="teal" @click="cancelSpeech" class="fixed-bottom-right" />

Plus a little styling to keep the button snug in the corner:

.fixed-bottom-right {
  position: fixed;
  right: 20px;
  bottom: 20px;
  z-index: 999;
}

I thought this might be a handy feature for deep-chat! It's been a game changer for me, and I reckon it could be for others too. Maybe it's something that could be baked into the next version? Just a thought! 😄

Keep up the great work!

OvidijusParsiunas commented 4 months ago

Hi @phatneglo.

By default Deep Chat uses the speechSynthesis window property to facilitate the text to speech functionality, so the way you tapped into its functionality from Vue is very clever!

As I mentioned, one of the bigger hurdles of facilitating this functionality from within Deep Chat is the UX, as slotting it into the chat in a clean manner is not simple. Of course, there is also the development time that comes along with it. For now, I hope developers can look at your code and build extensible text-to-speech experiences that way.

Thank you very much once again @phatneglo!