Add InteractMode Component with TTS and STT Functions

o-stahl commented 5 months ago

Summary

This pull request introduces a new InteractMode component and integrates text-to-speech (TTS) and speech-to-text (STT) functionalities (the latter is not fully implemented in InteractMode). By default, transcription uses the Web Speech API; OpenAI's Whisper API can be enabled for more accurate transcription.

[Screenshot 2024-05-30 020714]

Key Changes

InteractMode Component:
- Implemented the InteractMode component to handle speech interactions within the chat application.
- Added functionality to monitor and visualize audio input in real-time.
fetchTTSResponse Function:
- Added fetchTTSResponse function to convert text to speech using the OpenAI API.
- Ensures high-quality audio playback of transcribed text.
fetchSTTResponse Function:
- Added fetchSTTResponse function to transcribe audio to text using the OpenAI Whisper API.
- Utilizes the Web Speech API for initial speech detection and transcription.
- Switches to Whisper API for more accurate transcription when enabled.
Toggle for Enhanced Accuracy:
- Introduced a toggle to switch between Web Speech API and Whisper API for transcription.
- Ensures only relevant speech is transcribed, reducing noise and improving accuracy.
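
For readers unfamiliar with the OpenAI audio endpoints, here is a minimal sketch of what a TTS request and the engine toggle could look like. It is not the code from this PR: `buildTtsRequest` and `chooseSttEngine` are hypothetical helper names, and the `alloy` voice and the API-key check are assumptions made for illustration. The endpoint URL and the `model`/`input`/`voice` fields follow OpenAI's public speech API.

```typescript
// Which transcription engine to use. "web-speech" is the default path,
// "whisper" is the opt-in, more accurate path described above.
type SttEngine = "web-speech" | "whisper";

interface TtsRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds (but does not send) a request for
// OpenAI's text-to-speech endpoint. The voice is an assumed default.
function buildTtsRequest(text: string, apiKey: string, model: string = "tts-1"): TtsRequest {
  return {
    url: "https://api.openai.com/v1/audio/speech",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, input: text, voice: "alloy" }),
  };
}

// Hypothetical helper: falls back to the Web Speech API unless the
// Whisper toggle is on and an API key is available (assumed rule).
function chooseSttEngine(useWhisper: boolean, hasApiKey: boolean): SttEngine {
  return useWhisper && hasApiKey ? "whisper" : "web-speech";
}
```

In a browser the descriptor could be passed to `fetch(req.url, req)` and the response read as a blob for playback. Keeping the builder separate from the network call makes it easy to test without hitting the API.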

Benefits

Enhanced user experience by enabling multimodal interaction.
Improved usage of OpenAI's endpoints, now also including TTS and STT.
Provides users with accurate and reliable speech-to-text and text-to-speech capabilities.

Notes & future plans

This is the first revision and only implements user speech to message transcription, but it should be perfectly usable in its current state.

Speech to text on assistant messages when the interact mode is enabled. (40754)
Settings tab for TTS/STT-related options, especially whether to use only the Web Speech API.
Adding TTS/STT functionalities to the other providers.

fingerthief commented 5 months ago

Really excellent work on this!

I've done some testing and I think this is easily solid enough to go ahead and merge into the main branch.

I made one commit to tweak a few little things:

Added a dynamic check for the highest quality supported audio format for the user's current device. It starts checking with the highest quality format and falls back to the next highest quality if it isn't supported. Rinse and repeat until the highest quality format that is supported is found.
Removed showing the error for no-speech while in interact mode. Otherwise it shows as an error after a bit of silence with no speech.
Increased audio playback speed by 5%.
Switched to tts-1-hd model as it seems to work fine.
- Soon enough this will be user configurable, along with speed etc.
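
The format fallback described in the first item can be sketched roughly as below. This is an illustration rather than the code from the commit: `pickSupportedMimeType` is a hypothetical name, and the candidate list and its ordering are assumptions. The support check is passed in as a function so the logic can run outside a browser.

```typescript
// Candidate recording formats, ordered from most to least preferred.
// This ordering is an assumption for illustration only.
const CANDIDATE_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/ogg;codecs=opus",
  "audio/mp4",
  "audio/wav",
];

// Walks the list in order and returns the first format the device
// reports as supported, or undefined if none of them is.
function pickSupportedMimeType(
  candidates: readonly string[],
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  for (const candidate of candidates) {
    if (isSupported(candidate)) {
      return candidate;
    }
  }
  return undefined;
}
```

In a browser the check would typically be `(t) => MediaRecorder.isTypeSupported(t)`, and the result can be passed as the `mimeType` option when constructing the `MediaRecorder`.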
Notes

Mobile support for interact mode has some wonkiness, on my phone at least; I'll be creating an issue for that problem. I also have a rough idea for a dynamic noise floor calculation, so our speech detection floor can vary with microphone sensitivity.
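
One possible shape for such a dynamic noise floor, purely as a sketch and not something implemented in this PR: track the floor as an exponential moving average of the input level while no speech is detected, and treat anything well above it as speech. The name `stepNoiseFloor`, the options, and their default values are all assumptions.

```typescript
interface NoiseFloorOptions {
  smoothing: number; // 0..1, how quickly the floor follows quiet input
  margin: number;    // speech must exceed floor * margin
  minFloor: number;  // keeps the threshold from collapsing to zero
}

// Assumed defaults, chosen only for illustration.
const DEFAULT_NOISE_FLOOR_OPTIONS: NoiseFloorOptions = {
  smoothing: 0.05,
  margin: 2,
  minFloor: 0.001,
};

// Takes the current floor and one new level sample (e.g. RMS in 0..1).
// Returns whether the sample counts as speech and the updated floor.
// The floor only adapts on non-speech samples, so talking does not raise it.
function stepNoiseFloor(
  floor: number,
  level: number,
  options: NoiseFloorOptions = DEFAULT_NOISE_FLOOR_OPTIONS,
): { floor: number; isSpeech: boolean } {
  const threshold = Math.max(floor, options.minFloor) * options.margin;
  const isSpeech = level > threshold;
  if (isSpeech) {
    return { floor, isSpeech };
  }
  const next = floor + options.smoothing * (level - floor);
  return { floor: Math.max(next, options.minFloor), isSpeech };
}
```

Called once per analyser frame, this lets a sensitive microphone settle on a higher floor than a quiet one, instead of both sharing one fixed threshold.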

o-stahl commented 5 months ago

switched to tts-1-hd model as it seems to work fine

OpenAI's regular "tts-1" model is faster and half the price, and according to user feedback the quality difference is (or at least was) barely noticeable even with audiophile gear. However, as you mentioned, model selection will take care of different preferences.

fingerthief / minimal-chat