coleam00 / bolt.new-any-llm

Prompt, run, edit, and deploy full-stack web applications using any LLM you want!
https://bolt.new
MIT License
3.89k stars · 1.6k forks

feat: added voice prompting #281

Open milutinke opened 1 week ago

milutinke commented 1 week ago

Hey everyone, I've implemented voice prompting using Whisper. I've added a way to change the provider and the Whisper model via environment variables; the default is OpenAI, but any OpenAI-compatible provider should work.

Here is a demo:

https://github.com/user-attachments/assets/c1a2b13d-5436-419a-b23c-40e8f16791cd

Note: This is a first version, made just to have something usable. I want to add noise reduction in the future, maybe compression, and maybe an option to auto-enhance the prompt after transcription (a checkbox or something).
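For reference, the "any OpenAI-compatible provider" approach described above can be sketched as a small client posting recorded audio to a configurable `/audio/transcriptions` endpoint. This is a minimal illustration, not the PR's actual code; the env-var and function names here are hypothetical.

```typescript
// Sketch of calling an OpenAI-compatible Whisper transcription endpoint.
// Works against OpenAI, Groq, or any provider exposing the same route.
// All names here are illustrative, not taken from the PR.

// Build the endpoint URL from a configurable base (e.g. process.env.WHISPER_BASE_URL).
export function transcriptionUrl(baseUrl: string): string {
  // Strip trailing slashes so we never produce "…//audio/transcriptions".
  return `${baseUrl.replace(/\/+$/, '')}/audio/transcriptions`;
}

// Send a recorded audio blob to the endpoint and return the transcript text.
export async function transcribe(
  audio: Blob,
  opts: { baseUrl: string; apiKey: string; model?: string },
): Promise<string> {
  const form = new FormData();
  form.append('file', audio, 'prompt.webm');
  form.append('model', opts.model ?? 'whisper-1');

  const res = await fetch(transcriptionUrl(opts.baseUrl), {
    method: 'POST',
    headers: { Authorization: `Bearer ${opts.apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
  const { text } = (await res.json()) as { text: string };
  return text;
}
```

Swapping providers is then just a matter of changing the base URL, e.g. `transcriptionUrl('https://api.groq.com/openai/v1')`.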

navyseal4000 commented 1 week ago

Your implementation looks better than mine for this. That said, I like not having an AI-capable vector in the speech-to-text module, because it's easy to accidentally run up a bill, and there's already an enhance-prompt button (which may or may not work; I haven't actually tested it yet). Of course, other opinions on the matter are welcome. I'm not attached to which of our implementations gets merged; I just want the best one in.

wonderwhy-er commented 1 week ago

I think the browser API is good enough for our needs so far; fewer dependencies.

Browser support for speech-to-text should be pretty good and wide.

Whisper is still better, but it comes with a heavy local model load or a paid API.

In that sense... I like the browser API better.

milutinke commented 1 week ago

Maybe this one can be used as a fallback if the browser doesn't support the API?
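The fallback idea suggested here can be sketched as a feature probe plus a pure decision function. This is only an illustration of the approach; the names are made up, and Chrome exposes the API under the `webkit` prefix, which the probe accounts for.

```typescript
// Sketch: use the browser's SpeechRecognition where available
// (Chrome/Edge/Safari) and fall back to Whisper elsewhere (e.g. Firefox).
// Names are illustrative, not from either implementation in this thread.

export type SpeechBackend = 'browser' | 'whisper';

// Pure decision, kept separate from the DOM probe so it is easy to unit-test.
export function pickBackend(hasSpeechRecognition: boolean): SpeechBackend {
  return hasSpeechRecognition ? 'browser' : 'whisper';
}

// Probe the runtime; Chrome ships the API as webkitSpeechRecognition.
export function hasBrowserSpeechRecognition(): boolean {
  const g = globalThis as any;
  return Boolean(g.SpeechRecognition ?? g.webkitSpeechRecognition);
}
```

At startup the app would call `pickBackend(hasBrowserSpeechRecognition())` and wire up the matching recorder.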

wonderwhy-er commented 1 week ago

Also, @navyseal4000's solution adds text in real time. And I like that it keeps working the whole time, so I can press submit, see what happened, and speak further.

Closer to how ChatGPT advanced voice works in some sense.

As for support. Here it is https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition

Looks like Opera and Firefox have some limitations.

Does it warrant adding 300 lines of code and a paid third-party service to support and fix when it breaks?

I would keep it simple; it works on Chrome/Edge/Safari...

Though I guess there are Firefox users out there...

I think what bothers me most is that it uses the OpenAI API. There is this https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu that loads the model client-side and does the same.

It's kinda cool.

But as far as I know it would also fail on mobile, since WebGPU support depends on the browser...

I know you put a lot of work and passion into this @milutinke, but consider the upkeep of additional dependencies in the future. Thank you!

I would stay with the browser APIs here.

But I am not strongly against it. It just feels like a bit of overhead; I wonder what others think... I prefer simple and fewer dependencies :)

milutinke commented 1 week ago

I looked at https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu; it fails in Thorium/Chromium and Brave, as well as in Firefox, so I don't think it's usable yet, but it's cool nonetheless.

I completely understand the concern, but I think having this as a fallback would be good on mobile browsers and in browsers like Firefox (my daily driver, btw). People who use other browsers should have a fallback option; otherwise they'll be annoyed.

To address the cost concern, I'd look for a way to run Whisper primarily on the back-end itself (to avoid costly APIs), keep the current option for anyone who wants to use fast, cheap APIs like Groq, and personally maintain this piece of code if anything breaks, so it isn't a burden on others.

If you still think it's not worth it, I'll close the PR, and if people complain about it not working in other browsers, we can add it then. Thank you for your time.

Edit: Looks like Whisper can be run through Node: https://github.com/ChetanXpro/nodejs-whisper

wonderwhy-er commented 1 week ago

Ok. If you insist, let's do that.

There are non-WebGPU versions of client-side Whisper. I think they work better and more widely, just slower. And not on the server.

Give this a try https://huggingface.co/spaces/Xenova/whisper-web

milutinke commented 1 week ago

This is awesome; it works in my testing. I'd add this as a fallback and set it as the default, and if people request using Whisper via an API, we can add it later. That way the logic is reduced, basically cutting out all the back-end stuff for now. I am quite busy with university work, so I'll work on it after the 22nd.

wonderwhy-er commented 1 week ago

Ok, then I will not pick up @navyseal4000's version yet. In your version I don't like that it's an overlay and that it seems not to continue after submit.

In @navyseal4000's version I don't like that it's hard to tell whether it's recording right now or not; you commented on that yourself and I agree.

So, things to do:

  1. Take @navyseal4000's version as the basis
  2. Fix the part where it keeps adding old text after submit (restart listening on submit?); I can take that fix if needed
  3. Otherwise keep the UX that is there; I think it's a good one, except for the following
  4. Make it more visible that it's active, but not as an overlay like in yours
  5. Detect if the Speech API is not available and fall back to the local Whisper CPU model variant

What do you think? Need help?

milutinke commented 1 week ago

Sounds good, will do on the 22nd; pretty excited about this feature too. I'll ask you if I need help. Thanks.