braden-w / whispering

https://whispering.bradenwong.com/
MIT License
200 stars, 22 forks

I have returned! #103

Closed braden-w closed 1 week ago

braden-w commented 2 months ago

Hello everyone!

I wanted to apologize for my absence over the past few weeks. It was the very end of school, and things got really hectic. However, I have now returned and am super excited to resume development on Whispering!

I'm going to make a big push for an update this week, then I'll be diving into issues, so you might see me in your threads sometime in the coming weeks. There have been quite a few, and I’m absolutely thrilled to see how much this project has grown—more than I ever imagined!

Thank you all so much for your amazing support and patience. This is my first open source project and I'm so glad people have found it useful. I can't wait to get back to work and tackle your concerns and suggestions.

Best, Braden

braden-w commented 2 months ago

In my head, the plan for development in the next few weeks will be the following two things (not necessarily in this order):

  1. Personal updates to the app. These are things I've been thinking about for the past few months, such as updating to Svelte 5, code cleanup, fixing deployment, etc.
  2. Addressing issues: there are over 100 so far, and I finally have time to sit down and respond to each of them!

cgbur commented 2 months ago

It's selfish of me to promote my own biggest issue, but I think it would hopefully resonate with others. I exclusively use the desktop app, so that's my biggest focus. There's a bunch of small features that I think would be neat, and overall enhancements that you could try, but by far the biggest one is adding a manual or automatic retry when the Whisper API fails. Sometimes if you record a minute or two of speaking and it just totally gives you a bad response, there's no way to resend the request, which is very frustrating. I end up having to type out what I said by listening back or just re-record and hope it's good.

Also, custom post-processing would be awesome. https://platform.openai.com/docs/guides/speech-to-text/improving-reliability
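
To make the retry request concrete, here is a minimal sketch of resending the same recording with exponential backoff. It is not Whispering's code: `transcribe_with_retry`, `send_request`, and the default attempt count and delays are hypothetical names and values picked for illustration, and `send_request` stands in for whatever function actually uploads the audio.

```python
import time
from typing import Callable, Optional


def transcribe_with_retry(
    send_request: Callable[[bytes], str],  # hypothetical uploader: audio in, text out
    audio: bytes,
    max_attempts: int = 3,                 # assumed default, not from the app
    base_delay_s: float = 1.0,             # assumed default, not from the app
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Resend the same recording until a request succeeds or attempts run out."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return send_request(audio)
        except Exception as err:  # a real client would catch narrower error types
            last_error = err
            if attempt < max_attempts - 1:
                # Wait 1 s, 2 s, 4 s, ... before trying again.
                sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(
        f"transcription failed after {max_attempts} attempts"
    ) from last_error
```

The important part is that the recording stays in memory (or on disk) after a failure, so a manual "retry" button could call the same function again without a new recording.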

swimJim commented 2 months ago

What I do in that case is download the clip of my speaking, which is saved in the Whispering desktop app, and run it through another free Whisper-type service on the Hugging Face platform. That way the audio clip still gets transcribed. Still, your request is a good one.

doxgt commented 2 months ago

Based on my fairly extensive usage of the Whisper API over the last 10 days or so, I have to say that "resending the audio" to OpenAI sounds attractive in theory. In practice, however, the transcription results stay remarkably constant from one run to another, even when the runs are minutes apart. If I intentionally speak "near gibberish", and very fast, there can be some appreciable differences from one run to another (with the same audio snippet).

Otherwise, I almost wonder if there was some kind of "memory effect". Even using a different API key does not seem to effect any meaningful difference.

I've experimented with that very easily through cURL and AHK hotkeys, i.e., resending the same snippet several times.

Speaking from my own experience, the #1 useful feature is prompting. The #2 useful feature is prompting. The #3 useful feature, I'm sorry to belabor, is still prompting.

Even though I no longer use whispering myself, I have colleagues who are bound to MacBooks who would be eminently interested in the next iteration of whispering. And many would prefer a nicely thought-out GUI, such as what whispering offers.

PS, even though what Jim does seems circuitous, it actually makes sense to have the second run carried out on a different instance of Whisper altogether. I'm currently thinking maybe even over at Deepgram or AssemblyAI ...

PS #2. I doubt Braden would be able to implement "post-processing". I'd be pleasantly surprised if he does, because this is heavily user-specific. Prompting is what allows a user to do their own "custom" post-processing, especially with respect to punctuation, paragraph formatting, etc.

cgbur commented 2 months ago

Yes, I don't expect the results to differ from run to run. The problem is that after speaking for 60 seconds, it gives a red "failed" message and you are unable to do anything but listen to your recording. What I recorded was good and I want it transcribed; there's just not a way to do that right now beyond doing a whole new recording. I use it for dialogue back-and-forth maybe 200-500 times a day at work, and the failure rate is pretty high at times, which really breaks up work and hurts trust that you can record long messages and not worry about them going into the ether.

Also yes it would be very simple to implement post processing. You would not hard code it, but leave a text box for the user. Perhaps a few defaults from a dropdown to get users started if they don’t know how to use it.

doxgt commented 2 months ago

I see, my friend. Pardon me for misconstruing what you meant. Thinking back to when I was using whispering, I do recall experiencing one failure where I did not get a transcription back after dictating something over a minute in length, though I cannot remember the specifics right now.

I do wonder whether this could have been related to the audio file format. From what I could see, whispering "encapsulates" audio in a WAV "blob". I currently use M4A instead. A couple of minutes of WAV audio can potentially exceed 25 MB, which is the OpenAI limit, as you know. Braden himself will know more about the details.
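
As a rough check of that size claim, here is a small calculation of how long an uncompressed PCM WAV recording can run before reaching 25 MB. The recording settings below are assumptions for illustration (I don't know what whispering actually records at), the limit is read as 25 MiB, and the small WAV header is ignored.

```python
# Assumption: the 25 MB upload limit is treated as 25 MiB.
UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024


def wav_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Data rate of uncompressed PCM audio (WAV header not counted)."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def seconds_until_limit(sample_rate_hz: int, bits_per_sample: int, channels: int,
                        limit_bytes: int = UPLOAD_LIMIT_BYTES) -> float:
    """How many seconds of audio fit under the size limit."""
    return limit_bytes / wav_bytes_per_second(sample_rate_hz, bits_per_sample, channels)


# Hypothetical settings, not taken from the app:
cd_quality = seconds_until_limit(44_100, 16, 2)   # roughly 149 s, about 2.5 minutes
speech_mono = seconds_until_limit(16_000, 16, 1)  # roughly 819 s, about 13.7 minutes
print(f"44.1 kHz stereo: {cd_quality:.0f} s, 16 kHz mono: {speech_mono:.0f} s")
```

So whether "a couple of minutes" hits the limit depends heavily on sample rate and channel count; a compressed format such as M4A stays far below it for the same duration.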

For what it's worth, I have not seen a single failure since I started using my own AutoHotkey app last weekend. I did encounter a fair number of cases of the transcription engine doing bizarre things, including the rare outright hallucination, especially with short dictation.

If you're on Windows and interested, you can try the super simple, AutoHotkey-based cURL script in the meantime: https://github.com/doxgt/PlayGround/blob/main/GPT_cURL.ahk

doxgt commented 2 months ago

Also yes it would be very simple to implement post processing. You would not hard code it, but leave a text box for the user. Perhaps a few defaults from a dropdown to get users started if they don’t know how to use it.

On that, I think you are referring to "prompting". Indeed, prompting should be relatively easy to implement as it is just user input.

However, by "post-processing", I mean specifically custom formatting of the text you get back from whispering. It currently involves multiple regex statements for me. Practically, I can't fathom how Braden would do that for everyone, not knowing the preferences of the majority of users. I'm not saying he couldn't do something clever, as whispering is clever ... let's wait and see.
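
To illustrate what "multiple regex statements" could look like when the rules are user-supplied rather than hard-coded, here is a minimal sketch. The rule list, the `apply_rules` name, and the spoken "new paragraph" command are all hypothetical examples, not my actual rules and not anything whispering does today.

```python
import re
from typing import List, Tuple

# Each rule is (regex pattern, replacement), applied in order.
Rule = Tuple[str, str]


def apply_rules(text: str, rules: List[Rule]) -> str:
    """Run every user-defined regex substitution over the transcript, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Hypothetical defaults a user could edit in a settings text box.
EXAMPLE_RULES: List[Rule] = [
    (r"[ \t]+", " "),                                   # collapse repeated spaces
    (r"\s+([,.!?])", r"\1"),                            # drop space before punctuation
    (r"(?i)\s*\bnew paragraph\b[.,]?\s*", "\n\n"),      # spoken command -> paragraph break
]

raw = "Hello  world , done . new paragraph. Next part"
print(apply_rules(raw, EXAMPLE_RULES))
```

Because the rules are just data, each user can keep their own list, which sidesteps the problem of guessing what the majority wants.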

cogscides commented 1 month ago

Congratulations on your comeback! Waiting for upcoming updates 🖤

braden-w commented 1 month ago

Sorry for the delays! I’m still making a ton of updates, which you can follow in the 104-feat-svelte-5 support branch. Extension development has been having a few hiccups, but hopefully should be working very soon!

christopherAI2022 commented 1 month ago

I think the most important thing to add is a way to download the audio.

I whipped up my own little Whisper feature on our charity's website, and the download-audio option has saved me many a time.

When Whispering for some reason fails and I have already spent a minute or so thinking out loud... one time I had to even pipe the audio into Audacity to record it to upload it to our site. If there simply was a "Download Audio" option this would potentially rectify that issue.

andydataguy commented 1 month ago

Glad you're back! The chrome plugin has been a gamechanger for me.

BTW please add a link to donate/gift you somewhere in the README.md. Would love to support your work 😊

Update: the issue I was facing was on the API side. OpenAI was not auto-charging!

doxgt commented 1 month ago

Nice to see the new web-based interface at: https://whispering.bradenwong.com/

Is it supposed to be working though? It did not seem to work for me in my brief testing tonight. Nevertheless, looking good!

braden-w commented 1 month ago

Thank you so much everyone! I'm starting to come close to establishing a stable CI/CD pipeline and getting my updates pushed out consistently.

Sometimes if you record a minute or two of speaking and it just totally gives you a bad response, there's no way to resend the request, which is very frustrating. I end up having to type out what I said by listening back or just re-record and hope it's good.

Also, custom post-processing would be awesome. https://platform.openai.com/docs/guides/speech-to-text/improving-reliability

I just wanted to say that retrying requests was the first priority of this update, and @cgbur, hopefully you should now be able to resend messages via the /recordings page!

Post processing is my plan for v5.0 :D

DavidGP commented 4 weeks ago

So glad that you're back and pushing a lot of updates!

If I may kindly remind you of the close-to-tray feature from a fork of Whispering: it would be very useful if Whispering could be minimized to the tray instead of exiting when the close button is clicked.

I believe, especially with the latest update that includes the two feedback sounds and the red recording icon in the tray, there's no longer a need to keep Whispering open as a window, and even less so in the foreground.

braden-w commented 1 week ago

Thank you guys again for the support in this thread! Really appreciate the kind words.

I split off some of your recommendations to separate issues, and I'm now going to close this specific thread since I'm back for real. :)

Thank you again and keep the issues coming!