Sharrnah / whispering-ui

Native UI for the Whispering Tiger project - https://github.com/Sharrnah/whispering (live transcription / translation)
https://whispering-tiger.github.io/
MIT License
218 stars · 12 forks

implement this whispers model #10

Closed · dolev765 closed this 1 year ago

dolev765 commented 1 year ago

This is just a suggestion. I will try to implement it and show you how I did it using ChatGPT, as I don't really know how to code at all: https://github.com/EtienneAb3d/WhisperHallu

Sharrnah commented 1 year ago

Thanks. I have already seen that project. Since we already have VAD and AI noise cancellation, I think the only thing that differs is the addition of voice markers.

And their implementation has some drawbacks we have to think about.

In my opinion, that is a bit of a stupid way to prevent hallucination: adding audio to recordings just to cut it out of the transcription later. But if it works and is fast enough, I think we can live with it until there is a better speech-to-text model.

And I just read that they are cutting silent parts from the recordings. That's actually something that's currently not really done on my side. Since I already have code for that in the Bark plugin without using ffmpeg, I think I can add it and see how that improves things.
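
Such ffmpeg-free silence cutting can be sketched with plain NumPy. This is only an illustration, not the project's actual Bark-plugin code; the RMS threshold, 20 ms frame size, and `min_silence_sec` pause budget are assumed values:

```python
import numpy as np

def cut_silence(audio: np.ndarray, sample_rate: int,
                threshold: float = 0.01, min_silence_sec: float = 0.5) -> np.ndarray:
    """Drop near-silent stretches longer than min_silence_sec.

    audio is expected as mono float32 samples in [-1.0, 1.0].
    """
    frame = int(sample_rate * 0.02)  # 20 ms analysis frames
    keep = []
    silent_run = 0
    for start in range(0, len(audio), frame):
        chunk = audio[start:start + frame]
        # RMS energy of the frame decides "silent" vs "voiced"
        rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
        if rms < threshold:
            silent_run += len(chunk)
            # keep short pauses so speech does not sound chopped up
            if silent_run <= int(sample_rate * min_silence_sec):
                keep.append(chunk)
        else:
            silent_run = 0
            keep.append(chunk)
    return np.concatenate(keep) if keep else audio
```

Long pauses are shortened to at most `min_silence_sec`, so Whisper sees fewer empty stretches to hallucinate on while short natural pauses survive.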

dolev765 commented 1 year ago

I will try to help on my side; gonna keep you updated.

Sharrnah commented 1 year ago

Thanks. I added the silence cutting already in the master branch. Not 100% sure if it helped in preventing hallucinations; it worked relatively fine in my testing. Only rarely some hallucinations, mostly when realtime transcribing; the final transcription was fine most of the time. Will have a look if I can improve it a bit more.

dolev765 commented 1 year ago

@Sharrnah I cannot thank you enough

Sharrnah commented 1 year ago

Improved the silence trimming so it only cuts in the middle of a detected silence part (to prevent it from cutting into the start of words, for example).
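
The "cut only in the middle" idea can be sketched as: find a silence run that is long enough, then return its midpoint as the cut position, so the cut can never clip a word onset or tail. Again a hypothetical illustration with assumed threshold and frame values, not the repo's code:

```python
import numpy as np

def split_at_silence_center(audio: np.ndarray, sample_rate: int,
                            threshold: float = 0.01,
                            min_silence_sec: float = 0.3):
    """Return the sample index at the center of the first sufficiently
    long silence run, or None if no such run exists."""
    frame = int(sample_rate * 0.02)  # 20 ms analysis frames
    run_start = None
    for start in range(0, len(audio), frame):
        chunk = audio[start:start + frame]
        rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
        if rms < threshold:
            if run_start is None:
                run_start = start  # silence run begins here
            if start + frame - run_start >= int(sample_rate * min_silence_sec):
                # cut at the middle of the run, far from both words
                return (run_start + start + frame) // 2
        else:
            run_start = None  # voiced frame resets the run
    return None
```

Cutting at the run's center keeps a margin of silence on both sides, which is exactly what protects word boundaries.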

Also added normalization so very loud or quiet parts are moved more into normal ranges (which is also a thing the WhisperHallu project seems to do).
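
A minimal sketch of such normalization is simple peak normalization: scale the buffer so its loudest sample sits at a target level, which lifts quiet recordings and tames loud ones. The target value is an assumption; the actual method used here and in WhisperHallu may differ:

```python
import numpy as np

def normalize_peak(audio: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    """Scale audio so its loudest absolute sample equals target_peak."""
    peak = float(np.max(np.abs(audio)))
    if peak == 0.0:
        return audio  # pure silence, nothing to scale
    return (audio * (target_peak / peak)).astype(audio.dtype)
```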

Actually found another issue while working on the silence trimming and normalization features: it reduced the recording quality when the audio had to be resampled, because WASAPI did not allow requesting the correct sample rate from the audio device directly.

(Turns out, resampling each audio chunk separately and then merging them together isn't so great.)

So that's fixed now as well in the master branch, which should increase the audio quality and might improve the transcription quality with it (even though Whisper is quite resilient to some noise).
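
The fix amounts to: buffer the raw chunks and resample the joined recording in one pass, instead of resampling each chunk on its own (which creates a small discontinuity at every chunk boundary). A sketch, using linear interpolation via `np.interp` as a stand-in for whatever resampler the project actually uses:

```python
import numpy as np

def resample_linear(audio: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample a whole buffer in one pass with linear interpolation."""
    duration = len(audio) / src_rate
    n_out = int(round(duration * dst_rate))
    src_t = np.arange(len(audio)) / src_rate  # timestamps of input samples
    dst_t = np.arange(n_out) / dst_rate       # timestamps of output samples
    return np.interp(dst_t, src_t, audio).astype(np.float32)

def merge_and_resample(chunks, src_rate: int, dst_rate: int) -> np.ndarray:
    # Resample once after merging; doing it per chunk introduces
    # boundary artifacts that degrade the recording quality.
    return resample_linear(np.concatenate(chunks), src_rate, dst_rate)
```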

So if the silence trimming does not help, that might help a bit.

I guess we are now just missing voice markers to have pretty similar functionality.

Sharrnah commented 1 year ago

@dolev765 Looked a bit into how they do the voice marker logic. But one thing I am unsure about: they add voice markers based on the language.

But what if you set the Whisper AI to autodetect the language? Then the language is not known beforehand. Would that mean we have to run the audio through the AI once beforehand, just to find out the language (which might be wrong anyway)?

Might not be an issue if you set the language yourself, but you don't always know beforehand what language other people speak, so I like to keep it on auto most of the time, except when I talk myself and know what language I speak. 🤔

Sharrnah commented 1 year ago

Thanks. But we already have a separate AI that can detect the language of text. However, we would need to guess the language based on audio, as that's what we have before the actual transcription. And it would have to be faster than Whisper itself (and probably more accurate).

Sharrnah commented 1 year ago

Implemented a first iteration of voice markers (see here: https://github.com/Sharrnah/whispering/blob/main/Models/STT/whisper_audio_markers.py).

Haven't decided yet whether to enable it by default, or where to put the option to enable/disable it in the UI (besides Advanced -> Settings).

Even found a little issue with the originally implemented regex, as it could keep parts of the voice marker text in the result.
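
The kind of cleanup such a regex has to do can be sketched like this: remove the marker phrase from the transcript while also swallowing surrounding whitespace and trailing punctuation, so no fragment of the marker leaks through. The marker word below is purely hypothetical; the real project uses its own per-language marker texts:

```python
import re

# Hypothetical marker phrase for illustration only.
MARKER = "Whisper"

def strip_markers(text: str, marker: str = MARKER) -> str:
    """Remove marker phrases plus adjacent whitespace and trailing
    punctuation, so no piece of the marker stays in the result."""
    pattern = re.compile(r"\s*" + re.escape(marker) + r"[.!,]?\s*",
                         re.IGNORECASE)
    return pattern.sub(" ", text).strip()
```

The `[.!,]?` part is what catches the case the original regex missed: Whisper often appends punctuation to the marker, and matching the bare marker alone would leave that punctuation behind.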

Sharrnah commented 1 year ago

Since voice markers are integrated now in the newest version, i will close this issue.

Feel free to open a new one if you find any issues.

dolev765 commented 1 year ago

Wow, that's amazing to hear! I will try it out and tell you how it goes.