JonathanFly opened 1 year ago
Yea, Whisper has a tendency to hallucinate on non-speech segments. That is why I added extra voice activity detection via Silero VAD. Maybe you can improve the issue by setting the default threshold for the VAD to something higher than 0.5 here.
Currently, segments are only discarded if they contain no speech at all. I have an idea to use the VAD to cut the non-speech parts out of segments that contain only a little speech. Maybe that also helps.
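For reference, raising the threshold (and trimming with the returned timestamps) could look something like this with Silero VAD's bundled helpers; the 0.7 and the file name are just example values:

```python
import torch

# Load Silero VAD plus its helper utilities from torch hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, collect_chunks = utils

audio = read_audio('segment.wav', sampling_rate=16000)

# Raise the speech-probability threshold above the 0.5 default so that
# borderline noise is less likely to be counted as speech.
speech_ts = get_speech_timestamps(audio, model, sampling_rate=16000, threshold=0.7)

if speech_ts:
    # Cut the non-speech parts out and keep only the detected speech.
    audio = collect_chunks(speech_ts, audio)
```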
I don't know what the best way would be to feed in audio directly. This repo uses some tool to capture the speaker output. There are plenty of Whisper showcases that use mic input; I'll see if I can add something like that as an option when I have some free time.
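A bare-bones mic-input sketch with the sounddevice package (not something this repo uses yet) that yields the same 16 kHz mono int16 bytes the ffmpeg pipe delivers:

```python
import queue

import sounddevice as sd

SAMPLE_RATE = 16000
audio_queue = queue.Queue()

def callback(indata, frames, time, status):
    # Runs on the audio thread; push raw s16le mono bytes to the decode loop.
    audio_queue.put(bytes(indata))

# RawInputStream delivers the same format the ffmpeg pipe produces.
with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1, dtype='int16',
                       callback=callback):
    while True:
        chunk = audio_queue.get()
        # ...feed chunk into the existing decoding loop...
```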
I frankenstein'd an absurd way to do this: restream the stream to localhost.
1. Install OBS Studio.
2. Install the NVIDIA Audio Effects SDK: https://www.nvidia.com/en-us/geforce/broadcasting/broadcast-sdk/resources/
3. In OBS, set up your stream to send your desktop audio to a custom RTMP server on localhost: File, Settings, Stream, Service: Custom..., `rtmp://127.0.0.1:1234`.
4. Right-click 'Desktop Audio' in the Audio Mixer, select 'Filters', click + at the bottom left, and add Noise Suppression. Pick NVIDIA Noise Removal. I set it to the max level of 100; it seems less aggressive than the desktop version. (Click 'Start Recording' instead of 'Start Streaming' to check the effect on the audio.)
Then I changed this ffmpeg call in translator.py to keep trying to open a stream instead of bailing:
```python
def open_stream(stream, direct_url, preferred_quality):
    if direct_url:
        try:
            process = (
                # listen=1 makes ffmpeg wait for an incoming connection
                # instead of bailing out when the URL isn't live yet.
                ffmpeg.input(stream, re=None, listen=1, loglevel="panic")
                .output("pipe:", format="s16le", acodec="pcm_s16le", ac=1, ar=SAMPLE_RATE, y=None)
                .global_args('-nostdin')  # disable interaction on standard input
                .run_async(pipe_stdout=True)
            )
        except ffmpeg.Error as e:
            raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
        return process, None
```
(Not sure all the FFmpeg parameters are needed; I brute-forced it and stopped when it worked.)
Then run `python translator.py --direct_url rtmp://127.0.0.1:1234`, wait a second until the script is listening, then click Start Streaming in OBS. This will stream your regular computer desktop audio, whatever you have playing, from any source. I'm sure someone who knows OBS can set it up in a more selective way so it's not ALL your desktop audio.
On the one hand, restreaming a stream to your own computer is an abomination of a solution. On the other hand, OBS is a state-of-the-art real-time audio processing powerhouse, so you could leverage any other audio processing in the chain here; the sky's the limit. It's also kind of nice to freely browse different streams while leaving the translator open. Sometimes it even works fine jumping between languages!
Also, OBS should be able to use streamlink directly: https://github.com/dd-center/obs-streamlink
That plugin should separate stream and desktop audio cleanly, because OBS pulls the audio in directly through streamlink. I tried it, but the audio kept stuttering, so I just used my desktop audio for now.
While doing this I was thinking: if you cleanly separate the desktop/stream audio (via the streamlink plugin, a 'virtual audio cable', or whatever), then a second OBS instance could restream the original audio and video with subtitles overlaid on top using OBS's text/greenscreen features. I briefly tested just showing the terminal window with the translation on screen, with the background made transparent, like a typical stream chat overlay, and potentially delaying the video so the subs line up. A little overkill, though.
> This will stream your regular computer desktop audio, whatever you have playing, from any source. I'm sure someone who knows OBS can set it up in a more selective way so it's not ALL your desktop audio.
I'm trying out OBS for the first time, but I found that you can select "Application audio capture (BETA)" as a source and it does exactly what you want. So no need for streamlink in that case.
Integrating with OBS seems like an interesting idea. It would have quite a few nice features:
We would want a pipeline like: OBS: capture audio & noise suppression -> ffmpeg: convert to 16kHz mono int16 -> whisper: decode -> OBS: overlay text on stream
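A rough sketch of the middle two stages, assuming the OBS setup from above (the RTMP URL, model size, and chunk length are placeholders):

```python
import ffmpeg
import numpy as np
import whisper

SAMPLE_RATE = 16000

# Stage 2: ffmpeg listens for the OBS stream and emits 16 kHz mono int16.
process = (
    ffmpeg.input("rtmp://127.0.0.1:1234", listen=1)
    .output("pipe:", format="s16le", acodec="pcm_s16le", ac=1, ar=SAMPLE_RATE)
    .run_async(pipe_stdout=True)
)

# Stage 3: decode fixed-size chunks with Whisper.
model = whisper.load_model("base")
chunk_bytes = SAMPLE_RATE * 2 * 5  # 5 seconds of 16-bit mono samples
while raw := process.stdout.read(chunk_bytes):
    audio = np.frombuffer(raw, np.int16).astype(np.float32) / 32768.0
    print(model.transcribe(audio)["text"])
```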
There doesn't seem to be a way to pipe data in or out of OBS directly; the best we get is streaming to a URL. It would definitely be possible to do this with plugins/scripts, but I couldn't find any useful ones so far. Maybe I would have to build that myself. At that point, I think it would be easier to turn the entire pipeline into an OBS script.
I toyed around with that idea for a bit, and unfortunately there are some roadblocks: e.g., I couldn't find a way to activate virtual environments for OBS scripts, so one would either have to install everything into the global Python install (yikes) or, even dirtier, append the venv's site-packages to the path. It seems hacky but doable.
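The dirty version is just a couple of lines at the top of the OBS script (the path here is made up):

```python
import sys

# OBS scripts can't activate a venv, so point OBS's embedded Python at an
# existing venv's site-packages instead. Adjust the path to your setup.
VENV_SITE_PACKAGES = "/path/to/venv/lib/python3.10/site-packages"
if VENV_SITE_PACKAGES not in sys.path:
    sys.path.append(VENV_SITE_PACKAGES)

import whisper  # now resolved from the venv's site-packages
```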
> - Customizable OSD (as suggested in #6, "Have you considered adding this stalled project UI functionality")
I've been poking at this. I think just outputting the captions to a webpage gets you a lot of the value by itself, without OBS. As long as you can point a browser at it, you can position it on screen wherever you want (even from Colab, with a little work) and even make it transparent with some browser extensions. And since it's just a web page, it's super easy for anyone to change the size, appearance, or layout, use browser tools to auto-translate the words, copy and paste, whatever.
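To make that concrete, it can be as small as a static page polling a text file that the transcriber keeps rewriting; this is just a sketch of the idea, not the code from my experiments:

```python
# serve_captions.py: serve index.html and captions.txt from the working dir.
# The transcriber just keeps rewriting captions.txt with the latest line.
from http.server import HTTPServer, SimpleHTTPRequestHandler

PAGE = """<!doctype html><body style="font-size:3em;color:white;background:black">
<div id="cap"></div>
<script>
setInterval(async () => {
  const r = await fetch('captions.txt', {cache: 'no-store'});
  document.getElementById('cap').textContent = await r.text();
}, 500);
</script></body>"""

with open("index.html", "w", encoding="utf-8") as f:
    f.write(PAGE)

HTTPServer(("127.0.0.1", 8000), SimpleHTTPRequestHandler).serve_forever()
```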
BTW, faster_whisper integrated Silero VAD. It seems to have some rough spots so far, but it could be a nice upgrade, and they'll improve the integration over time.
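If anyone wants to try it, the option looks roughly like this (the model size and threshold are just example values):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# vad_filter runs Silero VAD before decoding; vad_parameters tweaks it.
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters={"threshold": 0.7},
)
for segment in segments:
    print(segment.text)
```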
I'll post my translator.py experiments, but I ripped the guts out just messing around, so it's sort of a mess, sort of refactored, and probably somewhat broken. Some of it was refactored by ChatGPT, because I was trying out different refactoring processes and this project was the perfect small size to fit entirely in the context window. There are some cool actual upgrades, like using numpy_ringbuffer, plus way too many added list comprehensions, and an amazing 100% failure rate every single time it tried to extract functions out of the long main() function, for some reason.
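The ringbuffer part is simple enough to show here; a rough sketch of the idea (the 30-second capacity is arbitrary):

```python
import numpy as np
from numpy_ringbuffer import RingBuffer

SAMPLE_RATE = 16000

# Keep only the last 30 seconds of audio; old samples fall off the front
# automatically, instead of a plain list growing forever.
buffer = RingBuffer(capacity=30 * SAMPLE_RATE, dtype=np.float32)

def on_new_chunk(chunk):
    buffer.extend(chunk)      # append the newest samples
    return np.array(buffer)   # contiguous window to hand to the model
```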
BTW, there are some cool features in this project that in a perfect world could be integrated. You can see the first-pass caption, and then the update when Whisper reruns given more context:
https://github.com/openai/whisper/discussions/608
I've seen some of the examples that do real-time transcription, and they're great, but they all record short snippets of audio and then transcribe them one after the other. This has two problems:
1. There are gaps in the audio between one recording ending and the next starting.
2. A snippet can cut off mid-phrase, and the model never gets a chance to fix that transcription once more audio arrives.
I tackled these problems by always recording audio in a thread, so there are no gaps, and by concatenating the previous audio data with the latest recording. This lets you rerun transcription on previously incomplete audio snippets, so the model can correct issues from when it transcribed a recording that was cut off.
https://github.com/davabase/transcriber_app https://github.com/davabase/whisper_real_time
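Roughly, the approach looks like this (a simplified sketch, not the actual code from those repos):

```python
import threading

import numpy as np
import whisper

model = whisper.load_model("base")
lock = threading.Lock()
chunks = []  # float32 arrays, appended continuously by the capture thread

def on_audio(chunk):
    # Called from the recording thread for every new block of samples,
    # so nothing is lost while a transcription is running.
    with lock:
        chunks.append(chunk)

def transcribe_so_far():
    with lock:
        audio = np.concatenate(chunks)
    # Rerunning on all the audio so far lets the model fix words that were
    # cut off at earlier snippet boundaries.
    return model.transcribe(audio)["text"]
```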
I'm pretty sure I made things worse, but I left the code up anyway as an example of using OBS like this, for anyone else searching: https://github.com/JonathanFly/faster-whisper-livestream-translator
This is a pretty sweet repo; I've been using it a couple times a week recently. faster-whisper lets you actually run the large model in real time with good latency on a 3090. Actually it's even more insane: I run TWO LARGE MODELS AT THE SAME TIME, two stream-translators, so that I can have dual subtitles, one transcribed and one translated. It works fine on a 3090 as long as you're just doing normal desktop stuff! Wild.
But when streams have a lot of background noise (music, game sounds), I found you NEED to add some decent real-time noise reduction or Whisper just faceplants over and over.
Mainly I've used the NVIDIA Broadcast tool to do this in real time, with a virtual cable if needed to get the audio routed correctly. Whisper is back at full power if I do this. But since stream-translator pulls the audio stream in directly, I have to use something else instead.
If this could take in mic/speaker device audio as an alternative to streamlink, that would do it. Using this option loses the simplicity and latency benefits of streaming directly, but the alternative is Whisper collapsing in confusion on some streams. I know other repos already take direct audio input, but ideally I want to stop bouncing between them...
Maybe there's a more elegant way to accept direct audio input that doesn't require a wacky virtual cable or whatever? OBS Studio integrates NVIDIA noise reduction via the Broadcast SDK. Or there could be a good open-source solution; I tried a couple, but none of the real-time ones were close to good enough compared to the NVIDIA version.