collabora / WhisperLive

A nearly-live implementation of OpenAI's Whisper.
MIT License

Transcribe audio from videos ahead of time (browser extension) #107

Open kirawi opened 10 months ago

kirawi commented 10 months ago

What I mean is that transcriptions are computed ahead of the current playback position, e.g. up to 10 seconds ahead. This is helpful for languages such as Japanese (and presumably Chinese) where context determines which word is meant (e.g. the variations of かえる or まく). I presume this is why the transcription is constantly being modified rather than solely appended to, which is very jarring. Transcribing ahead of time should give enough context to significantly reduce how often this happens. I believe https://github.com/vantezzen/skip-silence does something similar to debounce how often videos get sped up based on spans of silence.
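To make the idea concrete, here is a minimal sketch of how a client could hold back text until enough following audio has been transcribed. It is not WhisperLive code: the `(start, end, text)` tuple layout, the function name `split_stable`, and the 10-second default are all assumptions for illustration.

```python
# Hypothetical sketch: only show segments that have enough "future" context.
# A segment is a (start_seconds, end_seconds, text) tuple; this layout is an
# assumption, not WhisperLive's actual data structure.

def split_stable(segments, transcribed_until, lookahead=10.0):
    """Split segments into (stable, pending).

    A segment counts as stable once transcription has advanced at least
    `lookahead` seconds past its end, so later context has had a chance
    to correct it. Everything else is still subject to revision.
    """
    cutoff = transcribed_until - lookahead
    stable = [seg for seg in segments if seg[1] <= cutoff]
    pending = [seg for seg in segments if seg[1] > cutoff]
    return stable, pending


segments = [(0.0, 4.0, "first"), (4.0, 9.0, "second"), (9.0, 14.0, "third")]
stable, pending = split_stable(segments, transcribed_until=20.0)
print([text for _, _, text in stable])   # segments ending at or before 10 s
print([text for _, _, text in pending])  # segments that may still change
```

With this split, only the stable list would be rendered as subtitles, so the displayed text is appended to rather than rewritten.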

I'd also suggest erasing past transcriptions after a period of silence, but I presume this might already be implemented?
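A rough sketch of the silence-based clearing, in case it is not already implemented. The segment tuples, the function name `prune_after_silence`, and the 3-second threshold are assumptions chosen for illustration only.

```python
# Hypothetical sketch: drop old transcript text once a long enough silence
# has passed. Segments are (start_seconds, end_seconds, text) tuples sorted
# by start time; this layout is an assumption.

def prune_after_silence(segments, now, max_silence=3.0):
    """Return only the segments that follow the most recent long silence."""
    kept = []
    for seg in segments:
        # A gap longer than max_silence between two segments resets the display.
        if kept and seg[0] - kept[-1][1] > max_silence:
            kept = []
        kept.append(seg)
    # Silence that is still ongoing at `now` also clears the display.
    if kept and now - kept[-1][1] > max_silence:
        kept = []
    return kept


history = [(0.0, 2.0, "hello"), (2.5, 4.0, "there"), (9.0, 10.0, "again")]
print(prune_after_silence(history, now=11.0))  # only the segment after the gap
print(prune_after_silence(history, now=14.0))  # empty: silence since 10 s
```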

zoq commented 10 months ago

If I understand you correctly, you would delay the playback by e.g. 10 seconds and use the 10 seconds as context, instead of doing it in a live setting directly from the beginning? Similar to caching a video, so the playback is continuous.

We haven't thought about it, but I agree that in certain scenarios this would be helpful and improve the user experience. We are currently working on a list of features, and we will add it to the list.

Let us know if this is what you had in mind.

kirawi commented 10 months ago

Yeah, pretty much. I thought that it would be possible to use the footage that is already buffered ahead of the playhead (e.g. on YouTube, which loads ahead to reduce buffering), but on second thought that may not be standard behavior for <video>.
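For reference, browsers expose what has been loaded through `HTMLMediaElement.buffered`, a list of time ranges. The sketch below only models the arithmetic of how much look-ahead those ranges would allow; the ranges are passed in as plain `(start, end)` tuples, the name `available_lookahead` is made up, and whether an extension can actually read the audio for that span is not something this thread establishes.

```python
# Hypothetical sketch: how many seconds of look-ahead the already-buffered
# media would allow. `buffered` mimics the browser's TimeRanges as a list of
# (start_seconds, end_seconds) tuples; this representation is an assumption.

def available_lookahead(buffered, playhead, wanted=10.0):
    """Return the usable look-ahead in seconds, capped at `wanted`."""
    for start, end in buffered:
        if start <= playhead <= end:
            # Only the range containing the playhead is contiguous with it.
            return min(wanted, end - playhead)
    # Playhead is not inside any buffered range: no look-ahead available,
    # so a client would have to fall back to plain live transcription.
    return 0.0


ranges = [(0.0, 30.0), (60.0, 90.0)]
print(available_lookahead(ranges, playhead=5.0))   # full window available
print(available_lookahead(ranges, playhead=25.0))  # limited by the range end
print(available_lookahead(ranges, playhead=40.0))  # nothing buffered here
```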

fallenangel3k commented 9 months ago

There is a difference between a live video/stream (like Twitch) and VOD (video on demand, like YouTube). It would be great if WhisperLive could handle both. For VOD it could "read ahead of time", as the original poster suggested, and put the text together into a straight line, because the constant rearranging is really hard to read. Only when no further data is available should it rely on its awesome code to put things right by listening word by word in real time. I think this clarifies the original question. I would love to see it implemented. <3