WofWca / jumpcutter

⏩ Fast-forwards long pauses between sentences — watch lectures ~1.5x faster (browser extension)
https://chrome.google.com/webstore/detail/jump-cutter/lmppdpldfpfdlipofacekcfleacbbncp
GNU Affero General Public License v3.0

feat: jump based on "Voice Activity Detection"/"Speech Recognition" #164

Closed nvxos closed 1 year ago

nvxos commented 1 year ago

I know this is no small feat, but a feature that would be insanely good in my opinion is for the extension to be able to jump/cut based on VAD (Voice Activity Detection) or Speech Recognition.

I tried doing some research on the matter, mainly to find an editor that could cut the parts of a video that don't contain speech. For example, there's the new paid jumpcutter from carykh (jumpcutter.com; it has a GUI, is in beta, and has a trial) that can now jump/cut using VAD, but it's a bit slow and lacking, and you can't use it from the CLI, which is what I'm mainly looking for. There are also cloud-based services like wisecut.video, but they're not suitable for my use case because they're priced/limited by video time, size, etc.

It was while doing this research that I found this extension, which is actually pretty useful for a different use case than the one I was looking into (I have a lot of media files from which I would like to trim the non-speech parts, but I also consume quite a bit of content online, and I'm glad to have found this extension for that).

Having not found anything that could do what I wanted, I'm now looking into maybe coding a script myself to do it. So I thought I could share the resources I've found so far, to help implement this in the extension if it ever gets implemented, which I think would be such a huge and useful feature. Sadly, everything I've found is mainly in Python, so I'm not sure how well it would apply to this project.

But here's what I've got so far:

https://archive.is/20220527092223/https://towardsdatascience.com/automatic-video-editing-using-python-324e5efd7eba
https://archive.is/S6a4V
https://wandb.ai/yvrjsharma/posts/reports/Video-Editing-Using-Automatic-Speech-Recognition---VmlldzoyMTY4OTQy
https://realpython.com/python-speech-recognition/
https://thegradient.pub/one-voice-detector-to-rule-them-all/

https://github.com/openai/whisper
https://github.com/snakers4/silero-vad
https://github.com/wiseman/py-webrtcvad
https://github.com/Picovoice/cobra
https://github.com/alibaba-damo-academy/FunASR

Edit: Adding some links which seem more suitable for this extension:
https://github.com/ccoreilly/vosk-browser
https://github.com/wiseman/py-webrtcvad

(Silero VAD seems to be the best model to use)
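
For illustration only, here is a minimal sketch of the post-processing step such a script would likely need, whichever VAD is used: turning per-frame speech probabilities into the non-speech ranges worth cutting. Everything in it is an assumption rather than part of Silero VAD, py-webrtcvad, or this extension: the function name `silent_ranges`, the 30 ms frame size, the 0.5 threshold, and the 300 ms minimum pause length are all hypothetical defaults.

```python
from typing import Iterable, List, Tuple


def silent_ranges(
    speech_probs: Iterable[float],
    frame_ms: int = 30,         # assumed duration of one VAD frame
    threshold: float = 0.5,     # assumed: a frame counts as speech at or above this
    min_silence_ms: int = 300,  # assumed: shorter pauses are kept, not cut
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) ranges of non-speech long enough to cut.

    `speech_probs` is a hypothetical input: one speech probability per
    fixed-size frame, as produced by some VAD model.
    """
    ranges: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current silent run
    # A trailing "speech" sentinel closes a silent run that reaches the end.
    for i, prob in enumerate(list(speech_probs) + [1.0]):
        is_speech = prob >= threshold
        if not is_speech and start is None:
            start = i
        elif is_speech and start is not None:
            if (i - start) * frame_ms >= min_silence_ms:
                ranges.append((start * frame_ms, i * frame_ms))
            start = None
    return ranges


if __name__ == "__main__":
    probs = [0.9, 0.1, 0.1, 0.1, 0.8, 0.2, 0.9, 0.0, 0.0]
    print(silent_ranges(probs, frame_ms=100, min_silence_ms=200))
```

The resulting ranges could then be handed to a cutting tool or, in a player, used as the spans to fast-forward through. Working in integer milliseconds avoids floating-point comparisons when checking the minimum pause length.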

WofWca commented 1 year ago

I haven't read through this yet, but take a look at #46 for now.

nvxos commented 1 year ago

Oh yeah, sorry, I didn't even think about searching the existing issues for this because of how rare it seems to be for "loudness-based" software to have this kind of feature. That's cool. Reading through the issue you linked, I guess my post doesn't bring much to the table, so feel free to close it if you think it's just a duplicate on this subject.

WofWca commented 1 year ago

Thanks a lot for the links! I guess I'll close this as a duplicate, and let's continue the discussion there.

FYI the extension is modular enough in this regard, so if you have an algorithm, it shouldn't be hard to integrate it into the extension. Here's the responsible part.

Duplicate of #46.