SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

[Feature Request] OCR Video Frame to Text #8474

Open zeem12 opened 4 weeks ago

zeem12 commented 4 weeks ago

Yes, I’ve checked out issue #340 and others, but I think my concept is a bit different from what has been requested before.

Right now, I’m subbing a Japanese TV program. As you might know, they often use large subtitles or colorful captions (the image below is just an example).

[screenshot: example of large, colorful captions in a Japanese TV program]

To speed up my workflow, I usually take a screenshot, run it through an OCR engine like Google Lens, and paste the result back into Subtitle Edit. But I was wondering: could this be done faster, and all in one step?

My idea is a feature similar to Whisper, but instead of converting the selected lines’ audio to text, it would extract one video frame from the start/middle/end of each selected line and convert it to text with Tesseract or another OCR engine.

It wouldn’t have to automatically detect the start and end of hardcoded subtitles by capturing every frame; it would just extract one frame from each selected line (which can be done with ffmpeg, as in Whisper’s selected-lines feature) and then OCR it.
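The per-line pipeline described above can be sketched with the ffmpeg and Tesseract command-line tools. This is just a minimal illustration of the idea, not Subtitle Edit code: the function names, the choice of the middle frame, and the temporary PNG filenames are all my own assumptions, and it presumes `ffmpeg` and `tesseract` are on the PATH.

```python
import subprocess

def frame_command(video_path, timestamp, out_png):
    """Build an ffmpeg command that grabs a single frame at `timestamp`
    (in seconds) from `video_path` and writes it as a PNG."""
    return [
        "ffmpeg", "-ss", f"{timestamp:.3f}", "-i", video_path,
        "-frames:v", "1", "-y", out_png,
    ]

def ocr_selected_lines(video_path, lines, lang="jpn"):
    """For each (start, end) pair in `lines`, grab the frame at the
    midpoint of the line and OCR it with the Tesseract CLI.
    Returns one recognized string per selected line."""
    texts = []
    for i, (start, end) in enumerate(lines):
        png = f"frame_{i}.png"
        # Extract a single frame at the middle of the line's time range.
        subprocess.run(frame_command(video_path, (start + end) / 2, png),
                       check=True, capture_output=True)
        # "stdout" as the output base makes tesseract print the text
        # instead of writing a .txt file; -l selects the language model.
        result = subprocess.run(
            ["tesseract", png, "stdout", "-l", lang],
            check=True, capture_output=True, text=True)
        texts.append(result.stdout.strip())
    return texts
```

For example, `ocr_selected_lines("show.mp4", [(12.0, 15.5), (20.0, 23.0)])` would OCR the frames at 13.75 s and 21.5 s. A real implementation inside Subtitle Edit could reuse the ffmpeg integration that the Whisper selected-lines feature already has.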


I think this could be a game-changing feature, especially for those who work with Japanese/Korean TV and variety shows, which use a lot of on-screen text.

epubc commented 4 weeks ago

Have you tried the VideoSubFinder program? I think it meets your requirements.

zeem12 commented 4 weeks ago

> Have you tried the VideoSubFinder program? I think it meets your requirements.

I’ve already tried VideoSubFinder; it’s not efficient enough for my needs. I’m hoping to do everything in one window within Subtitle Edit itself, similar to how we transcribe audio from the selected lines with Whisper in a single click. I’d also like more flexibility to manually crop and time the subtitles.

VideoSubFinder captures all text within a fixed “capture window”, and the timing is generated automatically. However, captions in Japanese TV programs don’t usually have a fixed position. I also don’t need VideoSubFinder’s auto-timing feature, since I prefer to time the subtitles manually. Moreover, I only need to OCR the hardcoded captions I select, not all of them automatically as VideoSubFinder does.

The feature I’m requesting would also benefit teams and collaborations. For instance, one person could create the timing for the hardcoded captions while another translates them: the translator could extract the text based on the timing that has already been set. This could really streamline the process and make collaborative work much more efficient.