apm1467 / videocr

Extract hardcoded subtitles from videos using machine learning
MIT License
506 stars 117 forks source link

videocr

Extract hardcoded (burned-in) subtitles from videos using the Tesseract OCR engine with Python.

Input a video with hardcoded subtitles:

screenshot screenshot

# example.py

from videocr import get_subtitles

if __name__ == '__main__':  # This check is mandatory for Windows.
    print(get_subtitles('video.mp4', lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))

$ python3 example.py

Output:

0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ? 
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.

2
00:00:08,091 --> 00:00:10,635
休闲 时 光 …
For relaxing times, make it...

3
00:00:10,677 --> 00:00:12,595
三 得 利 时 光
Bartender, Bob Suntory time.

4
00:00:14,472 --> 00:00:17,142
我 要 一 杯 伏特 加
Un, I'll have a vodka tonic.

5
00:00:18,059 --> 00:00:19,019
谢谢
Laughs Thanks.

Performance

The OCR process is CPU intensive. It takes 3 minutes on my dual-core laptop to extract a 20 seconds video. More CPU cores will make it faster.

Installation

  1. Install Tesseract and make sure it is in your $PATH

  2. $ pip install videocr

API

  1. Return subtitle string in SRT format

    get_subtitles(
        video_path: str, lang='eng', time_start='0:00', time_end='',
        conf_threshold=65, sim_threshold=90, use_fullframe=False)
  2. Write subtitles to file_path

    save_subtitles_to_file(
        video_path: str, file_path='subtitle.srt', lang='eng', time_start='0:00', time_end='',
        conf_threshold=65, sim_threshold=90, use_fullframe=False)

Parameters