apm1467 / videocr

Extract hardcoded subtitles from videos using machine learning
MIT License
506 stars 117 forks

Tesseract output improvement #5

Open halfguru opened 4 years ago

halfguru commented 4 years ago

Hi,

First of all, thank you for your work. I've been looking for OCR projects because it's very difficult to find English subtitles for Chinese YouTube shows.

I'm wondering whether you've tried improving the Tesseract output with image preprocessing techniques, as illustrated here. The use_fullframe argument could be changed to accept specific rectangular coordinates. The Tesseract wiki also indicates that dark text on a light background is preferable, so an option to invert the colors could help. Binarisation could further isolate the subtitles. Finally, I believe adding the --psm 6 option to the Tesseract config, which tells it to assume a single uniform block of text, would be beneficial.
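
To make the suggestions concrete, here is a minimal sketch of the preprocessing steps (crop to a rectangle, invert, binarise, then OCR with --psm 6). It is not videocr's code: the function names, the (top, bottom, left, right) box format, the fixed threshold of 128, and the chi_sim language code are all assumptions chosen for illustration. It expects a greyscale frame as a NumPy uint8 array and assumes pytesseract is installed for the final step.

```python
import numpy as np


def crop_region(gray, top, bottom, left, right):
    """Keep only the rectangle where subtitles are expected (hypothetical helper)."""
    return gray[top:bottom, left:right]


def invert(gray):
    """Flip intensities so light subtitles become dark text on a light background."""
    return 255 - gray


def binarize(gray, threshold=128):
    """Map every pixel to pure black (0) or pure white (255).

    The fixed threshold is an assumption; real footage may need tuning.
    """
    return np.where(gray < threshold, 0, 255).astype(np.uint8)


def preprocess(gray, box, threshold=128):
    """Crop, invert, and binarise one greyscale frame.

    box is (top, bottom, left, right) in pixels, a format made up for this sketch.
    """
    top, bottom, left, right = box
    region = crop_region(gray, top, bottom, left, right)
    return binarize(invert(region), threshold)


def ocr_subtitle(gray, box, lang="chi_sim"):
    """Run Tesseract on the preprocessed region, treating it as one block of text."""
    import pytesseract  # imported lazily so the preprocessing works without it

    image = preprocess(gray, box)
    return pytesseract.image_to_string(image, lang=lang, config="--psm 6")
```

For example, `ocr_subtitle(frame, (600, 700, 100, 1180))` would read only a band near the bottom of a 1280x720 frame; those coordinates are purely illustrative. Inverting before thresholding only makes sense for light subtitles on a darker picture, so both steps would probably need to be optional.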

mongy910 commented 3 years ago

@halfguru These are really good insights. In the year since you posted this, have you found any better solutions? I have the same use case as you (reading Chinese soft captions).