SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

How to create an OCR dictionary for a language that the download list doesn't offer #3443

Closed. Araynilmar closed this issue 2 years ago

Araynilmar commented 5 years ago

When I use the "Binary image compare" method to OCR subtitles, I would like a dictionary to help me correct the OCR errors, but unfortunately my language is not in the list of dictionaries available for download.

So I want to create a dictionary myself for Sino-Tibetan languages such as Chinese, Taiwanese, and Cantonese, which the download list doesn't offer.

Where can I find a description of the OCR dictionary format and its properties?

niksedk commented 5 years ago

What OCR method are you using? If you're using Tesseract read more here: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

Araynilmar commented 5 years ago

Binary image compare.

The Tesseract method is too inaccurate for Chinese, so it's not a good choice.

gabriellluz commented 5 years ago

I always import my images into FineReader. It works great for Cantonese, Mandarin, and Japanese. All you have to do is export the subtitle images and then import them into FineReader. Then choose the option to save one output file for each input file.

niksedk commented 5 years ago

Hm, the OCR fix list works together with spell check, so if no Hunspell dictionary exists for the language (I could not find one for Chinese), it's a bit hard. You could try creating an empty Hunspell dictionary (like this: zh-CN.zip) and using <PartialLinesAlways> and <PartialWordsAlways> from the zho_OCRFixReplaceList.xml.
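
A minimal sketch of that suggestion, in Python, in case it helps anyone trying it. It writes an empty Hunspell pair (an .aff file that only declares the encoding and a .dic file whose word count is 0) and a small replace list containing the two sections named above. Several things here are assumptions and not confirmed in this thread: the base file name zh_CN, the child element names LinePart and WordPart with their from/to attributes (please compare against an existing *_OCRFixReplaceList.xml before relying on them), and the sample character pairs, which are only illustrative look-alike confusions.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_empty_hunspell(directory: Path, name: str = "zh_CN") -> tuple[Path, Path]:
    """Write an empty Hunspell pair: .aff with only the encoding, .dic with a zero word count."""
    directory.mkdir(parents=True, exist_ok=True)
    aff = directory / f"{name}.aff"
    dic = directory / f"{name}.dic"
    aff.write_text("SET UTF-8\n", encoding="utf-8")  # affix file: encoding declaration only
    dic.write_text("0\n", encoding="utf-8")          # first line of a .dic is the word count
    return aff, dic


def build_fix_list(line_fixes: dict[str, str], word_fixes: dict[str, str]) -> ET.Element:
    """Build a replace list with the two sections mentioned in the thread.

    ASSUMPTION: the child tags (LinePart, WordPart) and the from/to attributes
    are not stated in the thread; verify them against a shipped replace list.
    """
    root = ET.Element("OCRFixReplaceList")
    lines = ET.SubElement(root, "PartialLinesAlways")
    for wrong, right in line_fixes.items():
        ET.SubElement(lines, "LinePart", {"from": wrong, "to": right})
    words = ET.SubElement(root, "PartialWordsAlways")
    for wrong, right in word_fixes.items():
        ET.SubElement(words, "WordPart", {"from": wrong, "to": right})
    return root


def write_fix_list(path: Path, root: ET.Element) -> None:
    """Serialize the replace list as UTF-8 XML with an XML declaration."""
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Demo only: writes into a temporary folder, not into a Subtitle Edit installation.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp)
        write_empty_hunspell(target)
        # Hypothetical sample fixes for look-alike characters.
        fixes = build_fix_list({"曰本": "日本"}, {"己经": "已经"})
        write_fix_list(target / "zho_OCRFixReplaceList.xml", fixes)
        print(sorted(p.name for p in target.iterdir()))
```

The generated files would then need to be copied to wherever Subtitle Edit keeps its dictionaries. That location is not given in this thread, so the sketch deliberately writes to a temporary folder.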

supermansaga commented 2 years ago

Is there a better solution now, in 2022? The latest Subtitle Edit 3.6.4 portable still doesn't have a dedicated Chinese dictionary for handling .sup subtitles. Is (ABBYY) FineReader free? It doesn't seem so. It also doesn't directly support the .sup file format.

The latest Tesseract is now 5.0.0, and Subtitle Edit 3.6.4 can download it instead of relying on the old default, 3.0.2. However, recognition still produces many mistakes. I now understand why Araynilmar said that above. Thanks.