Araynilmar closed this 2 years ago
What OCR method are you using? If you're using Tesseract, read more here: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
Binary Image Compare.
The Tesseract method is very inaccurate for Chinese; it's not a good choice.
I always import my images into FineReader. It works great for Cantonese, Mandarin and Japanese. All you have to do is export the subtitle images and then import them into FineReader. Then choose the option to save one output file for each input file.
Hm, the OCR fix list works together with spell check, and if a Hunspell dictionary does not exist (I could not find one for Chinese) then it's a bit hard. You could try creating an empty Hunspell dictionary (like this: zh-CN.zip) and use `<PartialLinesAlways>` and `<PartialWordsAlways>` from `zho_OCRFixReplaceList.xml`.
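For anyone who wants to try the suggestion above without the attached zip, here is a minimal Python sketch. It writes an empty Hunspell dictionary pair and builds a small replace list containing the two sections mentioned. The file name `zh_CN`, the child element names `LinePart` and `WordPart`, the `from`/`to` attributes, and the sample character fixes are all assumptions for illustration; check them against an existing `*_OCRFixReplaceList.xml` shipped with Subtitle Edit before relying on them.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def write_empty_hunspell(folder, name="zh_CN"):
    """Write an empty Hunspell dictionary pair: <name>.aff and <name>.dic."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # The .aff file only declares the encoding; no affix rules are defined.
    (folder / f"{name}.aff").write_text("SET UTF-8\n", encoding="utf-8")
    # The first line of a .dic file is the word count; 0 means no words.
    (folder / f"{name}.dic").write_text("0\n", encoding="utf-8")


def build_replace_list(line_fixes, word_fixes):
    """Return replace-list XML with the two 'always applied' sections.

    Element and attribute names below are assumed, not confirmed.
    """
    root = ET.Element("OCRFixReplaceList")
    lines = ET.SubElement(root, "PartialLinesAlways")
    for wrong, right in line_fixes:
        ET.SubElement(lines, "LinePart", {"from": wrong, "to": right})
    words = ET.SubElement(root, "PartialWordsAlways")
    for wrong, right in word_fixes:
        ET.SubElement(words, "WordPart", {"from": wrong, "to": right})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    write_empty_hunspell("Dictionaries")  # hypothetical output folder
    # Made-up examples of look-alike characters an OCR pass might confuse.
    print(build_replace_list([("巳经", "已经")], [("曰本", "日本")]))
```

The "always" sections are used here because, with an empty dictionary, there is no spell check result to gate the other replacement rules on.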
Are there better solutions now, in 2022? The latest Subtitle Edit 3.6.4 portable still doesn't have a dedicated Chinese dictionary for handling .sup subtitles. Is (ABBYY) FineReader free? It doesn't seem like it. It also doesn't directly support the .sup file format.
The latest Tesseract is now 5.0.0. Subtitle Edit 3.6.4 can download it instead of relying on the old default, 3.0.2. However, recognition still produces many mistakes. I now understand why Araynilmar said that above. Thanks.
When I use the "Binary Image Compare" method to OCR subtitles, I would like to have a dictionary to help me correct the OCR errors, but unfortunately the language is not in the list of dictionaries available for download.
So I want to create a dictionary myself for Sino-Tibetan languages such as Chinese, Taiwanese and Cantonese, which the download list doesn't offer.
Where can I find the Property Expansion of the OCR Dictionary?
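As a starting point for building such a dictionary by hand, the Hunspell `.dic` format itself is simple: the first line is the (approximate) word count, followed by one entry per line. A minimal Python sketch, assuming you already have a plain word list; the function name and output path are hypothetical, and whether word-based spell check helps much for unsegmented Chinese text is an open question:

```python
from pathlib import Path


def write_hunspell_dic(words, dic_path):
    """Write a Hunspell .dic file from an iterable of words.

    Blank entries are dropped and duplicates removed. Returns the
    number of words written.
    """
    unique = sorted({w.strip() for w in words if w.strip()})
    # First line: word count. Then one word per line.
    lines = [str(len(unique)), *unique]
    Path(dic_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(unique)


if __name__ == "__main__":
    # Hypothetical sample entries; replace with a real word list.
    count = write_hunspell_dic(["字幕", "日本", "字幕", ""], "zh_CN.dic")
    print(f"wrote {count} words")
```

A matching `.aff` file is still needed next to the `.dic`; one containing only `SET UTF-8` is enough when no affix rules are used.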