OCR rules - Githubissues

SubtitleEdit / subtitleedit

the subtitle editor :)

http://www.nikse.dk/SubtitleEdit/Help

GNU General Public License v3.0


OCR rules #5038

Closed iGom closed 6 months ago

iGom commented 3 years ago

Many Polish words end in "-ej", and OCR often reads that ending as "e]". Is it possible to create an OCR rule that always changes a word-final "e]" to "ej"?

niksedk commented 3 years ago

Hm, you could try:

a <PartialLines> entry <LinePart from="e] " to="ej " />

a <PartialWords> entry <WordPart from="e]" to="ej" />

or some regular expression...

I guess you'll have to be careful not to "fix" correct ] tags.
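To illustrate the regular-expression idea and the caveat about correct `]` characters, here is a minimal Python sketch. It is not Subtitle Edit's own code or replace-list syntax; the function name `fix_ej_ending` and the pattern are assumptions made for illustration. It only rewrites "e]" when it follows a letter, ends a word, and is not closing a "[" opened earlier on the same line.

```python
import re

# Hypothetical helper for illustration only; not part of Subtitle Edit.
# Matches "e]" when it directly follows a letter and is not followed by
# another word character, i.e. when it sits at the end of a word.
_EJ_MISREAD = re.compile(r"(?<=[^\W\d_])e\](?!\w)")


def fix_ej_ending(line: str) -> str:
    """Replace a word-final 'e]' with 'ej', leaving real bracket pairs alone."""

    def repl(match: re.Match) -> str:
        before = line[:match.start()]
        # If the nearest bracket before the match is an opening '[',
        # the ']' most likely closes it, so keep the text unchanged.
        if before.rfind("[") > before.rfind("]"):
            return match.group(0)
        return "ej"

    return _EJ_MISREAD.sub(repl, line)


if __name__ == "__main__":
    for sample in ["dobre] nocy", "[whistle]", "te] nowe] [whistle]"]:
        print(fix_ej_ending(sample))
```

The bracket check is deliberately simple: it looks only at the current line, so a "[" opened on a previous line would not be detected.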

iGom commented 3 years ago

Unfortunately, neither rule has any effect. For now I'm left with adding whole words to the replace list. Also, I forgot to mention that the problem with those endings only occurs with the "Original Tesseract" engine.

niksedk commented 3 years ago

I could look into it a bit more if you attach the subtitle (it could be just a few lines; in SE you can delete lines and export as BD sup).

If the font is not too small (DVD-like), you could try nOCR, where you have more control over how characters are trained - see https://www.nikse.dk/SubtitleEdit/nocr

iGom commented 3 years ago

Some subs are from Blu-ray, some from DVD. Attached: Test sup EJ.zip

I'll take a look at nOCR

niksedk commented 3 years ago

I've tried with both Tesseract 3.02 and 5 beta... and I don't get any "e]"

iGom commented 3 years ago

As I mentioned before, it only occurs with the "Original Tesseract only" engine, which I usually use because it detects italics.

[image]