iGom closed this 6 months ago
Hm, you could try a <PartialLines> entry:
<LinePart from="e] " to="ej " />
or a <PartialWords> entry:
<WordPart from="e]" to="ej" />
or some regular expression...
I guess you'll have to be careful not to "fix" correct ] tags.
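For what it's worth, here is a rough Python sketch of what such a regular expression could look like. It is only an illustration of the idea, not a rule from Subtitle Edit: the pattern, the function name fix_ej, and the bracket check are all my own assumptions. It changes "e]" to "ej" only at the end of a word, and skips the match when an unclosed "[" appears earlier on the line, so real tags stay untouched.

```python
import re

# Hypothetical pattern (an assumption, not taken from Subtitle Edit):
# a letter, then "e]", with no word character right after it.
EJ_ENDING = re.compile(r"(?<=[^\W\d_])e\](?!\w)")


def fix_ej(line: str) -> str:
    """Replace a word-final "e]" with "ej", leaving real "]" tags alone."""

    def repl(match: re.Match) -> str:
        prefix = match.string[: match.start()]
        # If a "[" is still open before this point, the "]" most likely
        # closes a genuine tag such as "[westchnienie]", so keep it.
        if prefix.count("[") > prefix.count("]"):
            return match.group(0)
        return "ej"

    return EJ_ENDING.sub(repl, line)


if __name__ == "__main__":
    for sample in ("dobre] nocy", "Nie chcę te].", "[westchnienie]"):
        print(sample, "->", fix_ej(sample))
```

The pattern itself could probably be ported to a regular expression rule in the replace list, but the open-bracket check has no direct equivalent there, so a plain regex rule would still touch tags that end in "e]".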
Unfortunately, neither rule has any effect. For now I'm left with adding whole words to the replace list. Also, I forgot to mention that the problem with those endings only occurs in the "Original Tesseract" engine.
I could look into it a bit more if you attached the sub (just a few lines would do; in SE you can delete lines and export as BD sup).
If the font is not too small (DVD-like), you could try nOCR, where you have more control over the training of characters - see https://www.nikse.dk/SubtitleEdit/nocr
Some subs are from Blu-ray, some from DVD: Test sup EJ.zip
I'll take a look at nOCR
I've tried with both Tesseract 3.02 and 5 beta... and I don't get any "e]"
As I mentioned before, it only occurs in the "Original Tesseract only" engine, which I usually use because it detects italics.
Many Polish words end in "-ej", and OCR often detects that ending as "e]". Is it possible to create an OCR rule that always changes the ending "e]" to "ej"?