SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

OCR via Binary Image Compare issues #2656

Closed Belzak56 closed 2 years ago

Belzak56 commented 6 years ago

As advised here:

https://github.com/SubtitleEdit/subtitleedit/issues/2643

I've tried "Binary Image Compare" and it was very helpful indeed.

Nevertheless, there were a few issues I would like to point out here:

  1. Numbers are recognized in reversed order when "No of pixels is space" is set to 5, e.g. 365 is read as 536. If the value is changed to 7 this doesn't happen, but that leads to another problem, explained in point 2 below. I'm guessing this might be resolved by allowing decimals in the "No of pixels is space" setting.

  2. Two lines overlapping. I've been able to work around this manually by choosing a value of 50 for "Min. line height (split)"; when I hit another pair of overlapping lines, I reduce the value to 40 or 45, and so on. I would appreciate it if "Min. line height (split)" could be adjusted in steps of 1 between 40 and 50 instead of steps of 5, so that the ideal line height can be set without going back and forth manually.

  3. There are two Arabic letters, (Ra'a) and (Zai, the same as Ra'a but with a dot above it); when the letter (Alef) or a parenthesis comes after them, they are recognized as a space, as shown below:

[7 screenshots attached]

I'm not sure why this happens only for the letter (Alef) and parentheses and not for any other letters, but it might be related to the width of the letter (Alef) and of the parentheses, and to the letter (Alef) falling within the range of the letters (Ra'a) and (Zai).

niksedk commented 6 years ago

Are you running the latest beta?

Could you email or upload the sub?

Belzak56 commented 6 years ago

I'm using the latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.4/SubtitleEditBeta.zip

The subtitle is below: Mv84.zip

niksedk commented 6 years ago

Thx for the sub - beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.4/SubtitleEditBeta.zip

2 + 3 should be improved.

If "1" still occurs, do you have some samples?

Belzak56 commented 6 years ago

2 + 3 are working perfectly now. Thanks Nikse!

Unfortunately, another issue has occurred.

The OCR recognition is now ignoring: the quotation mark ("), the Arabic comma (،), the letter (Alef), and the parenthesis (sometimes).

Note that Max. error % is set to 0.0.

Below are some examples:

## Line 1 of Subtitle: issues_binary_arabic_03

## Line 138 of Subtitle: issues_binary_arabic_01

## Line 139 of Subtitle: issues_binary_arabic_02

As for "1", it seems to result from the "Right to left" option being ticked. If this option is not ticked, the order of the numbers is correct. I don't know how this can be fixed, as the "Right to left" option is necessary for Arabic. I would suggest, if possible, treating digits only as "LTR" instead of "RTL".

This can be seen in lines 202, 204, 206, 209, 309, 310, 311, and 312 of the subtitle attached in the previous post.
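The "digits as LTR" idea can be illustrated with a minimal Python sketch. This is not Subtitle Edit's code; the function name is hypothetical, and it assumes the OCR step yields the glyphs of a line in visual left-to-right order, and that separators such as `:`, `.` and `,` between digits belong to the number.

```python
import re

# A number: digit groups optionally joined by ':', '.' or ',' (e.g. 12:30).
# The capturing group makes re.split keep the numbers in the result.
NUMBER = re.compile(r"(\d+(?:[:.,]\d+)*)")

def rtl_reading_order(visual: str) -> str:
    """Turn glyphs given in visual left-to-right order into RTL reading
    order, while keeping every number in its left-to-right digit order."""
    parts = NUMBER.split(visual)  # odd indices hold the numbers
    fixed = [p if i % 2 == 1 else p[::-1] for i, p in enumerate(parts)]
    # Reverse the order of the runs; numbers keep their internal order.
    return "".join(reversed(fixed))
```

With this sketch, a plain string reversal of `abc 365 de` would give `ed 563 cba`, whereas `rtl_reading_order` keeps the number intact and leaves a time such as `12:30` unchanged.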

niksedk commented 6 years ago

Beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.4/SubtitleEditBeta.zip

Do you have a line where I could test the "reversed numbers issue"?

Belzak56 commented 6 years ago

This Beta works perfectly. At least for this font. Thanks a lot Nikse.

For the reversed numbers issue, first please find the below table for Arabic and English Numbers for your reference:

numbers
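For readers without the image, the usual correspondence between Arabic-Indic digits (Unicode U+0660 to U+0669) and Western digits can be sketched in Python as follows. Whether the attached table shows exactly this digit set is an assumption, and the helper names are made up for illustration.

```python
# Arabic-Indic digits are U+0660 (zero) through U+0669 (nine).
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
WESTERN = "0123456789"

TO_WESTERN = str.maketrans(ARABIC_INDIC, WESTERN)
TO_ARABIC_INDIC = str.maketrans(WESTERN, ARABIC_INDIC)

def to_western(text: str) -> str:
    """Replace Arabic-Indic digits with Western digits; other characters stay."""
    return text.translate(TO_WESTERN)

def to_arabic_indic(text: str) -> str:
    """Replace Western digits with Arabic-Indic digits; other characters stay."""
    return text.translate(TO_ARABIC_INDIC)
```

The mapping only swaps the digit characters one for one; it does not change their order.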

Below are samples from the same subtitle:

## Line 202: Recognized: 30:12 Correct: 12:30

202

## Line 204: Recognized: 552 and 64 2 Correct: 255 and 642

204

## Line 206: Recognized: 070 Correct: 700

206

## Line 209: Recognized: 70 0 Correct: 700

209

Please note that this happens whether Arabic or English numbers are chosen for the output text.

Belzak56 commented 6 years ago

@niksedk

I did intensive tests on the latest beta that you've provided:

https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.4/SubtitleEditBeta.zip

My observations are below:

  1. The overlapping issue has been resolved for this font. It now happens very rarely and can be handled very easily. For other fonts it does not happen at all.

  2. I've managed to resolve the "Reversed Numbers" issue by using regex.

  3. Regex is very useful and helped me tackle many minor issues, but I would like to ask whether it is possible to add a "Comments" or "Remarks" tab at the end, near the "Search Type" tab. I know that I can write a comment in the same line by adding (?# ..... ), but it would be much easier if the comments were in a separate column.

24989290_10213889849855879_352521586_n
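To illustrate the two ways of annotating a rule mentioned in point 3, here is a minimal Python sketch. The rule itself (joining a number split by a stray space, as in "70 0") is a made-up example and not the regex actually used in this thread; Python's `re` module happens to accept the same `(?# ... )` inline comment syntax.

```python
import re

# Form 1: the comment lives inside the pattern as an inline (?# ...) group.
INLINE = r"(?<=\d) (?=\d)(?# join a number split by a stray space)"

# Form 2: the comment is kept in a separate column next to the pattern.
RULES = [
    # (pattern, replacement, comment)
    (r"(?<=\d) (?=\d)", "", "join a number split by a stray space"),
]

def apply_inline(text: str) -> str:
    """Apply the rule whose comment is embedded in the pattern."""
    return re.sub(INLINE, "", text)

def apply_rules(text: str) -> str:
    """Apply every rule in the table; the comment column is ignored."""
    for pattern, replacement, _comment in RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

Both forms behave identically; the separate column only keeps the pattern itself shorter. Note that a rule like this would also join two numbers that are legitimately separated by a single space.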

  4. Although the recognition became near perfect in this latest beta, I faced an issue with the space between letters. If I set "No of pixels is space" to 4, there is a space between certain letters within the same word. If I set "No of pixels is space" to 5, there is no such space, but then two separate words are joined together.

Below are two examples from this subtitle:

MV41.zip

## Line 10, No of pixels is space: 4

line10_pixel4

## Line 10, No of pixels is space: 5

line10_pixel5

## Line 305, No of pixels is space: 4

line305_pixel4

## Line 305, No of pixels is space: 5

line305_pixel5

My suggestion to resolve this issue is to allow decimals in the "No of pixels is space" setting, but I don't know if this is feasible programmatically.
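The trade-off between the values 4 and 5 can be sketched in Python under two loudly stated assumptions that this thread does not confirm: that gaps between glyphs are measured in whole pixels, and that a gap of at least the configured value counts as a space. The function and the sample data are hypothetical.

```python
def split_words(glyphs, gaps, min_space_px):
    """Group glyphs into words. gaps[i] is the pixel gap between
    glyphs[i] and glyphs[i + 1]; a gap >= min_space_px starts a new word."""
    if not glyphs:
        return []
    words = [glyphs[0]]
    for glyph, gap in zip(glyphs[1:], gaps):
        if gap >= min_space_px:
            words.append(glyph)   # gap wide enough: start a new word
        else:
            words[-1] += glyph    # gap too narrow: same word
    return words

# Hypothetical line: "ab" and "cd" are two words, but the gap inside
# "ab" (4 px) is as wide as the gap between the two words (4 px).
glyphs = ["a", "b", "c", "d"]
gaps = [4, 4, 2]
```

In this made-up case a threshold of 4 splits "ab" apart, a threshold of 5 joins everything, and a fractional threshold such as 4.5 would behave like 5 because no whole-pixel gap falls between 4 and 5. If the assumptions hold, a decimal setting would only help when gaps are measured with sub-pixel precision.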