auerswal / ssocr

Seven Segment Optical Character Recognition
https://www.unix-ag.uni-kl.de/~auerswal/ssocr/index.html
GNU General Public License v3.0

Seven segment font in scanned image timestamps -- how to use the `space-factor` setting? #19

Closed raleighlittles closed 2 years ago

raleighlittles commented 2 years ago

Just wanted to start off by saying thank you for releasing this software. If this is not the right place for this, please feel free to delete it; I searched for ssocr forums but didn't find any.

I'm trying to use ssocr to parse the timestamp from scanned film images.

[Image: timestamp]

(I was previously trying to use PyTesseract, but it had trouble with the OCR, and ssocr was recommended to me instead.)

The issue I'm having is that ssocr isn't correctly recognizing the number of digits in the image. Here's the output if I run:

$ ssocr -Dillustrate_algo_ts.png -T --omit-decimal-point --debug-output --print-info --space-average timestamp.jpg

found only 1 of 6 digits

[Image: illustrate_algo_ts]

The thresholding works well, but I can't figure out why it's not correctly detecting the number of digits -- I think it has to do with spacing of the digits, since these digits are more horizontally spaced out than the example images on your site.

Here's the settings from the earlier command:

flags & PRINT_INFO=32
flags & SPC_USE_AVG_DST=4096
================================================================================
flags & VERBOSE=4
thresh=50.000000
flags & PRINT_INFO=32
flags & ADJUST_GRAY=0
flags & ABSOLUTE_THRESHOLD=0
flags & DO_ITERATIVE_THRESHOLD=2
flags & USE_DEBUG_IMAGE=8
flags & DEBUG_OUTPUT=32
flags & PROCESS_ONLY=0
flags & ASCII_ART_SEGMENTS=0
flags & PRINT_AS_HEX=0
flags & OMIT_DECIMAL=1024
flags & PRINT_SPACES=0
flags & SPC_USE_AVG_DST=4096
need_pixels = 1
ignore_pixels = 0
number_of_digits = 6
foreground = 0 (black)
background = 255 (white)
luminance  = Rec709
charset    = full
height/width threshold for one   = 3
width/height threshold for minus = 2
max_dig_h/h threshold for decimal = 5
max_dig_w/w threshold for decimal = 2
distance factor for adding spaces = 1.40
optind=7 argc=8
================================================================================
argv[argc-1]=timestamp.jpg used as image file name
loading image timestamp.jpg
image width: 761
image height: 192
13.00 <= lum <= 170.00 (lum should be in [0,255])
adjusting threshold to image: 50.000000 -> 35.882353
doing iterative_thresholding: 35.882353 -> 32.156863
using threshold 32.16
no commands given, using image timestamp.jpg unmodified
found only 1 of 6 digits
using png format for debug image
writing debug image to file illustrate_algo_ts.png

I already tried setting the space-average option to accommodate this, but it didn't work. I also tried manually specifying the number of digits (5), but the exact same issue happens (except that I get "found only 1 of 5 digits" instead). Finally, I tried changing the space-factor setting, but there is little documentation about how it actually works. I see that the default is 1.40, so I tried larger values (5 and 10), but still had the exact same problem.

What am I doing wrong here? What can be done to improve the recognition in this example?

auerswal commented 2 years ago

Hi @raleighlittles,

it is fine to ask a question here.

The digits are lighter than the background, but by default ssocr assumes that digits are darker than the background. You can use the --foreground=white option to change this, which should take care of the number of digits issue.

I assume that the timestamps do not use leading zeros for day and month, so --print-spaces would be needed in general. Your example picture does not need it, because it does not contain an empty space where a digit is possible.

For historical reasons ssocr assumes the image to contain 6 digits, and uses this as a simple consistency check. Since your image contains 5 digits and an apostrophe, this should actually match. But with a changing number of digits (one or two digit days, one or two digit months), ssocr should be told to accept whatever number of digits it finds, using the option --number-digits=-1.

That gets us close, but the apostrophe is recognized as an 8, and the spaces do not work as expected (I use -S (same as --ascii-art-segments) to illustrate the recognition results):

$ ssocr --foreground=white --number-digits=-1 --print-spaces -S ~/timestamp.jpg 
Display as seen by ssocr:
         _      _   _   _      _ 
   |     _|     _| |_| | |    | |
   |    |_     |_  |_| |_|    |_|

1 2 280 0

Using the --debug-output option we find that the apostrophe is taller than accepted for a decimal point:

digit 3: (466,3) -> (495,54), width: 29 ( 4.02%) height: 51 (26.70%)
  height/width (int): 1, max_dig_w/width (int): 3, max_dig_h/height (int): 3

We can use --dec-h-ratio=2 to fix that:

$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 -S ~/timestamp.jpg 
Display as seen by ssocr:
         _      _       _      _ 
   |     _|     _|     | |    | |
   |    |_     |_   .  |_|    |_|

1 2 2.0 0
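
To see why --dec-h-ratio=2 changes the result, here is a minimal Python sketch of the size check as it can be inferred from the debug output above. It is not code taken from ssocr: the function name `looks_like_decimal` is made up for illustration, and the strict "greater than" comparison is an assumption. The default thresholds 5 and 2 are the values shown in the --print-info output, and the two ratios of 3 are the integer values reported for digit 3.

```python
# Simplified model of the decimal point size check, inferred from the
# debug output in this thread. This is NOT ssocr's actual source code.


def looks_like_decimal(h_ratio, w_ratio, dec_h_ratio=5, dec_w_ratio=2):
    """Return True if a glyph is small enough to count as a decimal point.

    h_ratio: "max_dig_h/height (int)" from the debug output
    w_ratio: "max_dig_w/width (int)" from the debug output
    The strict '>' comparison is an assumption made for this sketch.
    """
    return h_ratio > dec_h_ratio and w_ratio > dec_w_ratio


# Integer ratios reported for "digit 3" (the apostrophe).
apostrophe = {"h_ratio": 3, "w_ratio": 3}

# Default thresholds: the apostrophe is too tall to be a decimal point.
default_result = looks_like_decimal(**apostrophe)

# With --dec-h-ratio=2 the height threshold is relaxed.
relaxed_result = looks_like_decimal(**apostrophe, dec_h_ratio=2)

print(default_result, relaxed_result)
```

Under these assumptions the apostrophe fails the check with the default height threshold of 5 (so it is treated as a digit and misread as an 8) and passes once the threshold is lowered to 2.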

Now we can remove the apostrophe with --omit-decimal-point:

$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point -S ~/timestamp.jpg 
Display as seen by ssocr:
         _      _       _      _ 
   |     _|     _|     | |    | |
   |    |_     |_   .  |_|    |_|

1 2 20 0

We still need to fix the detection of space characters. By default, ssocr uses the minimum distance between digits as the base for deciding where to print a space, but that distance is quite small here, because the apostrophe is close to the numbers. Just using --space-average does not work too well:

$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point --space-average -S ~/timestamp.jpg 
Display as seen by ssocr:
         _   _       _   _ 
   |     _|  _|     | | | |
   |    |_  |_   .  |_| |_|

1 2200

But we should be able to use the --space-factor=X option to adjust this. You probably need to find a setting that works for timestamps with and without missing leading digits, but for this image --space-factor=2 works:

$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point --space-average --space-factor=2 -S ~/timestamp.jpg 
Display as seen by ssocr:
      _   _       _   _ 
   |  _|  _|     | | | |
   | |_  |_   .  |_| |_|

12200

I suspect that using the default minimum distance between digits as the base for space character detection would give more stable results than using the average distance. The average distance varies quite a bit because of the digit 1 and potential space characters. Using the minimum distance then requires a larger --space-factor, perhaps 3:

$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point --space-factor=3 -S ~/timestamp.jpg 
Display as seen by ssocr:
      _   _       _   _ 
   |  _|  _|     | | | |
   | |_  |_   .  |_| |_|

12200
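
The interaction between --space-average and --space-factor can be illustrated with a small Python model. This is only a sketch under stated assumptions: it assumes that a space is printed when the distance between two neighbouring characters exceeds the factor times the base distance (minimum by default, average with --space-average). The function `insert_spaces` is hypothetical, and the gap values are invented so that they reproduce the pattern of results seen above; they are not measured from the image. The model also keeps the "." in its output, whereas the commands above drop it with --omit-decimal-point.

```python
from statistics import mean


def insert_spaces(chars, gaps, factor=1.4, use_average=False):
    """Join chars, adding a space wherever a gap exceeds factor * base.

    chars: recognized characters, left to right
    gaps:  distance between each pair of neighbours (len(chars) - 1 values)
    base:  minimum gap by default, average gap if use_average is True
    """
    if len(chars) < 2:
        return "".join(chars)
    if len(gaps) != len(chars) - 1:
        raise ValueError("need exactly one gap per pair of neighbours")
    base = mean(gaps) if use_average else min(gaps)
    limit = factor * base
    out = [chars[0]]
    for ch, gap in zip(chars[1:], gaps):
        if gap > limit:
            out.append(" ")
        out.append(ch)
    return "".join(out)


chars = ["1", "2", "2", ".", "0", "0"]
# Hypothetical pixel gaps: wide after the narrow 1, narrow around the apostrophe.
gaps = [50, 30, 20, 25, 30]

min_default = insert_spaces(chars, gaps)                             # minimum, 1.4
avg_default = insert_spaces(chars, gaps, use_average=True)           # average, 1.4
avg_factor2 = insert_spaces(chars, gaps, factor=2, use_average=True)  # average, 2
min_factor3 = insert_spaces(chars, gaps, factor=3)                   # minimum, 3

for result in (min_default, avg_default, avg_factor2, min_factor3):
    print(repr(result))
```

With these invented gaps the small distance next to the apostrophe pulls the minimum down, so the default factor of 1.4 marks several ordinary gaps as spaces. A larger factor, or the average as base, raises the limit until only genuinely empty digit positions would exceed it.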

You can use the --debug-output option to get information about digit spacing and so on.
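
Since the question came from a Python workflow (PyTesseract), the last command above could be wrapped roughly as follows. This is a sketch, not a tested integration: the helper names `build_ssocr_args` and `read_timestamp` are made up, the sketch assumes that `ssocr` is on the PATH and prints the recognized text to standard output, and it uses only options that appear in this thread.

```python
import subprocess


def build_ssocr_args(image_path, space_factor=3, ssocr_bin="ssocr"):
    """Build the argument list for the final command from this thread."""
    return [
        ssocr_bin,
        "--foreground=white",      # digits are lighter than the background
        "--number-digits=-1",      # accept however many digits are found
        "--print-spaces",          # keep gaps left by missing leading digits
        "--dec-h-ratio=2",         # treat the tall apostrophe as a decimal point
        "--omit-decimal-point",    # then drop it from the output
        f"--space-factor={space_factor}",
        image_path,
    ]


def read_timestamp(image_path, space_factor=3):
    """Run ssocr on one image and return its output without the newline.

    Assumes ssocr is installed; raises CalledProcessError if it exits
    with a non-zero status.
    """
    result = subprocess.run(
        build_ssocr_args(image_path, space_factor),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# Only the argument list is built here; nothing is executed.
args = build_ssocr_args("timestamp.jpg")
print(" ".join(args))
```

Keeping the argument list in its own function makes it easy to try different --space-factor values over a batch of scans and compare the results.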

Please let me know if this helps.

raleighlittles commented 2 years ago

@auerswal Thank you so much! Your explanation was very helpful. It seemed to work well and now I know enough to tune things as needed.