Hi @raleighlittles,
it is fine to ask a question here.
The digits are lighter than the background, but ssocr assumes by default that digits are darker than the background. You can use the --foreground=white option to change this. Not specifying a light foreground seems to be the cause of the number of digits issue you have.
I assume that the time stamps do not use leading zeros for day and month, i.e., --print-spaces would be needed in general. Your example picture does not need it, because it does not contain an empty space where a digit is possible.
For historical reasons ssocr assumes the image to contain 6 digits, and uses this as a simple consistency check. Since your image contains 5 digits and an apostrophe, this should actually match. But, with a changing number of digits (one or two digit days, one or two digit months), ssocr should be told to accept the number of digits it finds with the option --number-digits=-1.
That gets us close, but the apostrophe is recognized as an 8, and the spaces do not work as expected (I use -S (same as --ascii-art-segments) to illustrate the recognition results):
$ ssocr --foreground=white --number-digits=-1 --print-spaces -S ~/timestamp.jpg
Display as seen by ssocr:
_ _ _ _ _
| _| _| |_| | | | |
| |_ |_ |_| |_| |_|
1 2 280 0
Using the --debug-output option we find that the apostrophe is taller than accepted for a decimal point:
digit 3: (466,3) -> (495,54), width: 29 ( 4.02%) height: 51 (26.70%)
height/width (int): 1, max_dig_w/width (int): 3, max_dig_h/height (int): 3
We can use --dec-h-ratio=2 to fix that:
$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 -S ~/timestamp.jpg
Display as seen by ssocr:
_ _ _ _
| _| _| | | | |
| |_ |_ . |_| |_|
1 2 2.0 0
Now we can remove the apostrophe with --omit-decimal-point:
$ ./ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point -S ~/timestamp.jpg
Display as seen by ssocr:
_ _ _ _
| _| _| | | | |
| |_ |_ . |_| |_|
1 2 20 0
We still need to fix the size of space characters. By default ssocr uses the minimum distance between digits as the base, but that is quite small here, because the apostrophe is close to the numbers. Just using --space-average does not work too well:
$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point --space-average -S ~/timestamp.jpg
Display as seen by ssocr:
_ _ _ _
| _| _| | | | |
| |_ |_ . |_| |_|
1 2200
But we should be able to use --space-factor=X to adjust this. You probably need to find a setting that works for timestamps with and without missing leading digits, but for this image --space-factor=2 works:
$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point --space-average --space-factor=2 -S ~/timestamp.jpg
Display as seen by ssocr:
_ _ _ _
| _| _| | | | |
| |_ |_ . |_| |_|
12200
I suspect that using the default minimum distance between digits as the base for space character detection would give more stable results than using the average distance. The average distance varies quite a bit because of the digit 1 and potential space characters. Using the minimum distance then requires a larger --space-factor, perhaps 3:
$ ssocr --foreground=white --number-digits=-1 --print-spaces --dec-h-ratio=2 --omit-decimal-point --space-factor=3 -S ~/timestamp.jpg
Display as seen by ssocr:
_ _ _ _
| _| _| | | | |
| |_ |_ . |_| |_|
12200
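To make the difference between the two bases concrete, here is a small shell sketch of the rule as I understand it from this thread: a gap between two digits counts as a space when it is wider than the space factor times the base distance, where the base is the minimum gap by default and the average gap with --space-average. The function name classify_gaps and the gap widths are made up for illustration, and this is an assumption about the behaviour, not ssocr's actual code.

```shell
# classify_gaps FACTOR MODE GAP...
# MODE is "minimum" or "average" (hypothetical names for the two bases).
# Prints "space" for every gap wider than FACTOR * base, "-" otherwise.
classify_gaps() {
    factor=$1; mode=$2; shift 2
    printf '%s\n' "$@" | awk -v f="$factor" -v mode="$mode" '
        {
            gap[NR] = $1 + 0
            sum += gap[NR]
            if (NR == 1 || gap[NR] < min) min = gap[NR]
        }
        END {
            base = (mode == "average") ? sum / NR : min
            for (i = 1; i <= NR; i++)
                printf "%s%s", (gap[i] > f * base ? "space" : "-"), (i < NR ? " " : "\n")
        }'
}

# Made-up gap widths in pixels: two narrow gaps and two wide ones.
classify_gaps 3 minimum 10 40 12 60     # prints: - space - space
classify_gaps 1.4 average 10 40 12 60   # prints: - - - space
```

With these made-up numbers, the wide gaps pull the average up to 30.5, so the 40-pixel gap is no longer classified as a space, while the minimum stays at 10 regardless of how many wide gaps the image contains.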
You can use the --debug-output option to get information about digit spacing and so on.
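Once a set of options works for one image, a small wrapper can apply it to a batch of scans. This is a minimal sketch, assuming the options from the last command above carry over to the other images (they may well need tuning). The SSOCR variable and the function name read_timestamp are hypothetical.

```shell
#!/bin/sh
# SSOCR is a hypothetical override for the path to the ssocr binary.
SSOCR="${SSOCR:-ssocr}"

read_timestamp() {
    # All options are copied from the last command in this thread.
    "$SSOCR" --foreground=white --number-digits=-1 --print-spaces \
        --dec-h-ratio=2 --omit-decimal-point --space-factor=3 "$1"
}

# Process every image passed on the command line.
for img in "$@"; do
    printf '%s: ' "$img"
    # Assumes a non-zero exit status means the recognition failed.
    read_timestamp "$img" || echo "(recognition failed)"
done
```

Saved under a name of your choice, it would be called with the image files as arguments, for example with ~/timestamp.jpg.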
Please let me know if this helps.
@auerswal Thank you so much! Your explanation was very helpful. It seemed to work well and now I know enough to tune things as needed.
Just wanted to start off by saying thank you for releasing this software, and if this is not the right place for this, please feel free to delete it. I searched for ssocr forums but didn't find any.
I'm trying to use ssocr to parse the timestamp from scanned film images. (I was trying to use PyTesseract previously, but that was having issues with OCR and I was recommended ssocr instead.)
The issue I'm having is that ssocr isn't correctly recognizing the number of digits in the image. Here's the output if I run:
found only 1 of 6 digits
The thresholding works well, but I can't figure out why it's not correctly detecting the number of digits -- I think it has to do with spacing of the digits, since these digits are more horizontally spaced out than the example images on your site.
Here's the settings from the earlier command:
I already tried setting the space-average option to accommodate this, but it didn't work. I tried manually specifying the number of digits (5), but the exact same issue happens (except I get "found only 1 of 5 digits" instead). I also tried changing the space-factor setting, but there is little documentation about how it actually works. I see that the default is set to 1.40, so I tried larger values (5 and 10 respectively), but still had the exact same problem.
What am I doing wrong here? What can be done to improve the recognition in this example?