bertsky opened 4 years ago
If using Olena binarization (other methods tested long before), I would recommend `wolf`:
(The "Wiener" snippet is from a larger image. Within one method, left to right: increasing noise. Lines of the same contrast at various brightness levels (always dark on bright in this experiment).)
PS: Default settings applied.
PS2: One could argue "preprocess low-contrast first", but how to do this without knowing noise levels to skip, etc.? I think this is the primary task for binarization.
But in Step 5, `cis-ocropy-deskew` is mentioned (though not "recommended" in Step 9?)
Hmm.. we are talking about https://ocr-d.de/en/workflows?
The "Recommendations" at the end of that page?
> Hmm.. we are talking about https://ocr-d.de/en/workflows?
> The "Recommendations" at the end of that page?
Yes!
> But in Step 5, `cis-ocropy-deskew` is mentioned (though not "recommended" in Step 9?)
I don't mind that it is not recommended in Step 9, because IMHO on the region level orientation is more important than skew, and can differ between region and page, whereas skew is usually uniform across a page (otherwise you usually need dewarping anyway).
The above list really only concerns the overall workflow configuration recommendations.
> If using Olena binarization (other methods tested long before), I would recommend `wolf`
Thanks again for that test suite, and for paying attention to this front in general, which is not appreciated enough IMO. We've briefly discussed this in the chat already, and I would like to elaborate on some open points:
So I think we need a different artificial test bed. And then we also need a set of true images of various appearances and qualities.
> PS2: One could argue "preprocess low-contrast first", but how to do this without knowing noise levels to skip, etc.? I think this is the primary task for binarization.
I disagree. IMO there should be specialisation and modularisation. So binarization processors can concentrate on their core problem, and others can try to solve related ones. If a processor chooses to do 2 or 3 steps in one, fine (we've seen this elsewhere), but we should always have the option to freely combine. And that in turn means we must do it for a fair evaluation, too.
1. The example image is interesting and helpful to see what's going on, but it is also misleading, because it is highly artificial: Algorithms with local thresholding are quite sensitive to highly localized contrast/brightness changes, esp. if they are discontinuous.
The word "Wiener" is just a snippet from a 1238 × 388 px image, but I now realize that I forgot to set the DPI.
2. Binarization does not (have to) stand alone. If we know contrast/brightness is far from normal, and noise is perceptible, then we would run normalization and raw denoising before anyway. Of course some algorithms are more robust against either of those than others. But if we want a fair competition, we should eliminate them (because we can).
We have images with weak contrast, partly with black surrounding background. Then we must do preprocessing after the clipping step, too, I assume.
Will redo with DPI set.
With ImageMagick `convert -contrast-stretch 1%x7%  # assumption: black on white` and 300 DPI:
> The word "Wiener" is just a snippet from a 1238 × 388 px image, but I now realize that I forgot to set the DPI.
Wow! They rarely come in as pristine in quality as this!
> We have images with weak contrast, partly with black surrounding background. Then we must do preprocessing after the clipping step, too, I assume.
If you mean cropping instead of clipping, then yes. And contrast/brightness normalization afterwards.
> With ImageMagick `convert -contrast-stretch 1%x7%  # assumption: black on white` and 300 DPI:
Thanks! Very interesting. You immediately see the class of algorithms that are most sensitive to level dynamics: the Sauvola type (including Kim). They are usually implemented without determining the `r` from the input. (For example, Olena/Scribo just uses `#define SCRIBO_DEFAULT_SAUVOLA_R 128` in a uint8 space.)
But just to be sure: did you use the window size zero default for Olena (which should set it to the odd-valued number closest to DPI)? I would expect Niblack to look somewhat better...
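To make the `r` point concrete, here is a minimal sketch using scikit-image's `threshold_sauvola` as a stand-in (not the Olena code; the window size and `k` below are just plausible values, not Olena's defaults): Sauvola thresholds at `T = m * (1 + k * (s/R - 1))`, so with a fixed `R = 128` and compressed dynamics the local standard deviation `s` stays small, `T` drops towards `m * (1 - k)`, and faint foreground is missed.

```python
# Sketch only: scikit-image exposes the same R ("r") parameter; its default,
# like Olena's SCRIBO_DEFAULT_SAUVOLA_R, is derived from the dtype range
# rather than from the actual dynamics of the image.
import numpy as np
from skimage.filters import threshold_sauvola

def sauvola_binarize(gray_u8, window_size=301, k=0.34, adapt_r=False):
    # adapt_r=True ties R to the observed dynamic range instead of uint8's 128
    r = (float(gray_u8.max()) - float(gray_u8.min())) / 2 if adapt_r else 128
    thresh = threshold_sauvola(gray_u8, window_size=window_size, k=k, r=r)
    return gray_u8 > thresh  # True = background (paper), False = foreground (ink)
```

On a low-contrast input, `adapt_r=True` keeps `s/R` in a sensible range, which is exactly what a hard-coded default cannot do.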
Also, your `contrast-stretch` recipe is somewhat different from `normalize` as used by Fred's IM script `textcleaner` or by one of ocrd_wrap's presets. I wonder what made you consider increasing the white-out tolerance up to 7%. Perhaps we can get to an OCR-specific optimum preset here?
> cropping instead of clipping
yes
> did you use the window size zero default for Olena
I did not add any parameter -- defaults only.
> I wonder what made you consider increasing the white-out tolerance up to 7%
From here: https://sourceforge.net/p/localcontrast/code/ci/default/tree/ctmf.c, line 362.
7/1 = the white/black ratio when using the "Stempel Garamond" font (black on white) with reasonable leading. This is what I usually do when I don't know better. The basic idea is that black and white pixels are Gaussian distributed and far enough apart, so I crop "equally" on both ends.
> From here: https://sourceforge.net/p/localcontrast/code/ci/default/tree/ctmf.c, line 362.
>
> 7/1 = the white/black ratio when using the "Stempel Garamond" font (black on white) with reasonable leading. This is what I usually do when I don't know better. The basic idea is that black and white pixels are Gaussian distributed and far enough apart, so I crop "equally" on both ends.
Oh I see. So IIUC your logic goes: …
So you could have used `convert -contrast-stretch 0.5%x3.5%` by the same argument, right?
Also, I'd like to verify that ratio point for concrete scans, because on average I assume initials, separators, ornaments and images will increase the share of black. (I'll make a coarse measurement on a real corpus to check.)
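For the record, the arithmetic behind that reading can be spelled out in a few lines (the 1:7 black/white share is the stated assumption, not a measured value):

```python
# Hypothetical sanity check of the clipping argument: with black:white pixel
# shares of 1:7, clipping 1% of all pixels at the black end and 7% at the
# white end removes the same *fraction* of each class; 0.5%x3.5% halves it.
black_share, white_share = 1 / 8, 7 / 8
for lo_clip, hi_clip in [(0.01, 0.07), (0.005, 0.035)]:
    frac_black = lo_clip / black_share   # fraction of black pixels clipped
    frac_white = hi_clip / white_share   # fraction of white pixels clipped
    print(f"-contrast-stretch {lo_clip:.1%}x{hi_clip:.1%}: "
          f"{frac_black:.0%} of black, {frac_white:.0%} of white clipped")
# -contrast-stretch 1.0%x7.0%: 8% of black, 8% of white clipped
# -contrast-stretch 0.5%x3.5%: 4% of black, 4% of white clipped
```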
Yes, this is the idea.
Some interesting binarization with many more equations that someone else pointed me to: https://arxiv.org/pdf/2007.07350.pdf
> Some interesting binarization with many more equations that someone else pointed me to: https://arxiv.org/pdf/2007.07350.pdf
Yes, we've briefly discussed that in the Lobby. Here is the implementation. Unfortunately it does not combine with existing local thresholding algorithms yet.
> So I think we need a different artificial test bed.
As a first step, @jbarth-ubhd could you please change your code to do each point in your matrix (i.e. noise columns, brightness rows) on the full image, and only tile it in the final summary image for visualisation?
Yes, this is what I've done: process the "full" 1238 × 388 px image (from PDF, 300 DPI, DIN A7) and extract the word "Wiener" for compact comparison.
> Yes, this is what I've done: process the "full" 1238 × 388 px image (from PDF, 300 DPI, DIN A7) and extract the word "Wiener" for compact comparison.
Oh, I see! Then I misunderstood you above. In that case you can scratch my point 1 entirely.
So how about running `ocrd-skimage-normalize` in comparison to your contrast-stretch, and running `ocrd-skimage-denoise-raw` even before that, and running `ocrd-skimage-denoise` after binarization? That would be points 2 and 4. Finally, point 3 would be running with different window sizes, say 101 (i.e. smaller than default, more localized) and 401 (i.e. larger), and thresholds, say `k=0.1` (i.e. heavier foreground) and `k=0.4` (i.e. lighter foreground). If you point me to your implementation, I can help with a PR/patch...
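A rough sketch of such a sweep, using scikit-image's Sauvola as a stand-in for the Olena implementations (the file names, the choice of algorithm and the speckle size are illustration-only assumptions):

```python
import numpy as np
from skimage.io import imread, imsave
from skimage.filters import threshold_sauvola
from skimage.morphology import remove_small_objects

gray = imread("test-page.png")                    # hypothetical grayscale scan
for window_size in (101, 401):                    # more localized vs. larger context
    for k in (0.1, 0.4):                          # heavier vs. lighter foreground
        binary = gray > threshold_sauvola(gray, window_size=window_size, k=k)
        # crude post-binarization denoising (point 4): drop tiny ink speckles
        cleaned = ~remove_small_objects(~binary, min_size=8)
        imsave(f"sauvola_w{window_size}_k{k}.png",
               (cleaned * 255).astype(np.uint8))
```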
https://digi.ub.uni-heidelberg.de/diglitData/v/various-levels-black-white-noise.tgz. Feel free to do anything with it. Sorry, no docs. `gen` generates various `bXXXwXXXnX.ppm` from `Beethoven-testtext.pgm`, downsampling 25%. Sorry, width and height are hard-coded in `gen.c++`. Convert the output to `.tif`. In `methods/` run `do.cmd`, afterwards `montage.pl`. I could do it on Monday.
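For readers without a C++ toolchain, here is a rough Python re-sketch of what `gen` appears to do, based only on the description above (the Gaussian noise model, the 25% scale factor and the output naming are assumptions, not a port of `gen.c++`):

```python
import numpy as np
from skimage.io import imread, imsave
from skimage.transform import rescale

# 0 = ink, 1 = paper after normalization; "downsampling 25%" as described above
template = imread("Beethoven-testtext.pgm").astype(float) / 255.0
template = rescale(template, 0.25, anti_aliasing=True)

def variant(black, white, noise_sigma, seed=0):
    """Map ink/paper to the chosen gray levels and add noise."""
    rng = np.random.default_rng(seed)
    img = black + (white - black) * template
    img = img + rng.normal(0.0, noise_sigma, img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)

imsave("b064w192n8.png", variant(black=64, white=192, noise_sigma=8))
```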
> Feel free to do anything with it. Sorry, no docs. `gen` generates various `bXXXwXXXnX.ppm` from `Beethoven-testtext.pgm`, downsampling 25%. Sorry, width and height are hard-coded in `gen.c++`. Convert the output to `.tif`. In `methods/` run `do.cmd`, afterwards `montage.pl`.
Thanks! I'll give it a shot soon.
But first, I must correct myself significantly regarding my previous recommendations on binarization.
Looking at a representative subset of pages from Deutsches Textarchiv, I found that (contrary to what I said before)…
A. re-binarization after cropping or B. binarization after contrast normalization
…may actually impair quality for most algorithms!
Here is a sample image with heavy show-through:
This is its histogram:
And that is (a section of) the result of binarization in 7 of Olena's algorithms, on …, where normalization is `ocrd-skimage-normalize` (i.e. contrast stretching, now with 1% black-point and 7% white-point clipping by default):
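(For orientation, that kind of normalization can be approximated as plain percentile-based contrast stretching; a sketch only, not the actual ocrd_wrap code:)

```python
import numpy as np
from skimage import exposure

def stretch_contrast(gray_u8, black_point=1.0, white_point=7.0):
    # clip 1% at the black end and 7% at the white end, then rescale to 0..255
    lo, hi = np.percentile(gray_u8, (black_point, 100.0 - white_point))
    out = exposure.rescale_intensity(gray_u8.astype(float),
                                     in_range=(lo, hi), out_range=(0.0, 255.0))
    return out.astype(np.uint8)
```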
As you can see:
- `niblack` is quite invariant (but unusable; perhaps a search for better `k` might help)
- `otsu`, `wolf` and `sauvola-ms` already become unusable due to show-through when using only the cropped image
- `sauvola`, `kim` and `singh` become unusable due to show-through when using the cropped image and normalizing it
- `wolf` and `sauvola-ms` (but not `sauvola` proper!) look pretty much like `otsu` when the dynamics are improper
- `kim` is always too "light" (perhaps a search for better `k` might help)
- `singh` is always a little noisy (perhaps binary denoising afterwards might help)
- … `niblack`)

IMHO the explanation for this is in the above histogram: Cropping to the page will also cut out the true black from the histogram, leaving foreground ink very close to show-through.
Where do we go from here? How do we formalise this problem so we can include it in the artificial test set above, and possibly address it in the processor implementations already? Would Generalized Histogram Thresholding help?
I think show-through can't be binarized correctly in all cases. What if this is a blank page and the whole reverse page shows through? Perhaps we could build some statistics over all pages of a book so we can estimate the average minimum (local) contrast, but what then when a page has weak ink...
sbb-binarize:
no contrast stretch before:
Would be interesting to see how `sbb-binarize` copes with normalized and with cropped images. But the message is already clear: good neural modelling is superior.
> I think show-through can't be binarized correctly in all cases. What if this is a blank page and the whole reverse page shows through?
I don't think this is really an issue practically. We will need (and have) page classification anyway, and having a class empty page beside title, index, content (or whatever) should not be difficult.
> Perhaps we could build some statistics over all pages of a book so we can estimate the average minimum (local) contrast, but what then when a page has weak ink...
Good idea, but robust heuristic binarization needs to be locally adaptive, so it might be difficult to go global even across the page. Perhaps some algorithms are more suited for this than others. And certainly quality estimation will build on such global statistics.
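The book-level statistics idea could start out as simply as this (everything here, from the directory layout to the 0.5 cutoff, is a made-up illustration, not an existing OCR-D processor):

```python
import numpy as np
from pathlib import Path
from skimage.io import imread
from skimage.util import img_as_float

def page_contrast(gray, black_point=1.0, white_point=7.0):
    lo, hi = np.percentile(gray, (black_point, 100.0 - white_point))
    return hi - lo          # crude proxy for ink/paper separation

contrasts = {p.name: page_contrast(img_as_float(imread(p, as_gray=True)))
             for p in sorted(Path("book/pages").glob("*.png"))}   # hypothetical layout
book_median = np.median(list(contrasts.values()))
suspects = [name for name, c in contrasts.items() if c < 0.5 * book_median]
print("pages with unusually low contrast relative to the book:", suspects)
```

Such per-book figures could feed both a "skip this page" heuristic and the quality estimation mentioned above.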
Just for completeness: ocropus-nlbin with `-n`; not normalized before:
> Just for completeness: ocropus-nlbin with `-n`; not normalized before:
Should be the same as `ocrd-cis-ocropy-binarize`, right?
(venv) xx@yy:~/ocrd_all> find . -name "*.py"|egrep -v '/venv/'|xargs grep -3 percentile_filter
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py- # if not, we need to flatten it by estimating the local whitelevel
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py- LOG.info("Flattening")
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py- m = interpolation.zoom(image, self.parameter['zoom'])
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py: m = filters.percentile_filter(
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py- m, self.parameter['perc'], size=(self.parameter['range'], 2))
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py: m = filters.percentile_filter(
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py- m, self.parameter['perc'], size=(2, self.parameter['range']))
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py- m = interpolation.zoom(m, 1.0/self.parameter['zoom'])
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py- if self.parameter['debug'] > 0:
grep: ./ocrd_olena/repo/olena/dynamic-use-of-static-c++/swig/python/ltihooks.py: No such file or directory
--
./ocrd_cis/ocrd_cis/ocropy/common.py- warnings.simplefilter('ignore')
./ocrd_cis/ocrd_cis/ocropy/common.py- # calculate at reduced pixel density to save CPU time
./ocrd_cis/ocrd_cis/ocropy/common.py- m = interpolation.zoom(image, zoom, mode='nearest')
./ocrd_cis/ocrd_cis/ocropy/common.py: m = filters.percentile_filter(m, perc, size=(range_, 2))
./ocrd_cis/ocrd_cis/ocropy/common.py: m = filters.percentile_filter(m, perc, size=(2, range_))
./ocrd_cis/ocrd_cis/ocropy/common.py- m = interpolation.zoom(m, 1. / zoom)
./ocrd_cis/ocrd_cis/ocropy/common.py- ##w, h = np.minimum(np.array(image.shape), np.array(m.shape))
./ocrd_cis/ocrd_cis/ocropy/common.py- ##flat = np.clip(image[:w, :h] - m[:w, :h] + 1, 0, 1)
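For context, the `percentile_filter` lines found by the grep are ocropus-nlbin's "flattening" step (local white-level estimation). A condensed sketch, with the upstream defaults assumed from memory (`zoom=0.5`, `perc=80`, `range=20`):

```python
import numpy as np
from scipy.ndimage import zoom as ndzoom, percentile_filter

def flatten_whitelevel(gray, zoom=0.5, perc=80, range_=20):
    image = gray.astype(float)
    image = (image - image.min()) / max(1e-6, image.max() - image.min())  # 0..1, paper ~ 1
    m = ndzoom(image, zoom)                           # reduced resolution to save CPU time
    m = percentile_filter(m, perc, size=(range_, 2))  # local white level, vertical pass
    m = percentile_filter(m, perc, size=(2, range_))  # ... and horizontal pass
    m = ndzoom(m, 1.0 / zoom)
    w, h = np.minimum(np.array(image.shape), np.array(m.shape))
    return np.clip(image[:w, :h] - m[:w, :h] + 1, 0, 1)  # paper ~ 1, ink stays dark
```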
@jbarth-ubhd I'm not sure what you want to say with this. But here's a comparison of both wrappers for old ocropus-nlbin:
| anybaseocr-binarize | cis-ocropy-binarize |
|---|---|
| OCR-D wrapper only formally correct | OCR-D wrapper adequate |
| exposes all params as is | exposes only relevant ones, controls for others (e.g. zoom via DPI) |
| original code without changes | includes some fixes:<br>• pixel-correct image size<br>• robustness against NaN<br>• zoom and plausibilise sizes relative to DPI<br>• opt-in for additional deskewing and/or denoising<br>• opt-in for grayscale normalization |
> Should be the same as `ocrd-cis-ocropy-binarize`, right?
Didn't know if ocropus-nlbin is the same as cis-ocropy-binarize, so I tried to find out and found some lines that look very similar.
> Should be the same as `ocrd-cis-ocropy-binarize`, right?
>
> Didn't know if ocropus-nlbin is the same as cis-ocropy-binarize, so I tried to find out and found some lines that look very similar.
Oh, now I got it. (Sorry, have never found the time to summarise my ocropy/ocrolib changes and re-expose the old ocropus CLIs from it.)
My question was actually whether the picture looks any different when you use the OCR-D wrapper.
I am surprised to see the following in our current recommendations:
- `skimage` binarize/denoise processors instead of Olena/Ocropy

EDIT (thanks @jbarth-ubhd for reminding me): also …
Do these choices have some empirical grounding (measuring quality and/or performance on GT)?