OCR-D / ocrd-website


fix/discuss recommended workflows #172

Open bertsky opened 3 years ago

bertsky commented 3 years ago

I am surprised to see the following in our current recommendations:

EDIT (thanks @jbarth-ubhd for reminding me): also

Do these choices have some empirical grounding (measuring quality and/or performance on GT)?

jbarth-ubhd commented 3 years ago

If using Olena binarization (the other methods were tested long before), I would recommend wolf:

("Wiener" snippet is from a larger image. Within 1 method left-to-right: +noise. Lines of same contrast at various levels (always dark on bright in this experiment) )

[image: binarization comparison grid]

PS: Default settings applied.

PS2: One could argue "preprocess low-contrast first", but how to do this without knowing noise levels to skip etc... I think this is the primary task for binarization

jbarth-ubhd commented 3 years ago

But in Step 5 cis-ocropy-deskew is mentioned (but not "recommended" in Step 9?)

jbarth-ubhd commented 3 years ago

Hmm.. we are talking about https://ocr-d.de/en/workflows ?

The "Recommendations" at the end of that page?

bertsky commented 3 years ago

Hmm.. we are talking about https://ocr-d.de/en/workflows ?

The "Recommendations" at the end of that page?

Yes!

But in Step 5 cis-ocropy-deskew is mentioned (but not "recommended" in Step 9?)

I don't mind that it is not recommended in step 9, because IMHO on the region level orientation is more important than skew, and can differ between region and page, whereas skew is usually uniform across a page (otherwise you usually need dewarping anyway).

The above list really only concerns the overall workflow configuration recommendations.

If using Olena binarization (the other methods were tested long before), I would recommend wolf

Thanks again for that test suite and for paying attention to that front in general, which is not appreciated enough IMO. We've briefly discussed this in the chat already, and I would like to elaborate on some open points:

  1. The example image is interesting and helpful to see what's going on, but it is also misleading, because it is highly artificial: Algorithms with local thresholding are quite sensitive to highly localized contrast/brightness changes, esp. if they are discontinuous. But realistically, except for the special case of inverted text, these would be spread more widely and continuously across the page. Even more so for noise, which usually appears equally across the page, which is why raw denoising typically measures noise levels globally.
  2. Binarization does not (have to) stand alone. If we know contrast/brightness is far from normal, and noise is perceptible, then we would run normalization and raw denoising before anyway. Of course some algorithms are more robust against either of those than others. But if we want a fair competition, we should eliminate them (because we can).
  3. Most algorithms have 2 degrees of freedom: the window size (influencing locality; dependent on pixel density) and the threshold level (influencing stroke weight). One should allow optimising for them, or at least represent different choices for them. For example, ocrd-skimage-binarize and ocrd-olena-binarize (since v1.1.11) already set the window size automatically based on a DPI rule of thumb by default. (But this requires having correct DPI annotated.) Niblack is one of those algorithms which are extremely sensitive to the correct window size (see the sketch after this list).
  4. As in point 2, but after binarization: Some algorithms produce noise which can be easily removed with a simple binary denoiser. So for a fair comparison all methods should enjoy that benefit.
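
To make point 3 concrete, here is a minimal sketch in plain scikit-image (not the Olena implementation; the one-inch window rule and the default k are assumptions for illustration):

```python
# Sketch of point 3: tie the window size to pixel density (DPI) and expose
# k (stroke weight) explicitly. Plain scikit-image, not the Olena code.
from skimage.filters import threshold_niblack, threshold_sauvola

def local_binarize(gray, dpi=300, k=0.2, method=threshold_sauvola):
    window = int(dpi) | 1  # rule of thumb: ~1 inch of pixels, forced to be odd
    return gray > method(gray, window_size=window, k=k)  # True = background

# e.g. compare local_binarize(img, method=threshold_niblack, k=0.3) etc.
```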

So I think we need a different artificial test bed. And then we also need a set of true images of various appearances and qualities.

PS2: One could argue "preprocess low-contrast first", but how to do this without knowing noise levels to skip etc... I think this is the primary task for binarization

I disagree. IMO there should be specialisation and modularisation. That way, binarization processors can concentrate on their core problem, and others can try to solve related ones. If a processor chooses to do 2 or 3 steps in one, fine (we've seen this elsewhere), but we should always have the option to combine freely. And that in turn means we must do it for a fair evaluation, too.

jbarth-ubhd commented 3 years ago

1. The example image is interesting and helpful to see what's going on, but it is also misleading, because it is highly artificial: Algorithms with local thresholding are quite sensitive to highly localized contrast/brightness changes, esp. if they are discontinuous.

The word "Wiener" is just a snippet from a 1238 × 388 px image, but I now realize that I forgot to set the DPI. [image]

2. Binarization does not (have to) stand alone. If we know contrast/brightness is far from normal, and noise is perceptible, then we would run normalization and raw denoising before anyway. Of course some algorithms are more robust against either of those than others. But if we want a fair competition, we should eliminate them (because we can).

We have images with weak contrast, partly with black surrounding background. Then we must do preprocessing after the clipping step, too, I assume.

Will redo with DPI set.

jbarth-ubhd commented 3 years ago

With imagemagick convert -contrast-stretch 1%x7% # assumption: black on white and 300 DPI:

[image: binarization comparison after contrast stretch]

bertsky commented 3 years ago

The word "Wiener" is just a snippet from a 1238 × 388 px image, but I now realize that I forgot to set the DPI.

Wow! They rarely come in as pristine in quality as this!

We have images with weak contrast, partly with black surrounding background. Then we must do preprocessing after the clipping step, too, I assume.

If you mean cropping instead of clipping, then yes. And contrast/brightness normalization afterwards.

With imagemagick convert -contrast-stretch 1%x7% # assumption: black on white and 300 DPI:

Thanks! Very interesting. You immediately see the class of algorithms that are most sensitive to level dynamics: the Sauvola type (including Kim). They are usually implemented without determining the R parameter from the input. (For example, Olena/Scribo just uses #define SCRIBO_DEFAULT_SAUVOLA_R 128 in a uint8 space.)
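
For reference, a worked example (the standard Sauvola formula, not any particular implementation; the numbers are made up): the threshold is t = m · (1 + k · (s/R − 1)) with local mean m, local standard deviation s, and R the assumed dynamic range of s. If R stays fixed at 128 while the input has weak dynamics, s/R stays small and t collapses towards m · (1 − k), so faint ink ends up above the threshold and is whited out:

```python
# Worked example of the Sauvola threshold t = m * (1 + k * (s/R - 1)).
# All numbers are illustrative only.
k, R, m = 0.34, 128.0, 100.0        # fixed R in uint8 space, local mean 100
for s in (64.0, 10.0):              # healthy vs. weak local contrast
    t = m * (1 + k * (s / R - 1))
    print(f"s={s:5.1f}  t={t:5.1f}")  # 83.0 vs. 68.7: faint ink at ~75 is lost
```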

But just to be sure: did you use the window size zero default for Olena (which should set it to the odd-valued number closest to DPI)? I would expect Niblack to look somewhat better...

Also, your contrast-stretch recipe is somewhat different to normalize as used by Fred's IM script textcleaner or by one of ocrd_wrap's presets. I wonder what made you consider increasing the white-out tolerance up to 7%. Perhaps we can get to an OCR-specific optimum preset here?

jbarth-ubhd commented 3 years ago

cropping instead of clipping

yes

did you use the window size zero default for Olena

I did not add any parameter -- defaults only.

I wonder what made you consider increasing the white-out tolerance up to 7%

from here: https://sourceforge.net/p/localcontrast/code/ci/default/tree/ctmf.c , line 362.

7/1 = white/black pixel ratio when using the "Stempel Garamond" font (black on white) with reasonable leading. This is what I usually do when I don't know better. The basic idea is that the black and white pixels are Gaussian-distributed and far enough apart, so I clip "equally" at both ends.
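
A worked example of that reasoning (my own reconstruction, for illustration): with a 7:1 white-to-black ratio, black pixels make up 1/8 of the page and white pixels 7/8, so clipping 1% of all pixels at the black end and 7% at the white end removes the same 8% of each class:

```python
# Reconstruction of the 1%x7% argument (assumed class proportions).
black_share, white_share = 1 / 8, 7 / 8   # white:black = 7:1
clip_black, clip_white = 0.01, 0.07       # convert -contrast-stretch 1%x7%
print(clip_black / black_share)           # 0.08 -> 8% of the black class
print(clip_white / white_share)           # 0.08 -> 8% of the white class
# halving the budget gives -contrast-stretch 0.5%x3.5% by the same argument
```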

bertsky commented 3 years ago

from here: https://sourceforge.net/p/localcontrast/code/ci/default/tree/ctmf.c , line 362.

7/1 = white/black pixel ratio when using the "Stempel Garamond" font (black on white) with reasonable leading. This is what I usually do when I don't know better. The basic idea is that the black and white pixels are Gaussian-distributed and far enough apart, so I clip "equally" at both ends.

Oh I see. So IIUC your logic goes:

So you could have used convert -contrast-stretch 0.5%x3.5% by the same argument, right?

Also, I'd like to verify that ratio point for concrete scans, because on average I assume initials, separators, ornaments and images will increase the share of black. (I'll make a coarse measurement on a real corpus to check.)

jbarth-ubhd commented 3 years ago

Yes, this is the idea.

An interesting binarization approach with many more equations, which someone else pointed me to: https://arxiv.org/pdf/2007.07350.pdf

bertsky commented 3 years ago

An interesting binarization approach with many more equations, which someone else pointed me to: https://arxiv.org/pdf/2007.07350.pdf

Yes, we've briefly discussed that in the Lobby. Here is the implementation. Unfortunately, it does not combine with existing local thresholding algorithms yet.

bertsky commented 3 years ago

So I think we need a different artificial test bed.

As a first step, @jbarth-ubhd could you please change your code to do each point in your matrix (i.e. noise columns, brightness rows) on the full image, and only tile it in the final summary image for visualisation?
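
Roughly like this (a sketch with hypothetical helper names and values, not the actual gen/montage.pl code), so that every binarizer sees the whole degraded page and only the display gets tiled:

```python
# Sketch of the proposed protocol: degrade and binarize the FULL page for
# each (level, noise) cell, and crop a fixed snippet only for the summary.
# Helper names, crop box and value grids are hypothetical.
import numpy as np

def degrade(page, fg, bg, noise, rng=np.random.default_rng(0)):
    # map ink to gray level fg and paper to bg, then add Gaussian noise
    out = bg + (fg - bg) * (1.0 - page)        # page: 1.0 = white, 0.0 = ink
    return np.clip(out + rng.normal(0.0, noise, page.shape), 0.0, 1.0)

def summary(page, binarize, snippet=(slice(100, 180), slice(300, 620))):
    rows = []
    for fg, bg in ((0.0, 0.4), (0.3, 0.7), (0.6, 1.0)):    # same contrast, shifted levels
        cols = []
        for noise in (0.0, 0.05, 0.1):                     # noise columns
            full = binarize(degrade(page, fg, bg, noise))  # process the full image
            cols.append(full[snippet])                     # tile only for display
        rows.append(np.hstack(cols))
    return np.vstack(rows)
```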

jbarth-ubhd commented 3 years ago

Yes, this is what I've done: process the "full" 1238 × 388 px image (from PDF, 300 DPI, DIN A7) and extract the word "Wiener" for compact comparison.

bertsky commented 3 years ago

Yes, this is what I've done: process the "full" 1238 × 388 px image (from PDF, 300 DPI, DIN A7) and extract the word "Wiener" for compact comparison.

Oh, I see! Then I misunderstood you above. In that case you can scratch my point 1 entirely.

So how about running ocrd-skimage-normalize in comparison to your contrast-stretch, running ocrd-skimage-denoise-raw even before that, and running ocrd-skimage-denoise after binarization? That would be points 2 and 4. Finally, point 3 would be running with different window sizes, say 101 (i.e. smaller than the default, more localized) and 401 (i.e. larger), and different thresholds, say k=0.1 (i.e. heavier foreground) and k=0.4 (i.e. lighter foreground). If you point me to your implementation, I can help with a PR/patch...
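
To make that concrete, here is a rough sketch in plain scikit-image of the chain I have in mind (my assumption of roughly what the ocrd-skimage-* processors do under the hood, not their actual code; percentiles and sizes are illustrative):

```python
# Sketch of the proposed comparison chain (points 2-4), plain scikit-image.
import numpy as np
from skimage import exposure, filters, morphology, restoration

def chain(gray, window_size=301, k=0.2):
    # point 2: raw denoising, then contrast normalization (1% black / 7% white clip)
    gray = restoration.denoise_wavelet(gray, rescale_sigma=True)
    lo, hi = np.percentile(gray, (1, 93))
    gray = exposure.rescale_intensity(gray, in_range=(lo, hi))
    # point 3: local thresholding with explicit window size and stroke-weight k
    binary = gray > filters.threshold_sauvola(gray, window_size=window_size, k=k)
    # point 4: binary denoising of leftover speckle
    return ~morphology.remove_small_objects(~binary, min_size=9)

# e.g. compare chain(img, 101, 0.1) against chain(img, 401, 0.4), etc.
```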

jbarth-ubhd commented 3 years ago

https://digi.ub.uni-heidelberg.de/diglitData/v/various-levels-black-white-noise.tgz. Feel free to do anything with it. Sorry, no docs. "gen" generates the various bXXXwXXXnX.ppm from Beethoven-testtext.pgm, downsampling to 25%. Sorry, width+height are hard-coded in gen.c++. Convert the output to .tif. In methods/ run do.cmd, afterwards montage.pl. I could do it on Monday.

bertsky commented 3 years ago

Feel free to do anything with it. Sorry, no docs. "gen" generates the various bXXXwXXXnX.ppm from Beethoven-testtext.pgm, downsampling to 25%. Sorry, width+height are hard-coded in gen.c++. Convert the output to .tif. In methods/ run do.cmd, afterwards montage.pl

Thanks! I'll give it a shot soon.

But first, I must correct myself significantly regarding my previous recommendations on binarization.

Looking at a representative subset of pages from Deutsches Textarchiv, I found that (contrary to what I said before)…

A. re-binarization after cropping or B. binarization after contrast normalization

…may actually impair quality for most algorithms!

Here is a sample image with heavy show-through: [image: OCR-D-IMG_0001]

This is its histogram: [image: grayscale_histogram_explained]

And that is (a section of) the result of binarization in 7 of Olena's algorithms, on

where normalization is ocrd-skimage-normalize (i.e. contrast stretching, now with 1% black-point and 7% white-point clipping by default): [image: tiled-spots]

As you can see:

IMHO the explanation for this is in the above histogram: Cropping to the page will also cut out the true black from the histogram, leaving foreground ink very close to show-through.
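
A quick way to see this in numbers (my own illustration; file name and crop box are hypothetical) is to compare the black point of the full scan with that of the cropped page region:

```python
# Compare histogram percentiles before and after cropping (illustration only).
import numpy as np
from skimage.io import imread

full = imread("OCR-D-IMG_0001.png", as_gray=True)   # hypothetical file name
y0, x0, y1, x1 = 150, 200, 3100, 2300               # hypothetical crop box
page = full[y0:y1, x0:x1]
for name, img in (("full", full), ("cropped", page)):
    lo, hi = np.percentile(img, (1, 99))
    print(f"{name:8s} black point {lo:.2f}  white point {hi:.2f}")
# after cropping, the true black (scanner background) is gone, so the black
# point jumps up right next to the ink and show-through peaks
```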

Where do we go from here? How do we formalise this problem so we can include it in the artificial test set above, and possibly address it in the processor implementations already? Would Generalized Histogram Thresholding help?

jbarth-ubhd commented 3 years ago

I think show-through can't be binarized correctly in all cases. What if this was a blank page and the whole reverse page shows through? Perhaps we could build some statistics over all pages of a book so we can estimate the average minimum (local) contrast, but what then when a page has weak ink...

jbarth-ubhd commented 3 years ago

sbb-binarize:

[image: sbb-binarize result]

jbarth-ubhd commented 3 years ago

no contrast stretch before:

[image: sbb-binarize result without prior contrast stretch]

bertsky commented 3 years ago

Would be interesting to see how sbb-binarize copes with normalized and with cropped images. But the message is already clear: Good neural modelling is superior.

I think show-through can't be binarized correctly in all cases. What if this was a blank page and the whole reverse page shows through?

I don't think this is really an issue in practice. We will need (and have) page classification anyway, and having an empty page class besides title, index, content (or whatever) should not be difficult.

Perhaps we could build some statistics over all pages of a book so we can estimate the average minimum (local) contrast, but what then when a page has weak ink...

Good idea, but robust heuristic binarization needs to be locally adaptive, so it might be difficult to go global even across the page. Perhaps some algorithms are more suited for this than others. And certainly quality estimation will build on such global statistics.

jbarth-ubhd commented 3 years ago

Just for completeness: ocropus-nlbin with -n; not normalized before: [image]

bertsky commented 3 years ago

Just for completeness: ocropus-nlbin with -n; not normalized before:

Should be the same as ocrd-cis-ocropy-binarize, right?

jbarth-ubhd commented 3 years ago
(venv) xx@yy:~/ocrd_all> find . -name "*.py"|egrep -v '/venv/'|xargs grep -3  percentile_filter
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            # if not, we need to flatten it by estimating the local whitelevel
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            LOG.info("Flattening")
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            m = interpolation.zoom(image, self.parameter['zoom'])
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py:            m = filters.percentile_filter(
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-                m, self.parameter['perc'], size=(self.parameter['range'], 2))
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py:            m = filters.percentile_filter(
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-                m, self.parameter['perc'], size=(2, self.parameter['range']))
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            m = interpolation.zoom(m, 1.0/self.parameter['zoom'])
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            if self.parameter['debug'] > 0:
grep: ./ocrd_olena/repo/olena/dynamic-use-of-static-c++/swig/python/ltihooks.py: No such file or directory
--
./ocrd_cis/ocrd_cis/ocropy/common.py-        warnings.simplefilter('ignore')
./ocrd_cis/ocrd_cis/ocropy/common.py-        # calculate at reduced pixel density to save CPU time
./ocrd_cis/ocrd_cis/ocropy/common.py-        m = interpolation.zoom(image, zoom, mode='nearest')
./ocrd_cis/ocrd_cis/ocropy/common.py:        m = filters.percentile_filter(m, perc, size=(range_, 2))
./ocrd_cis/ocrd_cis/ocropy/common.py:        m = filters.percentile_filter(m, perc, size=(2, range_))
./ocrd_cis/ocrd_cis/ocropy/common.py-        m = interpolation.zoom(m, 1. / zoom)
./ocrd_cis/ocrd_cis/ocropy/common.py-    ##w, h = np.minimum(np.array(image.shape), np.array(m.shape))
./ocrd_cis/ocrd_cis/ocropy/common.py-    ##flat = np.clip(image[:w, :h] - m[:w, :h] + 1, 0, 1)
bertsky commented 3 years ago

@jbarth-ubhd I'm not sure what you want to say with this. But here's a comparison of both wrappers for old ocropus-nlbin:

| anybaseocr-binarize | cis-ocropy-binarize |
| --- | --- |
| OCR-D wrapper only formally correct | OCR-D wrapper adequate |
| exposes all params as is | exposes only relevant ones, controls for others (e.g. zoom via DPI) |
| original code without changes | includes some fixes: pixel-correct image size, robustness against NaN, zoom and plausibilise sizes relative to DPI, opt-in for additional deskewing and/or denoising, opt-in for grayscale normalization |

jbarth-ubhd commented 3 years ago

Should be the same as ocrd-cis-ocropy-binarize, right?

I didn't know whether ocropus-nlbin is the same as cis-ocropy-binarize, so I tried to find out and found some lines that look very similar.

bertsky commented 3 years ago

Should be the same as ocrd-cis-ocropy-binarize, right?

I didn't know whether ocropus-nlbin is the same as cis-ocropy-binarize, so I tried to find out and found some lines that look very similar.

Oh, now I got it. (Sorry, I have never found the time to summarise my ocropy/ocrolib changes and re-expose the old ocropus CLIs from it.)

My question was actually whether the picture looks any different when you use the OCR-D wrapper.