OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

add image preprocessing steps #159

Open bertsky opened 4 years ago

bertsky commented 4 years ago

IMO there is a large, still unmet demand in OCR-D for image preprocessing tools to

  1. color-normalize raw images (e.g. linear or non-linear contrast stretching, gamma correction)
  2. denoise raw images (i.e. luminance/grayscale or color denoising before binarization)
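To make the first item concrete, here is a minimal sketch of percentile-based linear contrast stretching and gamma correction on an 8-bit grayscale image. It is plain Python on nested lists for illustration only; the function names, the percentile defaults and the gamma value are assumptions, not part of any existing OCR-D processor, and a real tool would presumably use an array library.

```python
def stretch_contrast(image, low_pct=1.0, high_pct=99.0):
    """Linearly map the [low_pct, high_pct] percentile range of an
    8-bit grayscale image (list of rows) onto the full 0..255 range."""
    values = sorted(v for row in image for v in row)
    lo = values[int((len(values) - 1) * low_pct / 100)]
    hi = values[int((len(values) - 1) * high_pct / 100)]
    if hi == lo:
        # Flat image: nothing to stretch, return a copy.
        return [row[:] for row in image]
    scale = 255.0 / (hi - lo)
    # Values outside the percentile range are clipped to 0 or 255.
    return [[min(255, max(0, round((v - lo) * scale))) for v in row]
            for row in image]


def gamma_correct(image, gamma=0.8):
    """Apply v -> 255 * (v / 255) ** gamma via a lookup table.
    gamma < 1 brightens midtones, gamma > 1 darkens them."""
    lut = [round(255 * (i / 255) ** gamma) for i in range(256)]
    return [[lut[v] for v in row] for row in image]


# Hypothetical faded scan that only uses gray values 90..170.
faded = [[90, 100, 110],
         [120, 130, 140],
         [150, 160, 170]]
stretched = stretch_contrast(faded, low_pct=0, high_pct=100)
brightened = gamma_correct(stretched, gamma=0.8)
```

After stretching, the darkest pixel sits at 0 and the brightest at 255, which is the full dynamic range that the binarization step expects.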

Most binarization algorithms depend on this. For example, Sauvola (unless it exposes the R parameter and one can estimate a good fit from the image dynamics) assumes full dynamic range.
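The dependence on R can be illustrated with the usual Sauvola formula, T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation. The sketch below evaluates it for a single window; `k=0.5` and `R=128` are the commonly cited defaults for 8-bit images and are taken here as assumptions, and the pixel values are made up.

```python
from statistics import mean, pstdev


def sauvola_threshold(window, k=0.5, R=128.0):
    """Sauvola threshold for one local window of gray values:
    T = m * (1 + k * (s / R - 1))."""
    m = mean(window)
    s = pstdev(window)
    return m * (1 + k * (s / R - 1))


# Window with (nearly) full dynamic range: ink at 20, paper at 230.
full_range = [20, 20, 230, 230]
# Low-contrast window from a faded scan: ink at 100, paper at 140.
low_contrast = [100, 100, 140, 140]

t_full = sauvola_threshold(full_range)           # falls between ink and paper
t_low = sauvola_threshold(low_contrast)          # falls below the ink value
t_low_fit = sauvola_threshold(low_contrast, R=20.0)  # R fitted to the window
```

With the default R, the threshold for the low-contrast window ends up below the ink value, so a "darker than T" rule would classify everything as background. Either stretching the image first or fitting R to the image dynamics puts the threshold back between ink and paper.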

So how about adding the following:

bertsky commented 4 years ago
> * in the METS specs, new `fileGrp/@USE` name recommendations `OCR-D-IMG-NORM` and `OCR-D-IMG-RAWDEN`

should now read: OCR-D-PRE-NORM and OCR-D-PRE-RAWDEN

> * in the PAGE specs, new `AlternativeImage/@comments` classes `normalized` and `raw-denoised`

Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...
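As a rough illustration of how the proposed names might show up in a PAGE file, the sketch below builds a `Page` element with two `AlternativeImage` entries using the standard library. The file paths are hypothetical, the required `imageWidth`/`imageHeight` attributes are omitted for brevity, and joining several classes with commas in `@comments` is an assumption about the convention, not something settled in this thread.

```python
import xml.etree.ElementTree as ET

# PAGE 2019-07-15 namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def add_alternative_image(page_elem, filename, features):
    """Append an AlternativeImage whose @comments lists the
    preprocessing classes already applied to that derived image."""
    alt = ET.SubElement(page_elem, f"{{{PAGE_NS}}}AlternativeImage")
    alt.set("filename", filename)
    alt.set("comments", ",".join(features))
    return alt


# Hypothetical paths; the file group names follow the proposal above.
page = ET.Element(f"{{{PAGE_NS}}}Page",
                  imageFilename="OCR-D-IMG/page_0001.tif")
add_alternative_image(page, "OCR-D-PRE-NORM/page_0001.png",
                      ["normalized"])
add_alternative_image(page, "OCR-D-PRE-RAWDEN/page_0001.png",
                      ["normalized", "raw-denoised"])

xml_text = ET.tostring(page, encoding="unicode")
```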

kba commented 4 years ago

> should now read: OCR-D-PRE-NORM and OCR-D-PRE-RAWDEN

:+1:

> Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...

IMHO "raw denoising" is clearer than distinguishing despeckling/denoising. Then again, our glossary currently defines despeckling as

> Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.

And "denoise" is not introduced at all. So, we're free to define it as you proposed. @EEngl52 any objection?

bertsky commented 4 years ago

> Then again, our glossary currently defines despeckling as
>
> > Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.

Oh, but these physical artifacts cannot be reliably removed after binarization IMHO. You need special detectors on raw colors. So if that's the term OCR-D (or the OCR community in general) has agreed upon, let's stick to that, and not project any other interpretation. In that sense I think we still have no despeckling processors yet.

> And "denoise" is not introduced at all.

Then let's define it! Let's also differentiate between raw and bilevel denoising.
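One way to picture the raw/bilevel distinction: raw denoising works on gray (or color) values before binarization, while bilevel denoising works on the 0/1 image afterwards. The sketch below uses a median filter for the former and removal of isolated foreground pixels for the latter. Both are stand-in techniques chosen for illustration; the function names are hypothetical and do not correspond to existing processors.

```python
from statistics import median


def median_denoise(gray, radius=1):
    """Raw denoising: replace each gray value by the median of its
    (2*radius+1)^2 neighbourhood, truncated at the image border."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(int(median(window)))
        out.append(row)
    return out


def remove_isolated_pixels(binary):
    """Bilevel denoising: clear foreground pixels (1) that have no
    8-connected foreground neighbour."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1:
                continue
            neighbours = sum(
                binary[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
                if (j, i) != (y, x))
            if neighbours == 0:
                out[y][x] = 0
    return out


# Gray image with one dark outlier pixel in the middle.
gray = [[200, 200, 200],
        [200, 0, 200],
        [200, 200, 200]]
smoothed = median_denoise(gray)

# Binary image with one isolated pixel and one two-pixel stroke.
binary = [[0, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0],
          [1, 1, 0, 0]]
cleaned = remove_isolated_pixels(binary)
```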

EEngl52 commented 4 years ago

IMO we could differentiate denoising/despeckling. But then the processors should be named accordingly. I would find it quite confusing to use a processor called denoising in a workflow step called despeckling. So it would probably be easier to go with @bertsky's last suggestion on raw and bilevel denoising and to actually define denoising in the glossary.

bertsky commented 4 years ago

> But then the processors should be named accordingly. I would find it quite confusing to use a processor called denoising in a workflow step called despeckling.

Absolutely. Since despeckling was all we had, the current denoising processors all use that (in @comments and tool json):

We should open respective issues in those repos, and in the workflow guide of course.

> > And "denoise" is not introduced at all.
>
> Then let's define it! Let's also differentiate between raw and bilevel denoising.

So how about: