Open bertsky opened 4 years ago
* in the METS specs, new `fileGrp/@USE` name recommendations `OCR-D-IMG-NORM` and `OCR-D-IMG-RAWDEN`
should now read: `OCR-D-PRE-NORM` and `OCR-D-PRE-RAWDEN`
* in the PAGE specs, new `AlternativeImage/@comments` classes `normalized` and `raw-denoised`
Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...
> should now read: `OCR-D-PRE-NORM` and `OCR-D-PRE-RAWDEN`
:+1:
> Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...
IMHO "raw denoising" is clearer than distinguishing despeckling/denoising. Then again, our glossary currently defines despeckling as:

> Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.
And "denoise" is not introduced at all. So, we're free to define it as you proposed. @EEngl52 any objection?
> Then again, our glossary currently defines despeckling as
> > Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.
Oh, but these physical artifacts cannot be reliably removed after binarization IMHO. You need special detectors on raw colors. So if that's the term OCR-D (or the OCR community in general) has agreed upon, let's stick to that, and not project any other interpretation. In that sense I think we still have no despeckling processors yet.
> And "denoise" is not introduced at all.
Then let's define it! Let's also differentiate between raw and bilevel denoising.
IMO we could differentiate denoising/despeckling. But then the processors should be named accordingly. I would find it quite confusing to use a processor called `denoising` in a workflow step called `despeckling`. So it would probably be easier to go with @bertsky's last suggestion on raw and bilevel denoising and to actually define denoising in the glossary.
> But then the processors should be named accordingly. I would find it quite confusing to use a processor called `denoising` in a workflow step called `despeckling`.
Absolutely. Since `despeckling` was all we had, the current denoising processors all use that (in `@comments` and tool json):

* `ocrd-cis-ocropy-denoise`, `ocrd-cis-ocropy-binarize`
* `ocrd-skimage-denoise`, `ocrd-skimage-denoise-raw`

We should open respective issues in those repos, and in the workflow guide of course.
> > And "denoise" is not introduced at all.
>
> Then let's define it! Let's also differentiate between raw and bilevel denoising.
So how about:

* `AlternativeImage/@comments` classes `normalized` and `denoised` (rather than `raw-denoised`, since we now require ordering anyway, so we should see things like `denoised,binarized,denoised`)
* `tool/steps` enum types `preprocessing/optimization/normalization` (which is different from `grayscale_normalization`), `preprocessing/optimization/raw-denoising` (which is different from `despeckling`) and `preprocessing/optimization/binary-denoising`
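To illustrate why a single `denoised` class could suffice once `@comments` is ordered, here is a minimal Python sketch. It is not taken from any OCR-D module: the function name `classify_denoising` and the labels it returns are hypothetical, and it assumes `@comments` is a comma-separated list in processing order.

```python
from typing import List


def classify_denoising(comments: str) -> List[str]:
    """Resolve each 'denoised' entry as raw or binary denoising.

    Hypothetical helper: assumes `comments` is an ordered, comma-separated
    @comments value such as "denoised,binarized,denoised".
    """
    labels = []
    seen_binarized = False
    for feature in (part.strip() for part in comments.split(",")):
        if not feature:
            continue  # skip empty entries, e.g. from a trailing comma
        if feature == "binarized":
            seen_binarized = True
            labels.append(feature)
        elif feature == "denoised":
            # Position relative to 'binarized' decides the kind of denoising.
            labels.append("binary-denoised" if seen_binarized else "raw-denoised")
        else:
            labels.append(feature)
    return labels


print(classify_denoising("denoised,binarized,denoised"))
# ['raw-denoised', 'binarized', 'binary-denoised']
```

The point of the sketch is only that the raw/binary distinction is recoverable from order, so it need not be encoded in the class name itself.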
IMO there is a large, still unmet demand in OCR-D for image preprocessing tools to
Most binarization algorithms depend on this. For example, Sauvola (unless it exposes the R parameter and one can estimate a good fit from the image dynamics) assumes full dynamic range.
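The dependence on dynamic range can be sketched with the Sauvola threshold formula, T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation. The snippet below is a rough illustration, not code from any OCR-D processor: the function names are made up, `k=0.2` and `R=128` are commonly used values assumed here, the percentile stretch is just one possible normalization, and the whole patch is treated as a single window instead of sliding a window over the image.

```python
import numpy as np


def sauvola_threshold(window, k=0.2, R=128.0):
    """Sauvola threshold for one window: T = m * (1 + k * (s / R - 1))."""
    m = window.mean()
    s = window.std()
    return m * (1.0 + k * (s / R - 1.0))


def stretch(image, low=1, high=99):
    """Percentile-based contrast stretch to the full 0..255 range (assumed normalization)."""
    lo, hi = np.percentile(image, [low, high])
    if hi <= lo:
        return np.zeros(image.shape, dtype=float)  # flat image, nothing to stretch
    scaled = (image.astype(float) - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0) * 255.0


# Synthetic low-contrast patch: faint ink (110) on a grayish background (140).
patch = np.full((8, 8), 140.0)
patch[:, ::2] = 110.0

raw_foreground = int((patch < sauvola_threshold(patch)).sum())
normalized = stretch(patch)
norm_foreground = int((normalized < sauvola_threshold(normalized)).sum())

print(raw_foreground, norm_foreground)  # 0 32
```

With the fixed `R`, the low-contrast patch yields no foreground pixels at all, while the stretched patch recovers all 32 ink pixels. Estimating `R` from the image dynamics would be the alternative to normalizing first.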
So how about adding the following:

* `fileGrp/@USE` name recommendations `OCR-D-IMG-NORM` and `OCR-D-IMG-RAWDEN`
* `AlternativeImage/@comments` classes `normalized` and `raw-denoised`
* `tool/steps` enum types `preprocessing/optimization/normalization` (which is different from `grayscale_normalization`) and `preprocessing/optimization/raw-denoising` (which is different from binary `despeckling`)