relax DPI metadata requirement for derived images

OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)

mets.md has stated the following since its very first version:

Every processing step that generates new images and changes their dimensions MUST make sure to adapt the density explicitly when serialising the image.
$> exiftool input.tif |grep 'X Resolution'
"300"

# WRONG (ppi unchanged)
$> convert input.tif -resize 50% output.tif

# RIGHT:
$> convert input.tif -resize 50% -density 150 -unit inches output.tif

$> exiftool output.tif |grep 'X Resolution'
"150"

However, because this seemingly simple requirement is very hard to abide by with PIL.Image in Python, in practice not a single OCR-D processor has fulfilled it so far. The issue surfaced in OCR-D/core#343, where we agreed that it is better to write no density metadata at all than to write false default values.

So I think we should relax the MUST to a SHOULD. We can forgive ourselves this one if we can concede the same for others:

However, since technical metadata about pixel density is so often lost in conversion or inaccurate, processors should assume 300 ppi for images with missing or suspiciously low pixel density metadata.

Now, that backoff position may work for original image data, but it would be an unnecessary inconsistency for derived images: For derived images with missing density meta-data, processors should assume the same density they already assumed or believed for the original image.

Okay, after consulting with @wrznr I now believe that, on the contrary, derived images must indeed keep their DPI metadata (whether or not the values can be trusted, and wherever they come from). So instead of relaxing the spec, we should increase our efforts to get this right within core itself, and then make sure the modules really abide by it as well. Because PIL.Image does not keep this information across operations, in practice this entails:

- memorizing the density tag of the original image when opening the input file
- (when rescaling: adapting the density value as well)
- re-applying the density tag when writing derived images
- adding a corresponding error to the PAGE validator so we can fix transgressors
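
The first three steps above could look roughly like this with Pillow. This is only a sketch under assumptions: the helper names `read_density`, `resize_keeping_density` and `save_with_density` are hypothetical and not part of core, and it relies on Pillow exposing the density as `info["dpi"]` after loading and writing it only when `dpi=` is passed to `save()`.

```python
from PIL import Image


def read_density(image):
    """Return the (x, y) density in ppi that Pillow found, or None if absent."""
    dpi = image.info.get("dpi")
    if dpi is None:
        return None
    return float(dpi[0]), float(dpi[1])


def resize_keeping_density(image, factor):
    """Scale an image by `factor` and return (resized image, adapted density)."""
    # Step 1: memorise the density before doing anything else.
    density = read_density(image)
    new_size = (
        max(1, round(image.width * factor)),
        max(1, round(image.height * factor)),
    )
    resized = image.resize(new_size)
    # Step 2: half the pixels over the same physical size means half the ppi.
    if density is not None:
        density = (density[0] * factor, density[1] * factor)
    return resized, density


def save_with_density(image, target, density, fmt=None):
    """Step 3: re-apply the density explicitly when serialising."""
    if density is None:
        # Write no density at all rather than a false default value.
        image.save(target, format=fmt)
    else:
        image.save(target, format=fmt, dpi=density)
```

Used on the example from the spec, `resize_keeping_density(Image.open("input.tif"), 0.5)` on a 300 ppi input would hand back a density of 150 ppi, which `save_with_density` then writes into `output.tif`.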

What is still wrong here, however, is the missing distinction between original and derived images in the second paragraph:

However, since technical metadata about pixel density is so often lost in conversion or inaccurate, processors should assume 300 ppi for images with missing or suspiciously low pixel density metadata.

For derived images, if the PPI is missing, then at least there should be a fallback to the PPI of the original image. And if the PPI of the original image has been rejected, so should the values of the derived images.
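
To make the proposed rule concrete, here is a minimal sketch of the decision logic. Several things in it are assumptions rather than anything the spec defines: the threshold `MIN_PLAUSIBLE_PPI` for "suspiciously low", the function names, and the optional `scale` argument (ratio of derived to original pixel dimensions), which goes slightly beyond the plain fallback described above.

```python
DEFAULT_PPI = 300.0        # fallback value named in the spec text
MIN_PLAUSIBLE_PPI = 100.0  # hypothetical cut-off for "suspiciously low"


def is_trusted(stated_ppi):
    """True if a density value is present and not suspiciously low."""
    return stated_ppi is not None and stated_ppi >= MIN_PLAUSIBLE_PPI


def effective_original_ppi(stated_ppi):
    """Density to assume for an original image."""
    return float(stated_ppi) if is_trusted(stated_ppi) else DEFAULT_PPI


def effective_derived_ppi(stated_derived, stated_original, scale=1.0):
    """Density to assume for a derived image.

    The derived image's own value is used only if it is present and the
    original's value was not rejected. Otherwise fall back to whatever was
    assumed for the original, adjusted by `scale` (assumption: 1.0 means the
    derived image has the same pixel dimensions as the original).
    """
    if stated_derived is not None and is_trusted(stated_original):
        return float(stated_derived)
    return effective_original_ppi(stated_original) * scale


print(effective_derived_ppi(None, 600))            # missing: inherit original
print(effective_derived_ppi(36, 72))               # original rejected: reject both
print(effective_derived_ppi(None, 300, scale=0.5)) # missing on a half-size copy
```

The point of routing both cases through `effective_original_ppi` is that a derived image can never end up with a different belief about the scan resolution than the original it came from.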

relax DPI metadata requirement for derived images #137