OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
118 stars 31 forks source link

save_image_file should set DPI for derived images #343

Open bertsky opened 4 years ago

bertsky commented 4 years ago

Currently, any information on image resolution provided in the original image (and made available via OcrdExif in Workspace.image_from_page) is ignored when saving derived images in the workspace (via Workspace.save_image_file). Due to PIL.Image format internals, the PNG then contains a setting of 72 DPI however. This might create problems for processors that look at the derived image files alone.

But this is hard to fix in core: the image passed to save_image_file could come from anywhere (and usually does not have an info['dpi']; even simple PIL.Image operations omit that in the result).

Realistically though, it will have been created some way from the source image file under the same pageId, and since rescaling is currently not permitted in the spec, one could assume the same DPI for all derived images.

wrznr commented 4 years ago

and usually does not have an info['dpi']

Why can't we check whether such information is available? Copy it, if that's the case and delete the default value if not? We should definitely not write default values which we know to be wrong (I added the label bug).

bertsky commented 4 years ago

and usually does not have an info['dpi']

Why can't we check whether such information is available?

We could, but as I said: usually it is not. Or should I say: almost always. Any non-mutating PIL.Image operation (even those that wouldn't affect the meta-data) omits info from the result. We cannot change that. And we cannot force processors to give us much more than what PIL.Image does.

We should definitely not write default values which we know to be wrong (I added the label bug).

Okay, but that's probably a bug in Pillow's save for the PNG format.

bertsky commented 4 years ago

We should definitely not write default values which we know to be wrong (I added the label bug).

Okay, but that's probably a bug in Pillow's save for the PNG format.

Oh, just found that this appears to have been fixed in Pillow 6.2.1. Since we cannot likely get any better, I will close for now.

kba commented 4 years ago

https://github.com/python-pillow/Pillow/blob/master/CHANGES.rst#621-2019-10-21

Should we update to 6.2.1 then? Do you have a sample for me to test? Thanks!

bertsky commented 4 years ago

Sorry, I was imprecise: It could have been fixed in an earlier version already.

I saw the following with 5.4.1 (leading up to this issue):

python -c "import PIL.Image; PIL.Image.open('repo/data/assets/scribo-test/data/OCR-D-IMG/OCR-D-IMG-orig_tiff.tif').save('test.png')"
identify -verbose test.png | grep Resolution:
 Resolution: 72x72

However, this does not happen any more with 6.2.1 (where correctly no DPI is saved).

bertsky commented 4 years ago

@wrznr just convinced me that we should indeed take action to ensure core and modules comply with the spec (which requires PPI information to be kept for derived images, cf. OCR-D/spec#137).

bertsky commented 4 years ago

Not so sure about the time frame for this though. Since it involves patching all modules, I guess the final workshop is out of the question. Setting 3.0 to take out the pressure.

bertsky commented 4 years ago

Since it involves patching all modules

If we just assume the derived image passed to save_image_file must come from the (cropped/deskewed/dewarped/binarized/denoised/...) original without rescaling somehow, then we'll only need to patch core (finding the original image with the same pageId and looking at its OcrdExif). Of course we could still allow processors to specify a DPI themselves (in case they did rescale, or just to speed up).

bertsky commented 2 months ago

If we just assume the derived image passed to save_image_file must come from the (cropped/deskewed/dewarped/binarized/denoised/...) original without rescaling somehow, then we'll only need to patch core (finding the original image with the same pageId and looking at its OcrdExif). Of course we could still allow processors to specify a DPI themselves (in case they did rescale, or just to speed up).

Alternatively, we could inject DPI info into the coords dict (along with affine transform and image features) at the top image_from_page and then make sure it gets passed down with each image_from_segment. Since images are only ever useful along with their coordinates, and image operations usually have to update that dict consistently (for example when rescaling), any caller of Workspace.save_image_file should have the DPI info nearby.