Open bertsky opened 4 years ago
Okay, after consulting with @wrznr I now believe that, on the contrary, derived images must indeed keep their DPI meta-data (whether or not these are to be trusted, and wherever they come from). So instead of relaxing the spec we should increase our efforts to get this right within core itself, and then make sure the modules really abide by it as well. Because PIL.Image does not keep this information across operations, this in practice entails:
However, what is still wrong here is the missing distinction between original and derived images in the second paragraph:
However, since technical metadata about pixel density is so often lost in conversion or inaccurate, processors should assume 300 ppi for images with missing or suspiciously low pixel density metadata.
For derived images, if the PPI is missing, then at least there should be a fallback to the PPI of the original image. And if the PPI of the original image has been rejected, so should the values of the derived images.
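To make the fallback rule above concrete, here is a minimal sketch in plain Python. It is not taken from core or from any processor: the function names, and in particular the `MIN_PLAUSIBLE_PPI` threshold for "suspiciously low", are assumptions made for illustration only. Treating a suspiciously low value on the derived image like a missing one is also an assumption.

```python
from typing import Optional

DEFAULT_PPI = 300       # fallback value named in the quoted spec text
MIN_PLAUSIBLE_PPI = 72  # assumption: cut-off for "suspiciously low", not from the spec


def is_trusted(ppi: Optional[float]) -> bool:
    """Return True if a density value is present and not suspiciously low."""
    return ppi is not None and ppi >= MIN_PLAUSIBLE_PPI


def effective_ppi(original_ppi: Optional[float],
                  derived_ppi: Optional[float] = None) -> float:
    """Pick the pixel density to assume for a derived image (hypothetical helper)."""
    if not is_trusted(original_ppi):
        # The original's density was rejected, so the derived value is rejected too.
        return DEFAULT_PPI
    if not is_trusted(derived_ppi):
        # Missing (or implausible) derived density: fall back to the original's.
        return original_ppi
    return derived_ppi
```

With these assumptions, `effective_ppi(600)` yields the original's 600, while `effective_ppi(None, 600)` yields 300, because the derived value is discarded together with the original's.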
In mets.md, the following has been stated ever since the very first version:
However, because this seemingly simple requirement is very hard to abide by with `PIL.Image` in Python, in practice not a single OCR-D processor has fulfilled it so far. The issue surfaced in OCR-D/core#343, where we agreed to rather write no density meta-data whatsoever than false default values. So I think we should relax the MUST to a SHOULD. We can forgive ourselves this one if we can concede the same for others:
Now, that backoff position may work for original image data, but it would be an unnecessary inconsistency for derived images: For derived images with missing density meta-data, processors should assume the same density they already assumed or believed for the original image.