bertsky / workflow-configuration

a makefilization for OCR-D workflows, with configuration examples
Apache License 2.0

ocrd-import: resolution when doing pdf conversion #3

Closed bertsky closed 4 years ago

bertsky commented 4 years ago

When ocrd-import converts PDF input files, it uses the default pixel density of 72 DPI (since vector graphics have no native pixel density). This is insufficient for OCR, but the exact value required may depend on the use case. So there should at least be a parameter (e.g. --render-dpi) that sets the density to use for all vector graphics formats. One could then instruct ImageMagick to run convert -density $((2*$render_dpi)) input.pdf -resample $render_dpi output.png, i.e. rasterize at twice the target density and then resample down to it.
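The proposed invocation can be sketched as a small shell helper. This is only an illustration: the function name render_command and the fallback of 300 DPI are assumptions, not part of ocrd-import, and the helper merely prints the ImageMagick command line instead of running it.

```shell
# Hypothetical helper (not part of ocrd-import): print the ImageMagick
# command that renders a vector input at the requested density.
# Arguments: render DPI (assumed fallback: 300), input file, output file.
render_command() {
    render_dpi=${1:-300}
    # Rasterize at twice the target density, then resample down to it.
    printf 'convert -density %d %s -resample %d %s\n' \
        "$((2 * render_dpi))" "$2" "$render_dpi" "$3"
}

render_command 300 input.pdf output.png
# prints: convert -density 600 input.pdf -resample 300 output.png
```

Printing the command rather than executing it makes the density arithmetic easy to check before any file is touched.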

But what if the PDF itself contains raster graphics? These should not be re-rasterized (especially not by upsampling), but extracted raw (e.g. with pdfimages from poppler). However, that would depend on whether the images are full-page (representative) or just embedded figures.
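One possible way to approach that distinction is to inspect the listing printed by pdfimages -list. The sketch below is a heuristic and an assumption, not the method used in the fix: it treats a page that carries exactly one embedded image as a candidate for raw extraction and reports that image's horizontal PPI. The function name single_image_pages is hypothetical, and a single image on a page is not guaranteed to be a full-page scan.

```shell
# Hypothetical helper: reads the output of `pdfimages -list` on stdin and
# prints "PAGE X-PPI" for every page that contains exactly one image.
# Heuristic only: one image per page does not prove it covers the page.
single_image_pages() {
    awk 'NR > 2 && $3 == "image" {
             count[$1]++      # number of images seen on this page
             ppi[$1] = $13    # x-ppi column of the last image seen
         }
         END {
             for (page in count)
                 if (count[page] == 1) print page, ppi[page]
         }' | sort -n
}

# Usage (requires poppler-utils):
#   pdfimages -list input.pdf | single_image_pages
# A candidate page N could then be extracted without re-rasterizing:
#   pdfimages -png -f N -l N input.pdf page
```

Pages that are not reported would fall back to rendering at the configured density.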

bertsky commented 4 years ago

Fixed by https://github.com/bertsky/workflow-configuration/commit/581332da2f478ad1b93b2d2141a515f885e5fbb1