When ocrd-import converts PDF input files, it uses the default pixel density of 72 DPI (since vector graphics have no native pixel density). This is insufficient for OCR, but the exact value required may depend on the use case. So there should at least be a parameter (e.g. --render-dpi) that sets the density to use for all vector graphics formats. One could then instruct ImageMagick (IM) to run convert -density $((2*$render_dpi)) input.pdf -resample $render_dpi output.png, i.e. render at twice the target density and resample down to it.
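A minimal sketch of how such a parameter could be turned into the ImageMagick call proposed above. The function name render_cmd and its argument order are hypothetical and not part of ocrd-import. The function only prints the command, so the arithmetic can be checked without ImageMagick installed.

```shell
# Hypothetical helper (not part of ocrd-import): print the ImageMagick
# command for a given target density, input file and output file.
render_cmd() {
    render_dpi=$1
    input=$2
    output=$3
    # Rasterize at twice the target density, then resample down to it.
    printf 'convert -density %s %s -resample %s %s\n' \
        "$((2 * render_dpi))" "$input" "$render_dpi" "$output"
}

render_cmd 300 input.pdf output.png
```

With a target of 300 DPI this prints a call that rasterizes at 600 DPI and resamples to 300 DPI. Whether a factor of 2 is the right amount of supersampling is left open here, as in the proposal itself.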
But what if the PDF itself contains raster graphics? These should not be re-rasterized (especially not by upsampling), but extracted raw (e.g. with pdfimages from poppler). Whether that is appropriate, however, depends on whether these images are full-page (i.e. representative of the page) or merely embedded figures.
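One rough way to approach the full-page question is to inspect the listing produced by pdfimages -list. The sketch below is only a heuristic under stated assumptions. It assumes the listing has two header lines followed by one row per image, with the page number in column 1 and the type in column 3. It treats a page carrying exactly one image as a candidate for raw extraction. The function name single_image_pages is hypothetical. A real check would also have to compare the image dimensions against the page size.

```shell
# Hypothetical heuristic: read `pdfimages -list` output on stdin and print
# the numbers of the pages that carry exactly one embedded image.
# Assumes two header lines, page number in column 1, type in column 3.
single_image_pages() {
    awk 'NR > 2 && $3 == "image" { count[$1]++ }
         END { for (page in count) if (count[page] == 1) print page }' |
        sort -n
}
```

It could be used as pdfimages -list input.pdf | single_image_pages. Pages not reported would then fall back to rendering at the configured density.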