fritz-hh / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

a bit off topic: pdf/a after merging #108

Closed femifrak closed 9 years ago

femifrak commented 9 years ago

I like ocrmypdf very much, but I have one concern when merging several PDFs generated by it. Each individual PDF is in PDF/A format, which Acrobat Reader indicates with a blue bar at the top.

For merging I use either

pdftk *.pdf cat output all.pdf

or

gs -dBATCH -dNOPAUSE -sPAPERSIZE=A4 -sDEVICE=pdfwrite -sOutputFile="all.pdf" $(ls *.pdf)

Is the merged PDF still in PDF/A format? Acrobat does not show the blue bar any more.

The blue bar also disappears when I change the metadata:

pdftk in.pdf dump_data output info.txt
edit info.txt
pdftk in.pdf update_info info.txt output out.pdf

Although in.pdf was PDF/A, I don't know whether out.pdf still follows the PDF/A convention.

Can someone give me a hint on that? And if it's not PDF/A, how can I transform it to PDF/A?

Thanks a lot,

Femi

jbarlow83 commented 9 years ago

No. PDFTK does not preserve PDF/A status. You can use Ghostscript with the -dPDFA (?) switch to merge and to create PDF/A. The options needed to get a PDF/A are very fussy and Ghostscript's error messages are obtuse. Use the exact same command line that ocrmypdf does, with the options in the same order.
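
For illustration, the Ghostscript route described above could look roughly like the sketch below. This is not the exact command line ocrmypdf runs; the particular flag values, the `merge_to_pdfa` helper name, and the `DRY_RUN` and `PDFA_DEF` variables are assumptions made for this example. `PDFA_def.ps` is the PDF/A definition file shipped with Ghostscript, which usually has to be adapted to point at an ICC profile before it works.

```shell
#!/bin/sh
# Sketch: merge several PDFs into one file while asking Ghostscript for PDF/A.
# Assumption: PDFA_DEF points to a locally adapted copy of Ghostscript's
# PDFA_def.ps; the default below is only a placeholder.
PDFA_DEF="${PDFA_DEF:-PDFA_def.ps}"

# merge_to_pdfa OUTPUT INPUT...   (hypothetical helper name)
# With DRY_RUN set to a non-empty value, the command is printed, not run.
merge_to_pdfa() {
    out="$1"
    shift
    ${DRY_RUN:+echo} gs -dPDFA=1 -dBATCH -dNOPAUSE \
        -sDEVICE=pdfwrite \
        -sColorConversionStrategy=RGB \
        -dPDFACompatibilityPolicy=1 \
        -sOutputFile="$out" \
        "$PDFA_DEF" "$@"
}

# Example: DRY_RUN=1 merge_to_pdfa all.pdf a.pdf b.pdf
```

Printing the command first with `DRY_RUN=1` makes it easier to compare the option order against a known-good invocation. Whether the output actually validates as PDF/A still needs to be checked with a validator, since Acrobat's blue bar only reflects what the file claims.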

I believe that to get a PDF/A with embedded metadata you can use PDFTK to add the metadata and then Ghostscript to produce the PDF/A, but you might have to write a little PostScript stub file that contains the metadata segment and merge it with the PDF/A. The development version of ocrmypdf sort of does this right now.
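
A PostScript stub of the kind mentioned above is often written with the `pdfmark` operator, which Ghostscript's pdfwrite device understands. The following is only a minimal sketch: the `write_docinfo_stub` helper name, the file name, and the sample values are hypothetical, and it is not taken from ocrmypdf's own code.

```shell
#!/bin/sh
# write_docinfo_stub FILE TITLE AUTHOR   (hypothetical helper name)
# Writes a PostScript stub that sets document info via the pdfmark operator.
# Caveat: parentheses and backslashes in TITLE/AUTHOR are not escaped here.
write_docinfo_stub() {
    cat > "$1" <<EOF
[ /Title ($2)
  /Author ($3)
  /DOCINFO pdfmark
EOF
}

# Example: write_docinfo_stub info.ps "Merged scans" "Femi"
# The resulting stub would then be given to gs together with the input PDFs.
```

Note that Ghostscript's `PDFA_def.ps` may already contain a `/DOCINFO pdfmark` entry of its own, so the two may need to be reconciled; this sketch does not attempt that.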

jbarlow83 commented 9 years ago

This problem with metadata being dropped by OCRmyPDF has been fixed in v3.0-rc2.