UB-Mannheim / ocrd_pagetopdf

OCR-D wrapper for prima-pagetopdf
Apache License 2.0

ocrd-pagetopdf

OCR-D wrapper for prima-page-to-pdf

Transforms all PAGE-XML+IMG to PDF with text layer and (optionally) polygon outlines.

(Converts original images together with text and layout annotations of all pages in the PAGE input file group to PDF. The text is rendered as an overlay.)

Requirements

Installation

Once you have installed Java, make, and Python, and have set up your virtual environment, run:

make deps # or: pip install ocrd
make install # copies into PREFIX or VIRTUAL_ENV
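
Before running the make targets, it can help to confirm that the prerequisites named above are actually on your PATH. The following is a minimal sketch and not part of this repository: the helper name `missing_tools` is hypothetical, and the executable names `java`, `make`, and `python3` are assumptions (your Python may be installed as `python`).

```shell
#!/bin/sh
# Hypothetical helper (not part of ocrd_pagetopdf): print every given
# command name that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The executable names below are assumptions; adjust them to your system.
missing="$(missing_tools java make python3)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing prerequisites:" $missing >&2
fi
```

If nothing is reported, the tools were found; this only checks for their presence, not for particular versions.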

Usage

The command-line interface conforms to OCR-D processor specifications.

Assuming you have an OCR-D workspace in your current working directory, simply do:

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word"}'

This will run the script and create one PDF file for each page, with a text layer based on the word-level annotations.

There is also an option to create an additional multipage file, here named merged.pdf, which contains all single pages in the correct order:

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "multipage":"merged"}'
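
The two invocations above differ only in the JSON passed to `-p`, so the parameter string can be assembled by a small helper. This is a rough sketch under stated assumptions: `pagetopdf_params` is a hypothetical function, not something shipped with the processor. It uses only the two parameters shown above and assumes their values contain no quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON string passed to -p.
#   $1 = textequiv_level (e.g. word)
#   $2 = optional base name for the merged multipage PDF
# Assumes the values need no JSON escaping.
pagetopdf_params() {
    if [ -n "${2:-}" ]; then
        printf '{"textequiv_level": "%s", "multipage": "%s"}' "$1" "$2"
    else
        printf '{"textequiv_level": "%s"}' "$1"
    fi
}

# Print the command instead of running it, so it can be inspected first.
# PAGE-FILEGRP and PDF-FILEGRP are placeholders for your file group names.
echo "ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '$(pagetopdf_params word merged)'"
```

Calling `pagetopdf_params word` without a second argument yields the single-page variant of the parameter string.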

FAQ