LibrePDF / OpenPDF

OpenPDF is a free Java library for creating and editing PDF files, with an LGPL and MPL open source license. OpenPDF is based on a fork of iText. We welcome contributions from other developers. Please feel free to submit pull requests and bug reports to this GitHub repository.

How useful is OpenPDF while "curating" pdf files? #1234

Open Albretch opened 1 week ago

Albretch commented 1 week ago

I work on corpora research, and for the most part PDF files (whatever "document" means in the very name of the file format) need "cleansing" with unpaper or other utilities before the data can be parsed out of them. Another problem with PDF files is that they can range from fully image-based, to HTML (containing JavaScript!), to plain text.

Most people see "documents" as a visual thing. Corpora research folks can only analyze actual text. Take, for example, the relatively complex bilingual edition of this very important public-domain text:

// __ Tractatus de signis : the semiotic of John Poinsot by John of St. Thomas, 1589-1644; Deely, John N; Powell, Ralph Austin

https://archive.org/details/tractatusdesigni00johnrich/

https://archive.org/download/tractatusdesigni00johnrich/tractatusdesigni00johnrich.pdf

~ $ date; ifl="tractatusdesigni00johnrich.pdf"; ls -l "${ifl}"; file --brief "${ifl}"; sha256sum --binary "${ifl}"; pdfinfo "${ifl}"

Mon 18 Nov 2024 03:33:51 AM CST

-rwxrwxrwx 1 user user 80891203 Nov 16 17:20 tractatusdesigni00johnrich.pdf

PDF document, version 1.5

5a55fba506e750a602057ba99ae202c26b24503b00d58dd53d6f50ea0e6722b8 *tractatusdesigni00johnrich.pdf

Producer: Recoded by LuraDocument PDF v2.16
CreationDate: Wed Mar 21 00:12:44 2007 CST
ModDate: Wed Mar 21 00:14:02 2007 CST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 628
Encrypted: no
Page size: 527 x 802 pts
Page rot: 0
File size: 80891203 bytes
Optimized: no
PDF version: 1.5

$

How can I use OpenPDF to read that file, linearize it, and extract all the text, the constituent elements (pictures, ...), and the metadata (links, styles, ...) as a sort of DAG, so that I can then analyze the data structure describing the text?
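
To make the request more concrete, here is a rough sketch of the kind of structure I have in mind. It is plain Java with no OpenPDF calls; every name in it (DocGraph, Node, Kind, sample, plainText, count) is hypothetical, and the text and font values are placeholders rather than content extracted from the file. Only the Producer and Pages values are copied from the pdfinfo output above. The style node is shared by two text nodes, which is what makes it a DAG rather than a tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DocGraph {

    /** Hypothetical node kinds; not taken from OpenPDF or any other PDF library. */
    enum Kind { DOCUMENT, PAGE, TEXT, IMAGE, STYLE }

    /** One node: a kind, free-form attributes, and ordered children. */
    static final class Node {
        final Kind kind;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();

        Node(Kind kind) { this.kind = kind; }

        Node attr(String key, String value) { attrs.put(key, value); return this; }

        Node add(Node child) { children.add(child); return this; }
    }

    /** Builds a tiny hand-written graph; text and font values are placeholders. */
    static Node sample() {
        Node style = new Node(Kind.STYLE).attr("font", "placeholder-font");
        Node latin = new Node(Kind.TEXT).attr("text", "placeholder Latin text").add(style);
        Node english = new Node(Kind.TEXT).attr("text", "placeholder English text").add(style);
        Node page = new Node(Kind.PAGE).attr("number", "1")
                .add(latin).add(english).add(new Node(Kind.IMAGE));
        return new Node(Kind.DOCUMENT)
                .attr("Producer", "Recoded by LuraDocument PDF v2.16")
                .attr("Pages", "628")
                .add(page);
    }

    /** Concatenates the text of all TEXT nodes in depth-first order, one per line. */
    static String plainText(Node root) {
        StringBuilder sb = new StringBuilder();
        collectText(root, sb);
        return sb.toString();
    }

    private static void collectText(Node n, StringBuilder sb) {
        if (n.kind == Kind.TEXT) {
            sb.append(n.attrs.getOrDefault("text", "")).append('\n');
        }
        for (Node c : n.children) {
            collectText(c, sb);
        }
    }

    /** Counts distinct nodes of a kind; shared nodes are counted once. */
    static int count(Node root, Kind kind) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        return count(root, kind, seen);
    }

    private static int count(Node n, Kind kind, Set<Node> seen) {
        if (!seen.add(n)) {
            return 0; // already visited through another parent
        }
        int total = (n.kind == kind) ? 1 : 0;
        for (Node c : n.children) {
            total += count(c, kind, seen);
        }
        return total;
    }

    public static void main(String[] args) {
        Node doc = sample();
        System.out.print(plainText(doc));
        System.out.println("distinct styles: " + count(doc, Kind.STYLE));
    }
}
```

What I am missing is the step that would populate such a graph from the actual PDF, which is the part I hope OpenPDF (or another tool) can do.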

If not with OpenPDF which utility would you suggest?

lbrtchx