OpenPDF is a free Java library for creating and editing PDF files, released under the LGPL and MPL open source licenses. OpenPDF is a fork of iText. We welcome contributions from other developers; please feel free to submit pull requests and bug reports to this GitHub repository.
How useful is OpenPDF while "curating" pdf files? #1234
I work on corpora research, and most PDF files (whatever "document" means in the very name of the file format) need "cleansing" with unpaper or other utilities before the data can be parsed out of them. Another problem with PDF files is that they range from fully image-based, to HTML-like content (even containing JavaScript!), to plain text.
Most people see "documents" as a visual thing, but corpora researchers can only analyze actual text. Take, for example, the relatively complex bilingual edition of this very important public-domain text:
// __ Tractatus de signis : the semiotic of John Poinsot by John of St. Thomas, 1589-1644; Deely, John N; Powell, Ralph Austin
https://archive.org/details/tractatusdesigni00johnrich/
https://archive.org/download/tractatusdesigni00johnrich/tractatusdesigni00johnrich.pdf

~ $ date; ifl="tractatusdesigni00johnrich.pdf"; ls -l "${ifl}"; file --brief "${ifl}"; sha256sum --binary "${ifl}"; pdfinfo "${ifl}"
Mon 18 Nov 2024 03:33:51 AM CST
-rwxrwxrwx 1 user user 80891203 Nov 16 17:20 tractatusdesigni00johnrich.pdf
PDF document, version 1.5
5a55fba506e750a602057ba99ae202c26b24503b00d58dd53d6f50ea0e6722b8 *tractatusdesigni00johnrich.pdf
Producer:       Recoded by LuraDocument PDF v2.16
CreationDate:   Wed Mar 21 00:12:44 2007 CST
ModDate:        Wed Mar 21 00:14:02 2007 CST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          628
Encrypted:      no
Page size:      527 x 802 pts
Page rot:       0
File size:      80891203 bytes
Optimized:      no
PDF version:    1.5
How can I use OpenPDF to read that file, linearize it, and extract all of its text, constitutive data (pictures, ...), and metadata (links, styles, ...) as a sort of DAG, so that I can then analyze that data structure describing the text?
If not with OpenPDF, which utility would you suggest?
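For the reading/metadata part of the question, a minimal sketch with OpenPDF might look like the following. This assumes the `com.github.librepdf:openpdf` artifact is on the classpath and the downloaded file sits in the working directory; `InspectPdf` is just an illustrative class name. OpenPDF does not build a linearized DAG view of a document for you, and since this particular file is a recoded scan, per-page text extraction will mostly come back empty unless the page images are OCRed first (e.g. with tesseract).

```java
import java.io.IOException;
import java.util.Map;

import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class InspectPdf {
    public static void main(String[] args) throws IOException {
        PdfReader reader = new PdfReader("tractatusdesigni00johnrich.pdf");
        try {
            // Document-level metadata, roughly what pdfinfo prints.
            Map<String, String> info = reader.getInfo();
            System.out.println("Producer: " + info.get("Producer"));
            System.out.println("Pages:    " + reader.getNumberOfPages());

            // Per-page text. For a scanned, image-based file like this one,
            // there is little or no embedded text, so expect empty strings
            // until the scans have been run through OCR.
            PdfTextExtractor extractor = new PdfTextExtractor(reader);
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                String text = extractor.getTextFromPage(page);
                System.out.printf("page %d: %d chars of text%n", page, text.length());
            }
        } finally {
            reader.close();
        }
    }
}
```

For pure extraction work on mixed corpora it may also be worth comparing poppler's pdftotext/pdfimages and Apache PDFBox, which cover similar ground from the command line and from Java respectively.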
lbrtchx