coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows
GNU Affero General Public License v3.0

Feature proposition: list text objects, their size and location found in a pdf file #92

Open d-ph opened 1 month ago

d-ph commented 1 month ago

Hello,

Similar to how cpdf can list images with the -image-resolution operation, would it be possible to add a cpdf operation that lists the text objects found in a PDF (most importantly, their size and location)?

The caveat is that "text that has been converted to vector outlines" would not be detected by the new operation, which is understandable.

Regards.

johnwhitington commented 3 weeks ago

There are two tasks here:

1) Parse PDF page content to locate objects on the page; and
2) Do PDF text extraction.

The first will be coming soon. The second will happen, but only for well-behaved modern PDFs. I don't want to get into the full field of PDF text extraction - it's a complex thing.
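
As a rough illustration of the first task only (this is not cpdf's implementation, which is not shown in this thread), the sketch below walks a simplified, already-decompressed page content stream and records the font name, font size, and text-line origin for each text-showing operator. The function name `list_text_objects` and the sample stream are hypothetical. It assumes literal strings without nested parentheses, ignores the graphics-state matrix (`cm`) and any scaling or rotation in `Tm`, and does not advance the position by glyph widths, so the reported coordinates are line origins in text space rather than exact page positions.

```python
import re

# Simplified tokenizer: literal strings, names, numbers, array brackets, operators.
# Assumption: no hex strings, nested parentheses, or inline images in the stream.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)|/[^\s/\[\]()<>]+|[-+]?\d*\.?\d+|\[|\]|[A-Za-z'*]+"
)

def list_text_objects(content):
    """Return (font, size, x, y, text) for each Tj/TJ operator found."""
    results = []
    operands = []            # operands collected since the last operator
    font, size = None, 0.0   # set by the Tf operator
    x = y = 0.0              # origin of the current text line (translation only)
    for tok in TOKEN.findall(content):
        if tok[0] in "(/[]+-.0123456789":
            operands.append(tok)  # strings, names, numbers, brackets
            continue
        if tok == "BT":           # begin text object: reset the line origin
            x = y = 0.0
        elif tok == "Tf":         # /Font size Tf
            font, size = operands[-2][1:], float(operands[-1])
        elif tok in ("Td", "TD"): # tx ty Td: move relative to the line origin
            x += float(operands[-2])
            y += float(operands[-1])
        elif tok == "Tm":         # a b c d e f Tm: keep only the translation e, f
            x, y = float(operands[-2]), float(operands[-1])
        elif tok in ("Tj", "TJ"): # show text; TJ arrays also carry kerning numbers
            text = "".join(s[1:-1] for s in operands if s.startswith("("))
            results.append((font, size, x, y, text))
        operands.clear()
    return results

# Hypothetical sample stream, for illustration only.
sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET"
for row in list_text_objects(sample):
    print(row)
```

For the sample stream this yields two entries, one per text-showing operator, at line origins (72, 720) and (72, 706). Turning the raw string bytes into readable text for arbitrary fonts and encodings is the second task, which is where most of the complexity lies.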

d-ph commented 3 weeks ago

Understood and fair. Thanks for the information and explanation 👍