Plain text of pdf page - Githubissues

Hi @mits87 ,

Sorry for the long delay. Busy time.

There's no supported way to extract text from a PDF file using CombinePDF.

The reason is that the PDF format doesn't require character codes to map directly to text. The text on a page is just a list of numerical values (0-255) that map to font glyphs. The letter "a" is as likely to be mapped to the value 0 as it is to the value 92.
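To make that concrete, here's a toy sketch in plain Ruby. It isn't CombinePDF code, and the two "fonts" (`FONT_A`, `FONT_B`) and the `decode` helper are made up for illustration. It only shows why the same byte values mean nothing without the table of the font that drew them:

```ruby
# Two hypothetical fonts that assign different codes to the same letters.
FONT_A = { 0 => 'a', 1 => 'b', 2 => 'c' }.freeze
FONT_B = { 92 => 'a', 7 => 'b', 200 => 'c' }.freeze

# Translate a list of glyph codes into text using one font's table.
# Codes the table doesn't know are rendered as "?".
def decode(codes, table)
  codes.map { |code| table.fetch(code, '?') }.join
end

puts decode([0, 1, 2], FONT_A)    # prints "abc"
puts decode([92, 7, 200], FONT_B) # prints "abc"
puts decode([0, 1, 2], FONT_B)    # prints "???"
```

The first two calls produce the same text from completely different numbers, and the third shows what happens when the codes are read with the wrong font's table.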

CombinePDF doesn't deconstruct the PDF to that resolution. Rather, it extracts the data maps and the fonts, but it doesn't concern itself with their content.

The content can be manually accessed and analyzed using the CombinePDF object, but it's not something that's supported out of the box.
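If you want to poke around, something like the rough sketch below might be a starting point. It assumes that parsed objects are nested Hashes and Arrays and that stream data sits under a `:raw_stream_content` key; treat both as assumptions to check against the gem's source for your version. `collect_raw_streams` is a hypothetical helper, not part of CombinePDF:

```ruby
require 'set'

# Hypothetical helper: walk a nested Hash/Array structure and collect every
# String stored under :raw_stream_content (an assumed key name).
# Already-visited containers and :Parent links are skipped to avoid cycles.
def collect_raw_streams(node, seen = Set.new.compare_by_identity, out = [])
  case node
  when Hash
    return out unless seen.add?(node)
    out << node[:raw_stream_content] if node[:raw_stream_content].is_a?(String)
    node.each do |key, value|
      collect_raw_streams(value, seen, out) unless key == :Parent
    end
  when Array
    return out unless seen.add?(node)
    node.each { |value| collect_raw_streams(value, seen, out) }
  end
  out
end

# Possible usage (needs the combine_pdf gem and a real file, so left commented):
# require 'combine_pdf'
# pdf = CombinePDF.load('file.pdf')
# pdf.pages.each { |page| p collect_raw_streams(page[:Contents]) }
```

Whatever comes back is still raw stream data. It may be compressed, and even when it isn't, the text operands are glyph codes that have to be translated through each font's own map before they read as text.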

Good luck, Bo!

boazsegev / combine_pdf

Plain text of pdf page #157