boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

Plain text of pdf page #157

Closed mits87 closed 5 years ago

mits87 commented 5 years ago

Hello,

I have a question: how can I get the plain text of a single page?

pages = CombinePDF.parse(s3_object.body.read).pages
pages.each_with_index do |page, index|
  # I would like to see the plain text of a single PDF page
end

I couldn't find any solution in the source. Thanks.

boazsegev commented 5 years ago

Hi @mits87 ,

Sorry for the long delay. Busy time.

There's no supported way to extract text from a PDF file using CombinePDF.

The reason is that the PDF format doesn't require character codes to map directly to text. A string in the page content is just a list of numerical values (0-255) that map to font glyphs. The letter a is as likely to be encoded as the value 0 as it is to be encoded as the value 92.
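To make that concrete, here is a minimal sketch in plain Ruby. It doesn't use CombinePDF at all, and the two font tables (`FONT_A`, `FONT_B`) and the `decode` helper are invented for illustration, not taken from any real PDF. It only shows that the same codes read as different text depending on which font they are drawn with:

```ruby
# Two hypothetical fonts, each with its own code -> character table.
# These tables are made up; a real PDF font defines its own mapping.
FONT_A = { 0 => "a", 1 => "b", 2 => "c" }.freeze
FONT_B = { 92 => "a", 0 => "c", 1 => "b" }.freeze

# Decode a list of character codes using the given font table.
# Codes the table doesn't know about are rendered as "?".
def decode(codes, font_map)
  codes.map { |code| font_map.fetch(code, "?") }.join
end

codes = [0, 1, 2]
puts decode(codes, FONT_A) # prints "abc"
puts decode(codes, FONT_B) # prints "cb?"
```

So reading the codes out of a page is not enough; you would also need to recover each font's mapping, and a PDF is not required to include one that leads back to text.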

CombinePDF doesn't deconstruct the PDF to that resolution. Rather, it extracts the data maps and the fonts, but it doesn't concern itself with their content.

The content can be manually accessed and analyzed using the CombinePDF object, but it's not something that's supported out of the box.
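As a rough, unsupported sketch of what "manually" could look like: the helpers below work on a page-like Hash. Both `raw_content_streams` and `tj_operands` are hypothetical names, not part of CombinePDF. The layout they expect (streams under `:Contents`, optionally wrapped in `:referenced_object`, bytes in `:raw_stream_content`, an optional `:Filter` of `:FlateDecode`) is an assumption about the parsed structure, so inspect your own page objects before relying on it:

```ruby
require "zlib"

# Return the content streams of a page-like Hash as strings.
# ASSUMPTION: streams sit under :Contents, each entry either being the
# stream itself or wrapping it in :referenced_object, with the bytes in
# :raw_stream_content and an optional :Filter of :FlateDecode.
def raw_content_streams(page)
  entries = page[:Contents]
  entries = [entries] unless entries.is_a?(Array)
  entries.compact.map do |entry|
    stream = entry[:referenced_object] || entry
    data = stream[:raw_stream_content].to_s
    filters = Array(stream[:Filter])
    # Only Flate compression is handled here; other filters pass through raw.
    filters.include?(:FlateDecode) ? Zlib::Inflate.inflate(data) : data
  end
end

# Collect the literal string operands of Tj operators in a content stream.
# These are character codes, not necessarily readable text. Simplified:
# escaped parentheses, hex strings and TJ arrays are not handled.
def tj_operands(stream)
  stream.scan(/\((.*?)\)\s*Tj/m).flatten
end
```

With a parsed file you could try something like `pages.each { |page| p raw_content_streams(page).flat_map { |s| tj_operands(s) } }`. Whatever comes back is still a list of character codes, so it will only look like text when the font happens to use a standard encoding.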

Good luck, Bo!