Belval / pdf2image

A Python module that wraps the pdftoppm utility to convert PDFs to PIL Image objects
MIT License
1.51k stars, 187 forks

convert_from_path returns non-ascii characters in some pages of a pdf #134

Open venkat-amballa opened 4 years ago

venkat-amballa commented 4 years ago

Problem: When I tried to split a PDF into individual pages, I found that the data in some of the pages is corrupted. I can see the corresponding page content clearly in the Chrome PDF viewer, but the output that convert_from_path gives for that page looks corrupted, as shown below.

Because the PDF contains some sensitive content, I can't share the complete file.

Screenshots

This is page 21 of the PDF: (image: content_page_21)

This is page 21 on its own, as output by convert_from_path: (image: garbage_page_21)
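For anyone trying to reproduce this, the following is a minimal sketch of rendering just the affected page with pdf2image. The helper names (page_filename, render_single_page), the PNG output, and the dpi value are assumptions made for illustration; the original report does not show the code that was actually run.

```python
from pathlib import Path


def page_filename(prefix: str, page_number: int) -> str:
    """Build an output file name for one rendered page (hypothetical naming scheme)."""
    return f"{prefix}_page_{page_number}.png"


def render_single_page(pdf_path: str, page_number: int, out_dir: str = ".", dpi: int = 200) -> Path:
    """Render one page of a PDF to a PNG file and return the output path.

    Assumes pdf2image and poppler are installed; the import is done lazily
    so that this module can be loaded even when they are missing.
    """
    from pdf2image import convert_from_path

    # first_page and last_page are 1-based and inclusive, so this renders one page.
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    out_path = Path(out_dir) / page_filename(Path(pdf_path).stem, page_number)
    images[0].save(out_path, "PNG")
    return out_path
```

Calling render_single_page("your_pdf.pdf", 21) would write only page 21, which makes it easier to compare against what the browser shows for the same page.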

Belval commented 4 years ago

Unfortunately, this is probably a by-product of the very lenient way pdfium (Chromium's PDF engine) handles the PDF specification, so Chrome may display a file that pdftoppm cannot render correctly. I am afraid that unless you can parse the document with pdftoppm -r 200 your_pdf.pdf out, I cannot help you.
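If it helps, the check above can also be scripted. This is only a sketch: the function names are hypothetical, and it assumes pdftoppm (from poppler) is on the PATH. It runs the same command as suggested above and returns the exit code and anything pdftoppm printed to stderr, which is usually where syntax warnings about the PDF show up.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path: str, output_prefix: str, dpi: int = 200) -> list:
    """Return the argument list for: pdftoppm -r <dpi> <pdf_path> <output_prefix>."""
    return ["pdftoppm", "-r", str(dpi), pdf_path, output_prefix]


def render_with_pdftoppm(pdf_path: str, output_prefix: str, dpi: int = 200):
    """Run pdftoppm directly, bypassing pdf2image, and report what happened.

    Returns a (returncode, stderr) tuple. Raises FileNotFoundError when
    pdftoppm is not installed or not on the PATH.
    """
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm not found on PATH; is poppler installed?")

    result = subprocess.run(
        build_pdftoppm_command(pdf_path, output_prefix, dpi),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr
```

If the images written by pdftoppm show the same garbage as convert_from_path, the problem lies in how poppler reads the PDF rather than in pdf2image itself.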