Belval / pdf2image

A Python module that wraps the pdftoppm utility to convert PDFs to PIL Image objects
MIT License

Text spacing (horizontal) decreases/changes due to pdf2image #191

Open muglikar opened 3 years ago

muglikar commented 3 years ago

Original PDF snapshot: image (1)
Converted PNG snapshot: image (2)

Code used:

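from pdf2image import convert_from_path

dpi = 300  # dots per inch
# PDF_FileName is the path to the input PDF
pages = convert_from_path(PDF_FileName, dpi=dpi)
# Save each rendered page as its own PNG file
for i, page in enumerate(pages):
    page.save('output_{}.png'.format(i), 'PNG')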

Windows 10 64-bit machine.

The horizontal text spacing decreases/changes in the PNGs produced by pdf2image, which could cause trouble when running OCR on them.

muglikar commented 3 years ago

Is this error in any way related to these errors displayed in the terminal?

Syntax Error: No display font for 'ArialUnicode'
Syntax Error: Couldn't find a font for 'MyriadPro-Regular', subst is 'Helvetica'
Syntax Error: Couldn't find a font for 'MyriadPro-Bold', subst is 'Helvetica'
Syntax Error: Couldn't find a font for 'MyriadPro-It', subst is 'Helvetica'

Belval commented 3 years ago

Hi! Sorry for the late reply.

Can you try to convert your PDF using pdftoppm? If the issue is there as well then I am afraid I can't fix it on my side, as it is an issue with how poppler renders the PDF.
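For anyone who wants to run that check from Python, here is a minimal sketch that calls pdftoppm directly, bypassing pdf2image. It assumes poppler is installed and pdftoppm is on the PATH. The helper names (`build_pdftoppm_command`, `render_with_pdftoppm`) and the file names in the usage note are made up for illustration and are not part of pdf2image.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the argument list for a direct pdftoppm call (hypothetical helper)."""
    # -r sets the resolution in DPI, -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def render_with_pdftoppm(pdf_path, output_prefix, dpi=300):
    """Run pdftoppm without pdf2image; return (PNG paths, stderr text)."""
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm not found on PATH; install poppler first")
    cmd = build_pdftoppm_command(pdf_path, output_prefix, dpi)
    # Capture stderr so any font substitution messages can be inspected.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    prefix = Path(output_prefix)
    # pdftoppm names its output files <prefix>-<page number>.png.
    pages = sorted(prefix.parent.glob(prefix.name + "-*.png"))
    return pages, result.stderr
```

Calling `render_with_pdftoppm("input.pdf", "direct_output")` and comparing the resulting PNGs with the ones saved through pdf2image should show whether the spacing change already appears in poppler's own output.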

muglikar commented 3 years ago

Hi @Belval, will try and let you know.

neerajhbhat commented 3 years ago

Hi @muglikar, were you able to solve this?

vcjayan commented 2 years ago

Hi @muglikar, could you post an update if you resolved this issue? I am facing a similar issue.