Belval / pdf2image

A Python module that wraps the pdftoppm utility to convert a PDF to PIL Image objects
MIT License

Some missing words from converting PDF to Image #282

Open jason-ng-zq99 opened 5 months ago

jason-ng-zq99 commented 5 months ago

Hi, I am currently encountering the titled issue when using the convert_from_bytes function.

On my Mac, this happens specifically if I open a fillable PDF and fill it in using Preview. Text filled in this way does not get converted.

[Screenshot 2024-04-08 at 20 58 08] [Screenshot 2024-04-08 at 21 00 40]

If I use strict=True, and also when I run the pdftoppm -r 200 -jpeg sample_pdf.pdf out command in my terminal, I get the following error messages:

Syntax Error: Unknown font tag 'ArialMT'
Syntax Error: Unknown font tag 'ArialMT'
Syntax Error (69): No font in show
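For anyone trying to reproduce this, here is a minimal sketch of how the font tags could be pulled out of that stderr output automatically. It uses only the pdftoppm flags from the command above; the helper names unknown_font_tags and run_pdftoppm are made up for illustration and are not part of pdf2image.

```python
import re
import subprocess

# Matches poppler messages of the form: Syntax Error: Unknown font tag 'ArialMT'
FONT_TAG_RE = re.compile(r"Unknown font tag '([^']+)'")


def unknown_font_tags(stderr_text):
    """Return the distinct font tags reported as unknown, in first-seen order."""
    seen = []
    for tag in FONT_TAG_RE.findall(stderr_text):
        if tag not in seen:
            seen.append(tag)
    return seen


def run_pdftoppm(pdf_path, out_prefix="out"):
    """Hypothetical helper: run the same command as above and report unknown font tags."""
    proc = subprocess.run(
        ["pdftoppm", "-r", "200", "-jpeg", pdf_path, out_prefix],
        capture_output=True,
        text=True,
        check=False,  # inspect stderr regardless of the exit status
    )
    return unknown_font_tags(proc.stderr)


# The error output quoted in this issue
sample = """Syntax Error: Unknown font tag 'ArialMT'
Syntax Error: Unknown font tag 'ArialMT'
Syntax Error (69): No font in show"""

print(unknown_font_tags(sample))  # ['ArialMT']
```

Calling run_pdftoppm("sample_pdf.pdf") would need pdftoppm on the PATH; the parsing function on its own only needs the captured text.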

I have also gotten Unknown font tag 'Helvetica' on other files.

I have also verified that these fonts are present on my system using the fc-match ArialMT command, which returns the matched font, in this case Verdana.ttf: "Verdana" "Regular"
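One caveat on this check: fc-match returns the closest font it can find rather than reporting that a font is missing, so a Verdana result for ArialMT may indicate a substitution. A rough sketch of how that could be checked follows. The helper names are hypothetical, and the name comparison is only a heuristic that assumes the family name is a prefix of the requested PostScript-style name.

```python
import subprocess


def normalize(name):
    """Lowercase a font name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def looks_like_fallback(requested, matched_family):
    """Heuristic: treat the match as a substitute unless the matched family
    name is a prefix of the requested name (e.g. 'Arial' for 'ArialMT')."""
    family = normalize(matched_family.split(",")[0])  # first family if several are listed
    if not family:
        return True
    return not normalize(requested).startswith(family)


def matched_family(requested):
    """Hypothetical helper: ask fontconfig which family it resolves the name to."""
    proc = subprocess.run(
        ["fc-match", "-f", "%{family}", requested],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.strip()


print(looks_like_fallback("ArialMT", "Verdana"))  # True
print(looks_like_fallback("ArialMT", "Arial"))    # False
```

With fontconfig installed, looks_like_fallback("ArialMT", matched_family("ArialMT")) would apply the same check to the live system.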

Interestingly, text that is filled in via the textbox function is still converted, as seen below: [Screenshot 2024-04-08 at 21 03 48] [Screenshot 2024-04-08 at 21 03 59]

This problem was first found in my Debian GNU/Linux 11 Docker container, which shows the exact same behavior.

I have also already tried installing fonts such as fonts-freefont-ttf, fonts-liberation, fonts-liberation2, and ttf-mscorefonts-installer, but the same issue persists.

P.S. Suspecting it might be an issue with editable fields, I also tried flattening the PDF with fillpdf before using convert_from_path, but the same issue remains.

Problem replicated on two systems:

Thanks in advance!