Belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
MIT License

Linux and Windows output is different #221

Open casper-hansen opened 2 years ago

casper-hansen commented 2 years ago

Hi @Belval

I developed an application on Windows, and now I want to deploy it on Linux. Unfortunately, no matter how I install poppler and pdf2image, I cannot get the same results across operating systems, and the Linux output is somehow worse for OCR.

For instance, on Windows I can capture the name "Jan Andersen" by converting a PDF to PNG and running OCR. On Linux, the output is instead "J an Andersen". If I save the PNG on Windows but run the OCR on Linux, I get the correct result "Jan Andersen". That narrows the problem down to the conversion stage.
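One way to confirm that the rendered pages really differ between the two machines is to hash the decoded pixels rather than the PNG files, since PNG bytes can differ through compression settings alone. The following is a minimal sketch, not part of pdf2image: `pixel_fingerprint` and `fingerprint_pdf` are hypothetical helper names, and the 200 DPI default is only an assumption about the settings in use.

```python
import hashlib


def pixel_fingerprint(mode: str, size: tuple[int, int], data: bytes) -> str:
    """Hash raw pixel data together with the image mode and dimensions."""
    h = hashlib.sha256()
    # Include mode and size so that identical bytes with a different
    # layout (e.g. 2x1 vs 1x2) do not produce the same fingerprint.
    h.update(f"{mode}:{size[0]}x{size[1]}:".encode("ascii"))
    h.update(data)
    return h.hexdigest()


def fingerprint_pdf(pdf_path: str, dpi: int = 200) -> list[str]:
    """Return one fingerprint per rendered page (hypothetical helper)."""
    # Imported lazily so the hashing helper above works without pdf2image.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)
    return [pixel_fingerprint(p.mode, p.size, p.tobytes()) for p in pages]
```

Running `fingerprint_pdf` on the same file on Windows and on Linux and comparing the two lists shows which pages, if any, are rendered differently.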

What would you recommend to produce exactly the same results on both systems, so that I can improve the accuracy on Linux?

Do I just have to accept that there are differences?

Solutions that I tried

  1. I have tried out #144 but to no avail. I have also made sure that the versions of poppler are exactly the same across operating systems by installing all my packages through conda.
  2. I have tried ImageMagick and Ghostscript, but the conversion does not work well and the output is in a data type that OpenCV cannot work with.
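
To double-check that both systems really run the same Poppler build, the version banner of the `pdftoppm` binary can be read directly. This is a rough sketch under a couple of assumptions: `parse_poppler_version` and `poppler_version` are hypothetical helper names, and the banner is assumed to contain the word "version" followed by a dotted number, which is typical but may vary between builds.

```python
import re
import subprocess


def parse_poppler_version(text: str):
    """Extract a dotted version number such as '22.02.0' from a banner."""
    match = re.search(r"version\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def poppler_version(binary: str = "pdftoppm"):
    """Ask the given pdftoppm binary for its version (hypothetical helper)."""
    # The banner is usually written to stderr; stdout is checked as well
    # in case a particular build behaves differently.
    result = subprocess.run([binary, "-v"], capture_output=True, text=True)
    return parse_poppler_version(result.stderr + result.stdout)
```

If several Poppler installations exist on one machine, passing the full path of the binary here, and the matching directory as `poppler_path` to `convert_from_path`, helps make sure the version being checked is the one pdf2image actually calls.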

Belval commented 2 years ago

Frankly, this is a very puzzling issue. No, as far as I know there should not be discrepancies between Windows and Linux, provided that you use the same Poppler version. Did you ever find the root cause?

casper-hansen commented 2 years ago

> Frankly, this is a very puzzling issue. No, as far as I know there should not be discrepancies between Windows and Linux, provided that you use the same Poppler version. Did you ever find the root cause?

No, I never found the root cause. I assume this is a Poppler issue and not a pdf2image issue.

paul-tharun commented 2 years ago

@casperbh96 this may be caused by a difference in the fonts installed on Windows and Linux. #201 looks similar to this.
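
If fonts are the cause, the PDF probably references fonts that are not embedded, so each system substitutes whatever it has installed. Poppler ships a `pdffonts` utility that lists the fonts in a file along with an `emb` column. The sketch below parses that table; `parse_pdffonts` and `non_embedded_fonts` are hypothetical helper names, and the parser assumes the usual column layout (name, type, encoding, emb, sub, uni, object ID) and font names without spaces.

```python
import subprocess


def parse_pdffonts(output: str) -> list[dict]:
    """Parse the table printed by `pdffonts` into name/embedded records."""
    fonts = []
    # The first two lines are the column header and a row of dashes.
    for line in output.splitlines()[2:]:
        tokens = line.split()
        if len(tokens) < 8:
            continue  # blank or unexpected line
        # Counting from the right avoids the multi-word "type" column:
        # the last two tokens are the object ID, preceded by uni, sub, emb.
        fonts.append({"name": tokens[0], "embedded": tokens[-5] == "yes"})
    return fonts


def non_embedded_fonts(pdf_path: str) -> list[str]:
    """Return names of fonts the PDF references but does not embed."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return [f["name"] for f in parse_pdffonts(result.stdout) if not f["embedded"]]
```

Any name returned by `non_embedded_fonts` is a font that the renderer has to look up on the local system, which is where Windows and Linux could diverge.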