aws-samples / amazon-textract-searchable-pdf

Generate searchable pdf documents from scanned documents with Amazon Textract

Original PDF object is being altered beyond adding an OCR layer. #13

Open DarkPhyber-hg opened 4 months ago

DarkPhyber-hg commented 4 months ago

I have not run the code; I only looked at the sample input and output files linked from the README on the front page of this repository.

Why does the file go from 82k to 643k? Adding an OCR layer should not make the file almost eight times larger!

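Taking a closer look, the embedded image itself is being altered, which I feel is unacceptable. All that should happen when creating a searchable PDF is that a transparent text layer is added to the original PDF. The pdfimages listings below show that the page image has instead been resampled from 1267x793 to 5279x3304 and re-encoded as JPEG, and that the original smask is gone.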

root@debian-test:~# pdfimages -list SampleInput.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1267   793  rgb     3   8  image  no         6  0    72    72 80.3K 2.7%
   1     1 smask    1267   793  gray    1   8  image  no         6  0    72    72  996B 0.1%
root@debian-test:~# pdfimages -list SampleOutput.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    5279  3304  rgb     3   8  jpeg   no         6  0    72    72  642K 1.3%
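
For anyone who wants to repeat this check on their own files, here is a rough Python sketch that compares two `pdfimages -list` listings. The helper names (`parse_pdfimages_list`, `changed_images`) are my own and not part of this repository, and the parser assumes the 16-column layout shown above, which may differ between poppler versions.

```python
from typing import List, NamedTuple, Tuple

# The two listings from this issue, pasted verbatim.
INPUT_LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1267   793  rgb     3   8  image  no         6  0    72    72 80.3K 2.7%
   1     1 smask    1267   793  gray    1   8  image  no         6  0    72    72  996B 0.1%
"""

OUTPUT_LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    5279  3304  rgb     3   8  jpeg   no         6  0    72    72  642K 1.3%
"""


class ImageRow(NamedTuple):
    page: int
    kind: str    # "image", "smask", ...
    width: int
    height: int
    enc: str     # "image" (raw), "jpeg", ...
    size: str    # human-readable size as printed, e.g. "80.3K"


def parse_pdfimages_list(text: str) -> List[ImageRow]:
    """Parse data rows of a `pdfimages -list` listing (assumed 16 columns)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and separator: data rows start with a page number.
        if len(parts) != 16 or not parts[0].isdigit():
            continue
        rows.append(ImageRow(int(parts[0]), parts[2], int(parts[3]),
                             int(parts[4]), parts[8], parts[14]))
    return rows


def changed_images(before: List[ImageRow],
                   after: List[ImageRow]) -> List[Tuple[ImageRow, ImageRow]]:
    """Pair up the main images in order and keep pairs that differ."""
    b = [r for r in before if r.kind == "image"]
    a = [r for r in after if r.kind == "image"]
    return [(x, y) for x, y in zip(b, a)
            if (x.width, x.height, x.enc) != (y.width, y.height, y.enc)]


if __name__ == "__main__":
    before = parse_pdfimages_list(INPUT_LISTING)
    after = parse_pdfimages_list(OUTPUT_LISTING)
    for old, new in changed_images(before, after):
        scale = new.width / old.width
        print(f"page {old.page}: {old.width}x{old.height} ({old.enc}, {old.size}) -> "
              f"{new.width}x{new.height} ({new.enc}, {new.size}), scale {scale:.2f}x")
    lost_masks = sum(r.kind == "smask" for r in before) - sum(r.kind == "smask" for r in after)
    print(f"smask entries lost: {lost_masks}")
```

If only a text layer had been added, I would expect `changed_images` to return an empty list and the smask count to stay the same.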