OCR makes B&W PDF files too big

cyanfish / naps2

Scan documents to PDF and more, as simply as possible.

https://www.naps2.com



OCR makes B&W PDF files too big #347

Open NextTherapist opened 3 months ago

NextTherapist commented 3 months ago

Describe the bug OCR makes B&W PDF files incomprehensibly big.

To Reproduce Steps to reproduce the behavior:

1. Scan a single-sided DIN A4 sheet in B&W: a typical letter with a few lines of 12pt black text. Then save it as PDF/A-2b without running OCR first. You get a nice small file of about 20-30 KB.
2. Take the same scan, activate OCR (German) and save it again as PDF/A-2b. Now you get a file of about 70 KB, although the recognized text accounts for only about 3-4 KB (difference tested with other OCR software).

So something in these OCRed files seems to be wrong. Perhaps after OCR the image is no longer saved with CCITT compression but as grayscale? I have no way to verify that.
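One rough way to test the CCITT-versus-grayscale guess is to scan the raw PDF bytes for stream filter names and `/BitsPerComponent` values. The sketch below is only a heuristic, not something NAPS2 provides: the function name `image_encoding_hints` is made up for illustration, and it assumes the file is not encrypted. Note that `FlateDecode` is also used for non-image streams, so only the other filter names and the bit depth say much about the scanned image.

```python
import re
import sys

# Stream filter names defined by the PDF specification.
FILTERS = ("CCITTFaxDecode", "JBIG2Decode", "DCTDecode", "JPXDecode", "FlateDecode")


def image_encoding_hints(pdf_bytes: bytes) -> dict:
    """Count filter names and collect /BitsPerComponent values in raw PDF bytes.

    Hypothetical helper: a plain byte scan, not a real PDF parser.
    """
    filters = {name: pdf_bytes.count(b"/" + name.encode("ascii")) for name in FILTERS}
    bits = sorted({int(m) for m in re.findall(rb"/BitsPerComponent\s+(\d+)", pdf_bytes)})
    return {"filters": filters, "bits_per_component": bits}


if __name__ == "__main__":
    # Usage: python check_image.py file1.pdf file2.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, image_encoding_hints(fh.read()))
```

A bilevel scan would be expected to show `CCITTFaxDecode` or `JBIG2Decode` with a bit depth of 1, while a grayscale image would typically show a bit depth of 8. Running it on the file saved with OCR and the one saved without should show whether the image encoding differs at all.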

Expected behavior The file produced in step 2 should be at most about 34 KB, not 70 KB.

Desktop (please complete the following information):

OS: Windows 10
Version: 7.4.0 32 bit

cyanfish commented 3 months ago

The extra size is from embedding the font used to render the text, which is required by the PDF-A standard.

NextTherapist commented 3 months ago

I made some tests:

NAPS2 embeds a font whenever the file contains OCR text, regardless of the PDF version! Files with OCR contain a subset of Times New Roman; files without OCR do not, and neither do PDF/A-2b files without OCR.

It should not be necessary to embed a font just because there is an invisible OCR text layer in the file.

And of course it would not be necessary to embed a font just because the file is PDF/A. As long as the file content is only a raster image, no font is needed to display the content faithfully, so there is no reason to embed one. But as I said, NAPS2 gets this right: there is no font in the scanned, OCR-free PDF/A; the font comes from OCR.
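A font check like the one described above can be approximated without a PDF editor by searching the raw bytes for `/BaseFont` names and `/FontFile`, `/FontFile2` or `/FontFile3` references, which are the keys the PDF format uses for embedded font programs. This is a minimal sketch under stated assumptions: `embedded_font_report` is a hypothetical helper, and the scan finds nothing if the font dictionaries sit inside compressed object streams (possible from PDF 1.5 on) or if the file is encrypted.

```python
import re
import sys


def embedded_font_report(pdf_bytes: bytes) -> dict:
    """Heuristic byte scan for font names and embedded font program references.

    Hypothetical helper: it does not parse the PDF object structure.
    """
    names = re.findall(rb"/BaseFont\s*/([^\s/<>\[\]()]+)", pdf_bytes)
    base_fonts = sorted({name.decode("latin-1") for name in names})
    # Matches /FontFile, /FontFile2 and /FontFile3, but not longer names.
    font_file_refs = len(re.findall(rb"/FontFile[23]?(?![0-9A-Za-z])", pdf_bytes))
    return {"base_fonts": base_fonts, "font_file_refs": font_file_refs}


if __name__ == "__main__":
    # Usage: python check_fonts.py file1.pdf file2.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, embedded_font_report(fh.read()))
```

A subset font usually shows up with a six-letter prefix and a plus sign in front of the font name. An empty result does not prove that no font is embedded, for the reasons given above.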

cyanfish commented 3 months ago

Some OCR software uses a "fake" font instead of embedding a real font, but (a) that means the character measurements are off, which can cause alignment issues, and (b) that can cause various compatibility problems.

In theory it could be possible to provide an option to use a fake font like that, but I'm probably not going to do that.

NextTherapist commented 3 months ago

I have now compared the OCR results of NAPS2 and PDF24, since both are based on Tesseract.

NAPS2 with OCR.pdf PDF24 with OCR.pdf

The PDF24 file is 65 KB smaller, and its alignment does not seem any less accurate to me. It has "GlyphLessFont" embedded, which is perhaps what you meant.

NextTherapist commented 2 months ago

Perhaps the file sizes are bigger because NAPS2 uses PDFium for PDF generation instead of Ghostscript?

NextTherapist commented 2 months ago

I did a test with the "NAPS2 with OCR.pdf" file from above and optimized it with PDF XChange Editor, which mainly means that it removed fonts. The result is a file of only 156 KB, very similar in size to the PDF24 file. NAPS2.with.OCR_Optimized_A2b.pdf

I wanted to test whether an embedded font is necessary at all for OCR, and apparently it is: one font is still embedded. It is called "Untitled Truetype (CID) Identity-H", and the precision of the OCR positions seems to be fine.

It would be great if NAPS2 could make such a small file by itself.