UB-Mannheim / zotero-ocr

Zotero Plugin for OCR
GNU Affero General Public License v3.0

PDFs very large, compressing in Preview removes OCR'd text #33

Closed by e42mercury 2 years ago

e42mercury commented 3 years ago

I love this plug-in, but I am running into two issues:

  1. The OCR'd PDF is 10–20 times larger than the original: I just processed a 14 MB file and the resulting PDF was 200 MB.
  2. When I compress this file in Mac's Preview, the text is no longer searchable. It gets turned into this: "􏰀􏰀􏰀􏰀􏰀"

When I compress the file with Acrobat's online tool, I can compress it without losing the text, but overall this makes the plugin much less useful. The issue seems to be related to the tools the plugin relies on, but I couldn't find an easy solution when I tried googling poppler and tesseract.

Suggestions? thanks for the plugin!

zuphilip commented 2 years ago

Sorry for the delay, @e42mercury. The plugin extracts each page of the PDF as an image (with the help of pdftoppm) and then saves the resulting text together with these images in a new PDF. This can increase the file size, although your example (14 MB to 200 MB) seems like quite a lot. At the moment I don't see how to change that, because we want good images extracted for the OCR quality.

Maybe you can delete the large PDFs and just use the extracted text in the note?
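
To make the page-to-image step described above concrete, here is a minimal Python sketch. It only builds the command lines and does not run them. The function names, the file names, and the 300 DPI default are assumptions for illustration, and the plugin's actual invocation may differ. The last helper estimates the uncompressed size of one rendered page, which hints at why the output grows.

```python
def pdftoppm_command(pdf_path: str, prefix: str, dpi: int = 300) -> list[str]:
    """Build a pdftoppm call that renders every page of pdf_path to PNG.

    Output files are named <prefix>-<n>.png (the number may be zero-padded).
    The 300 DPI default is an assumption, not the plugin's documented setting.
    """
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, prefix]


def tesseract_command(image_path: str, out_base: str, lang: str = "eng") -> list[str]:
    """Build a tesseract call that writes <out_base>.pdf for one page image.

    The trailing "pdf" config asks Tesseract for a PDF containing the page
    image plus an invisible text layer.
    """
    return ["tesseract", image_path, out_base, "-l", lang, "pdf"]


def raw_page_bytes(width_in: float, height_in: float, dpi: int, channels: int = 3) -> int:
    """Uncompressed size in bytes of one rendered page at 8 bits per channel."""
    return round(width_in * dpi) * round(height_in * dpi) * channels


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the command shape.
    print(" ".join(pdftoppm_command("scan.pdf", "page")))
    print(" ".join(tesseract_command("page-1.png", "page-1")))
    # A US Letter page (8.5 x 11 in) rendered in RGB at 300 DPI.
    print(raw_page_bytes(8.5, 11, 300))
```

A US Letter page rendered this way is roughly 25 MB before compression, so a document whose original scans were stored at lower resolution or with stronger compression can grow considerably once every page is re-rendered.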

e42mercury commented 2 years ago

No worries, that's a good suggestion. In the meantime, I have started using the command line program OCRmyPDF. This is a really nice, full-featured OCR program, which is also based on tesseract. Personally, I think the most useful thing for Zotero would be a plugin that runs OCRmyPDF – for something like that all that would be necessary is a dialogue window for all its many options.
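
As a rough illustration of what such a wrapper would have to do, the following Python sketch turns a few dialog-style settings into an OCRmyPDF command line. The function name, its parameters, and the file names are hypothetical. Only a handful of OCRmyPDF options are covered (`-l`, `--optimize`, `--skip-text`, `--deskew`), and the command is built but not executed.

```python
from typing import Sequence


def ocrmypdf_command(
    src: str,
    dst: str,
    languages: Sequence[str] = ("eng",),
    optimize: int = 1,
    skip_text: bool = False,
    deskew: bool = False,
) -> list[str]:
    """Build an ocrmypdf call from a few dialog-style options (hypothetical helper)."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be between 0 and 3")
    # Multiple Tesseract languages are joined with "+", for example "eng+deu".
    cmd = ["ocrmypdf", "-l", "+".join(languages), "--optimize", str(optimize)]
    if skip_text:
        cmd.append("--skip-text")  # leave pages that already contain text untouched
    if deskew:
        cmd.append("--deskew")  # straighten crooked scans before OCR
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run to execute it.
    print(" ".join(ocrmypdf_command("scan.pdf", "scan_ocr.pdf", languages=("eng", "deu"))))
```

Returning the arguments as a list rather than a single string avoids shell quoting problems with attachment paths that contain spaces.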

zuphilip commented 2 years ago

Yeah, OCRmyPDF is also an interesting tool. However, I think the prerequisites for running it on your local machine are even larger than here (Python and Ghostscript also need to be installed). Thus, I don't want to change that here in the project, but if you would like to work on that, feel free to do so in a fork or by reusing parts of this plugin.
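
A fork could check for these prerequisites before offering the option. The Python sketch below is one way to do that, assuming the tools are expected on the PATH. The helper name is hypothetical, and the executable names are assumptions too: Ghostscript is usually `gs` on Linux and macOS but is named differently on Windows.

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on the PATH (hypothetical helper).

    The lookup function is injectable so the check can be tested without
    the real tools installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust per platform.
    print(missing_tools(["ocrmypdf", "tesseract", "gs"]))
```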