Running OCR on embedded images of PDF using Poppler pdfimages or ImageMapping instead of whole pdf pages converted to png?

UB-Mannheim / zotero-ocr

Zotero Plugin for OCR

GNU Affero General Public License v3.0

Issue #84

Open: T-Dane opened this issue 3 weeks ago

T-Dane commented 3 weeks ago

Requesting a version of PDF OCR that runs Tesseract only on the images embedded in a PDF, instead of rendering and OCRing each whole page.

A lot of my professors use PowerPoint slides converted to PDF. The slide text is already real text, but the screen grabs they include have no text layer and could benefit from OCR.

I believe this could save time for others as well, since many PDF documents are not purely images but a combination of text and images.
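A minimal sketch of what the request could look like outside the plugin, assuming the Poppler `pdfimages` and `tesseract` command-line tools are installed and on `PATH`. It is not how zotero-ocr works today. The function names, the `img` file prefix, and the 200-pixel size threshold are hypothetical choices for illustration; the only external calls used are `pdfimages -list`, `pdfimages -png`, and `tesseract <image> stdout -l <lang>`.

```python
import subprocess
from pathlib import Path


def parse_pdfimages_list(listing: str) -> list[dict]:
    """Parse the table printed by `pdfimages -list` into one dict per image."""
    images = []
    for line in listing.splitlines()[2:]:  # skip the header row and the dashed rule
        fields = line.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        images.append({
            "page": int(fields[0]),
            "num": int(fields[1]),
            "type": fields[2],
            "width": int(fields[3]),
            "height": int(fields[4]),
        })
    return images


def large_images(images: list[dict], min_side: int = 200) -> list[dict]:
    """Keep images big enough to plausibly contain readable text.

    min_side is an arbitrary illustrative threshold, not a tuned value.
    """
    return [i for i in images if min(i["width"], i["height"]) >= min_side]


def list_embedded_images(pdf: Path) -> list[dict]:
    """Ask pdfimages which images the PDF embeds, without extracting them."""
    out = subprocess.run(["pdfimages", "-list", str(pdf)],
                         check=True, capture_output=True, text=True).stdout
    return parse_pdfimages_list(out)


def ocr_embedded_images(pdf: Path, workdir: Path, lang: str = "eng") -> dict[str, str]:
    """Extract every embedded image as PNG and run Tesseract on each file."""
    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "img"  # pdfimages writes img-000.png, img-001.png, ...
    subprocess.run(["pdfimages", "-png", str(pdf), str(prefix)], check=True)
    texts = {}
    for png in sorted(workdir.glob("img-*.png")):
        result = subprocess.run(["tesseract", str(png), "stdout", "-l", lang],
                                check=True, capture_output=True, text=True)
        texts[png.name] = result.stdout
    return texts
```

Checking `large_images(list_embedded_images(pdf))` first would let a caller skip PDFs that contain only icons or no images at all, which is where the time saving would come from.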

aborel commented 3 weeks ago

Interesting idea, but inserting the OCRed text back into the existing text layer for hybrid pages might be challenging. I'm not familiar with ImageMapping; can you provide a link?

T-Dane commented 2 weeks ago

I completely agree it would be challenging, but it would make for an AMAZING feature! This is the ImageMapping I meant: https://poppler.freedesktop.org/api/glib/poppler-Poppler-Page.html#PopplerImageMapping-struct

Or maybe this: https://world.pages.gitlab.gnome.org/Rust/poppler-rs/stable/0.24/docs/poppler/struct.ImageMapping.html
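To illustrate why an image mapping could help with the hybrid-page problem: `PopplerImageMapping` pairs an image with an `area` rectangle on the page, so a word box that Tesseract reports in image pixels could be scaled into page coordinates before being written to a text layer. The sketch below is only that scaling step in plain Python. The `Rect` class and `image_box_to_page` function are hypothetical, and it assumes an unrotated image whose rectangle uses the same axis orientation as the image pixels, which would need to be checked against Poppler's actual coordinate convention.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle given by two opposite corners."""
    x1: float
    y1: float
    x2: float
    y2: float


def image_box_to_page(box: Rect, image_width: int, image_height: int, area: Rect) -> Rect:
    """Scale a box from image pixel coordinates into the page area the image occupies.

    Assumes no rotation and that both coordinate systems share an orientation.
    """
    sx = (area.x2 - area.x1) / image_width   # page units per image pixel, horizontal
    sy = (area.y2 - area.y1) / image_height  # page units per image pixel, vertical
    return Rect(
        area.x1 + box.x1 * sx,
        area.y1 + box.y1 * sy,
        area.x1 + box.x2 * sx,
        area.y1 + box.y2 * sy,
    )
```

The existing text on the page would stay untouched; only words located inside image areas would be added.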

aborel commented 2 weeks ago

Thanks. We're currently looking into reducing the dependencies on external programs, so I'm not sure we'll use your suggestion, but we'll keep this in mind.