Open T-Dane opened 3 weeks ago
Interesting idea, but inserting the OCRed text back into the existing text layer for hybrid pages might be challenging. I'm not familiar with ImageMapping. Can you provide a link?
I completely agree it would be challenging, but it would make for an AMAZING feature! This is what I meant: https://poppler.freedesktop.org/api/glib/poppler-Poppler-Page.html#PopplerImageMapping-struct
Or maybe this: https://world.pages.gitlab.gnome.org/Rust/poppler-rs/stable/0.24/docs/poppler/struct.ImageMapping.html
Thanks. We're currently looking into reducing the dependencies on external programs, so I'm not sure we'll use your suggestion, but we'll keep this in mind.
Requesting a version of PDF OCR that runs Tesseract only on the images embedded in a PDF, instead of capturing the whole page.
A lot of my professors use PowerPoint slides converted to PDF. The slide text is already real text, while the screen grabs they include have no text layer and could benefit from OCR.
I believe this could save time for others as well, since many PDF documents are not purely images but a combination of text and images.
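To make the request a bit more concrete, here is a minimal sketch of the selection step only: given the rectangles of embedded images on a page and the boxes of words already in the text layer, pick the images that still need OCR. Everything in it is an assumption for illustration. `Rect`, `regions_needing_ocr`, and `max_text_coverage` are hypothetical names, not part of this project or of Poppler. The image rectangles would have to come from something like the `area` field of Poppler's image mapping, and cropping plus the actual Tesseract call are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in page coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def intersection_area(self, other: "Rect") -> float:
        w = min(self.x2, other.x2) - max(self.x1, other.x1)
        h = min(self.y2, other.y2) - max(self.y1, other.y1)
        return w * h if w > 0 and h > 0 else 0.0


def regions_needing_ocr(image_areas, text_boxes, max_text_coverage=0.05):
    """Return the image areas that the existing text layer barely covers.

    Coverage is approximated by summing word-box overlaps, so overlapping
    word boxes are counted twice; good enough for a rough filter.
    """
    selected = []
    for img in image_areas:
        if img.area() == 0:
            continue  # degenerate image rectangle, nothing to OCR
        covered = sum(img.intersection_area(t) for t in text_boxes)
        if covered / img.area() <= max_text_coverage:
            selected.append(img)
    return selected
```

With one screen grab that has no text over it and one image that already has real text on top, only the first would be sent to OCR:

```python
screenshot = Rect(50, 100, 350, 300)
captioned = Rect(50, 400, 350, 500)
words = [Rect(60, 410, 340, 440), Rect(60, 450, 340, 480)]
todo = regions_needing_ocr([screenshot, captioned], words)
```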