ciur / papermerge

Open Source Document Management System for Digital Archives (Scanned Documents)
https://papermerge.com
Apache License 2.0
2.55k stars, 267 forks

Automatic Blank Page Deletion. #405

Open ciur opened 3 years ago

ciur commented 3 years ago

Post from reddit:

Hello,

my scanner doesn't have the best blank page detection.
I was wondering if anyone has an idea for automatically deleting blank pages in Papermerge, or through
a script before they get imported?

This brings up a common scenario. I created this issue to keep track of this valid use case and to implement it in a future release.

browntownington commented 3 years ago

I'm very excited for this future feature.

I spent hours looking for Python libraries and other frameworks that can do this. I couldn't really find anything.

Some people were saying to look at the amount of whitespace or colour on the page and, if the non-white part is less than 1-2% or so, to consider it blank. Others were saying that if the page's file size is low (a few KB), consider it blank.
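A minimal sketch of the "less than 1-2% colour" heuristic mentioned above, in plain Python. It assumes the page has already been rendered to a flat sequence of 8-bit grayscale values (0 = black, 255 = white); loading the image is left out, and the names `ink_ratio`, `looks_blank`, `dark_threshold` and `max_ink` are hypothetical, chosen only for illustration.

```python
from typing import Iterable


def ink_ratio(pixels: Iterable[int], dark_threshold: int = 128) -> float:
    """Return the fraction of pixels darker than dark_threshold.

    pixels is assumed to hold 8-bit grayscale values (0 = black, 255 = white).
    """
    total = 0
    dark = 0
    for value in pixels:
        total += 1
        if value < dark_threshold:
            dark += 1
    # An empty page image has no ink by definition.
    return dark / total if total else 0.0


def looks_blank(pixels: Iterable[int], max_ink: float = 0.02) -> bool:
    """Treat the page as blank when less than max_ink (2% here) is dark."""
    return ink_ratio(pixels) < max_ink
```

Both thresholds are guesses that would need tuning per scanner, since scan noise, punch holes and paper shadows all add dark pixels to an otherwise empty page.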

The day when I can scan in all my old files from the last 20 years and separate them with a blank page and automate filing will be a great day for my organization and a great day for recycling ;)

ciur commented 3 years ago

@browntownington

Some people were saying to look at the amount of whitespace or colour on the page and, if the non-white part is less than 1-2% or so, to consider it blank.

Actually, it is way simpler than that. The trick is to detect blank pages not before OCR, but AFTER! The reason is that when OCRing a blank page, the output text string will be empty, i.e. there is no text on a blank page :) The way I plan to "automatically delete blank pages" is simply to delete pages with successfully completed OCR but with no text extracted :) In technical terms, a page will be marked as a blank page if, after successful OCR, this model field is empty :) - as simple as that :)
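The rule described above could be sketched roughly as follows. This is only an illustration: the `Page` class and its `ocr_done` and `text` fields are hypothetical stand-ins, not Papermerge's actual model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Hypothetical stand-in for a page record; not Papermerge's real model.
    number: int
    ocr_done: bool   # True once OCR finished successfully
    text: str = ""   # text extracted by OCR


def is_blank(page: Page) -> bool:
    """A page counts as blank only if OCR succeeded and found no text."""
    return page.ocr_done and not page.text.strip()


def drop_blank_pages(pages: List[Page]) -> List[Page]:
    """Keep every page that is not considered blank."""
    return [p for p in pages if not is_blank(p)]
```

Pages whose OCR has not completed are kept, so a failed or pending OCR run never causes a deletion.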

Myrc commented 3 years ago

The way I plan to "automatically delete blank pages" is simply to delete pages with successfully completed OCR but with no text extracted :) In technical terms, a page will be marked as a blank page if, after successful OCR, this model field is empty :) - as simple as that :)

Wouldn't that remove all pages with messy handwriting, pictures and other hard-to-OCR content?

jacz24 commented 2 years ago

The way I plan to "automatically delete blank pages" is simply to delete pages with successfully completed OCR but with no text extracted :) In technical terms, a page will be marked as a blank page if, after successful OCR, this model field is empty :) - as simple as that :)

Wouldn't that remove all pages with messy handwriting, pictures and other hard-to-OCR content?

It actually works quite well, but it can work a little better in combination with other checks. I made a blank page detection system for my personal document manager that worked fairly well, all things considered. I could share the code if this hasn't already been solved through Automates. I think with a little modification it would work fine with the current codebase.

@ciur
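
One way the "combination" mentioned above might look is sketched below. This is not jacz24's code, which was not posted in this thread; it is an assumed pairing of the empty-OCR check with a dark-pixel check, so that pictures or handwriting that OCR cannot read are not deleted. The function name, parameters and thresholds are hypothetical.

```python
from typing import Sequence


def is_blank_combined(
    ocr_text: str,
    pixels: Sequence[int],
    dark_threshold: int = 128,
    max_ink: float = 0.02,
) -> bool:
    """Blank only if OCR found no text AND the page has almost no dark pixels.

    pixels is assumed to hold 8-bit grayscale values (0 = black, 255 = white).
    """
    if ocr_text.strip():
        return False  # OCR found text, so the page has content
    if not pixels:
        return False  # no image data: be conservative and keep the page
    dark = sum(1 for value in pixels if value < dark_threshold)
    return dark / len(pixels) < max_ink
```

With this rule, a photo page that yields no OCR text still survives, because its share of dark pixels is far above the 2% cut-off assumed here.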