Open ciur opened 3 years ago
I'm very excited for this future feature.
I spent hours looking for Python libraries and other frameworks that could do this, but I couldn't really find anything.
Some people were saying to look at the amount of whitespace or colour on the page and, if it is less than 1-2% or so, consider it blank. Others were saying that if the page's file size is low (in KB), consider it blank.
The day when I can scan in all my old files from the last 20 years and separate them with a blank page and automate filing will be a great day for my organization and a great day for recycling ;)
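The coverage heuristic mentioned above could be sketched roughly as follows. This is only an illustration, not code from any library: it assumes the page is already available as a grayscale grid of 0-255 values, and the names `ink_ratio`, `looks_blank`, `dark_threshold` and `max_ink` are made up for this example. The 1% cutoff comes from the "1-2% or so" figure quoted above and would need tuning on real scans.

```python
from typing import Sequence


def ink_ratio(pixels: Sequence[Sequence[int]], dark_threshold: int = 200) -> float:
    """Return the fraction of pixels darker than dark_threshold.

    pixels is a grayscale grid (0 = black, 255 = white). Both the grid
    format and the threshold are assumptions made for this sketch.
    """
    total = sum(len(row) for row in pixels)
    if total == 0:
        return 0.0
    dark = sum(1 for row in pixels for value in row if value < dark_threshold)
    return dark / total


def looks_blank(pixels: Sequence[Sequence[int]], max_ink: float = 0.01,
                dark_threshold: int = 200) -> bool:
    """Treat the page as blank when less than max_ink of it is 'ink'."""
    return ink_ratio(pixels, dark_threshold) < max_ink


# A clean white page, the same page with one speck of scanner dust,
# and a page where every fifth column is black (20% coverage).
white_page = [[255] * 100 for _ in range(100)]
speck_page = [row[:] for row in white_page]
speck_page[0][0] = 0
text_page = [[0 if x % 5 == 0 else 255 for x in range(100)] for _ in range(100)]

print(looks_blank(white_page), looks_blank(speck_page), looks_blank(text_page))
```

A ratio rather than an exact "all white" test is used so that dust and scanner noise do not stop a page from counting as blank.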
@browntownington
Some people were saying to look at the amount of whitespace or colour on the page and, if it is less than 1-2% or so, consider it blank.
Actually, it is way simpler than that. The trick is to detect blank pages not before OCR, but AFTER! The reason is that when you OCR a blank page, the output text string is empty, i.e. there is no text on a blank page :) The way I plan to "automatically delete blank pages" is simply deleting pages with successfully completed OCR but with no text extracted :) In technical terms, a page will be marked as blank page if after successful OCR this model field is empty :) - as simple as that :)
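As a rough sketch of that rule (not the project's actual model or code): the `Page` class, its `ocr_succeeded` and `text` fields, and the helper names below are hypothetical stand-ins for whatever the real page model provides. Treating whitespace-only text as empty is also an assumption of this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical stand-in for the real page model."""
    number: int
    ocr_succeeded: bool
    text: str = ""


def is_blank(page: Page) -> bool:
    """Blank means: OCR finished successfully and extracted no text."""
    return page.ocr_succeeded and not page.text.strip()


def drop_blank_pages(pages: List[Page]) -> List[Page]:
    """Keep every page that is not considered blank."""
    return [page for page in pages if not is_blank(page)]


pages = [
    Page(1, True, "Invoice 2019"),
    Page(2, True, "  \n"),   # OCR succeeded, nothing extracted: blank
    Page(3, False, ""),      # OCR failed: kept, because we cannot tell
    Page(4, True, "Receipt"),
]
print([page.number for page in drop_blank_pages(pages)])  # [1, 3, 4]
```

Pages whose OCR did not complete are deliberately kept, since an empty text field means nothing until OCR has actually run.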
The way I plan to "automatically delete blank pages" is simply deleting pages with successfully completed OCR but with no text extracted :) In technical terms, a page will be marked as blank page if after successful OCR this model field is empty :) - as simple as that :)
Wouldn't that remove all pages with messy handwriting, pictures and other hard-to-OCR content?
The way I plan to "automatically delete blank pages" is simply deleting pages with successfully completed OCR but with no text extracted :) In technical terms, a page will be marked as blank page if after successful OCR this model field is empty :) - as simple as that :)
Wouldn't that remove all pages with messy handwriting, pictures and other hard-to-OCR content?
It actually works quite well. But it can also work a little better with a combination of approaches. I made a blank page detection system that worked fairly well, all things considered. It was for my personal document manager. I could share the code if this hasn't been solved through Automates. I think with a little modification it would work fine with the current codebase.
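The commenter's own code was not posted, so the following is only one guess at what a combination might look like: a page counts as blank only if OCR succeeded with no text AND very little of the page is dark. The function name, its parameters, and the 1% cutoff are all assumptions for illustration; `ink_ratio` is expected to be a precomputed fraction of dark pixels between 0 and 1.

```python
def is_blank_combined(ocr_succeeded: bool, text: str, ink_ratio: float,
                      max_ink: float = 0.01) -> bool:
    """Combine the OCR check with a pixel-coverage check.

    A photo or a handwritten page may yield no OCR text, but it still
    has plenty of dark pixels, so it is not treated as blank here.
    """
    no_text = ocr_succeeded and not text.strip()
    return no_text and ink_ratio < max_ink


# Truly blank scan: no text, almost no ink.
print(is_blank_combined(True, "", 0.001))
# Photo or messy handwriting: no text, but 15% of the page is dark.
print(is_blank_combined(True, "", 0.15))
```

Requiring both conditions trades a few missed blank pages (for example, heavily stained ones) for not deleting pictures or handwriting that OCR could not read.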
@ciur
Post from Reddit:
which brings up a common scenario. I created this issue to keep track of this valid use case and to implement it in a future release.