4lex4 / scantailor-advanced

ScanTailor Advanced merges the features of the ScanTailor Featured and ScanTailor Enhanced versions, and adds new features and fixes.
GNU General Public License v3.0

Despeckle by area size #70

Open Piolie opened 5 years ago

Piolie commented 5 years ago

The current despeckle algorithm works well most of the time. However, I have seen that it fails even for tiny particles if they are very close to the rest of the content (for example, between text lines). Raising the Despeckle level does not improve the result. On the contrary, the algorithm starts eating away the dots over the letter i and the full stops.

I think it would be nice to have the option to erase all black/white areas that have a pixel count below a settable threshold.

Currently this can be achieved by applying ImageMagick's connected-component labeling on the output of ScanTailor. The license is compatible with the GPL, so maybe it is easy to implement here.

4lex4 commented 5 years ago

I don't need that implementation, as ST already has a connected-component labeling implementation and uses it internally.

I'll just add an option to despeckle named threshold: all the components with size lower than the threshold value will be removed no matter where they are placed.
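The threshold idea described above can be sketched in a few lines. This is only an illustration in plain Python, not ScanTailor's C++ code: the function name `remove_small_components`, its parameters, and the list-of-lists image format (1 = foreground pixel) are all assumptions made for the example.

```python
from collections import deque

# Hypothetical helper, not part of ScanTailor: erase every connected
# component whose pixel count is lower than min_area.
def remove_small_components(image, min_area, connectivity=8):
    height = len(image)
    width = len(image[0]) if height else 0
    offsets = [(-1, 0), (0, -1), (0, 1), (1, 0)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]

    result = [row[:] for row in image]  # work on a copy
    visited = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or visited[y][x]:
                continue
            # Flood-fill one component and collect its pixels.
            component = [(y, x)]
            visited[y][x] = True
            queue = deque(component)
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in offsets:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not visited[ny][nx]):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
                        component.append((ny, nx))
            # Size is the only criterion; position is ignored.
            if len(component) < min_area:
                for py, px in component:
                    result[py][px] = 0
    return result


page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
cleaned = remove_small_components(page, min_area=2)
# The 2x2 block survives; both isolated pixels are erased.
```

Because only the pixel count is tested, a speck between text lines is removed just like one in the margin, while a dot over an i survives as long as its area is at or above the threshold.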

Mister-Teatime commented 5 years ago

Not sure if the implementation would also allow for the following, which I would also find useful in this context:

Remove components thinner than a certain number of pixels, so that, e.g., a hair could be removed even if it produces a long structure and covers more pixels than a printed dot, as long as it is thinner than any printed line.

As I said, I don't know if the maths for it is already implemented, but it could be done based on the number of pixels within the structure per pixel on the edge (1 for a single-pixel line, 2 for 2-pixel lines, etc.), or on the distance of "inner" pixels from the edge. Or maybe there's a smarter algorithm in either ImageMagick or ImageJ.
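The second suggestion, measuring how far the "inner" pixels are from the edge, could look roughly like the sketch below. It is an assumption-laden illustration, not code from ScanTailor, ImageMagick, or ImageJ: the names `edge_distance`, `remove_thin_components`, and `max_depth` are invented here, distances are city-block steps, and pixels outside the image are treated as background.

```python
from collections import deque

N4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Hypothetical helper: for each foreground pixel (1), the number of
# 4-connected steps to the nearest background pixel (edge pixels get 1).
def edge_distance(image):
    h = len(image)
    w = len(image[0]) if h else 0
    dist = [[0] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if image[y][x] != 1:
                continue
            for dy, dx in N4:
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or image[ny][nx] != 1:
                    dist[y][x] = 1  # touches background: edge pixel
                    queue.append((y, x))
                    break
    while queue:  # breadth-first growth inward from the edge pixels
        y, x = queue.popleft()
        for dy, dx in N4:
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w
                    and image[ny][nx] == 1 and dist[ny][nx] == 0):
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist

# Hypothetical helper: erase 8-connected components whose deepest pixel
# is at most max_depth steps from the background, whatever their area.
def remove_thin_components(image, max_depth):
    h = len(image)
    w = len(image[0]) if h else 0
    dist = edge_distance(image)
    result = [row[:] for row in image]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if image[y][x] != 1 or seen[y][x]:
                continue
            component = [(y, x)]
            seen[y][x] = True
            queue = deque(component)
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in N8:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
                        component.append((ny, nx))
            if max(dist[py][px] for py, px in component) <= max_depth:
                for py, px in component:
                    result[py][px] = 0
    return result
```

With `max_depth=1`, lines one or two pixels wide are erased regardless of length, while a 3x3 dot is kept because its centre pixel lies two steps from the background.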

4lex4 commented 4 years ago

I already use the algorithm you described in the noise reduction of the color segmentation for removing long thin components. Yes, I am thinking of implementing the new option this way.