DanBloomberg / leptonica

Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications. The official github repository for Leptonica is: danbloomberg/leptonica. See leptonica.org for more documentation.

Identifying and removing asterisks #736

Open Elias-Est opened 4 months ago

Elias-Est commented 4 months ago

Hi,

I'm currently trying to remove asterisks from a scanned document because they disturb some OCR operations of the software I'm working on (see the images below).

The approach I tried is to convert the image to binary and find the asterisks via pixHMT so I can remove the area around them. However, I found out that the asterisks vary too much to easily define a reliable pattern: The Sels that I created are either too strict or too generic, leading to a lot of false positives and/or negatives.

Now I'm wondering whether it even makes sense to continue with this approach, or whether there are better ways to identify (and remove) the asterisks. Or are there maybe some tips you can give me that would make it easier to find a suitable pattern?

Thank you

Elias


This is, for example, a (small) part of a document with unwanted asterisks: [image: asterisks]

And the binary version: [image: asterisks - binary]

DanBloomberg commented 4 months ago

Those asterisks seem regular enough to identify with a good HM sel (structuring element). Can you show me the HM sel that you used? For an example of how to generate these HM sels, be sure to see prog/livre_hmt.c and run it with the parameters that were used for the figures:

   livre_hmt 1 8
   livre_hmt 2 4

DanBloomberg commented 4 months ago

Also, look at prog/findpattern_reg.c

DanBloomberg commented 4 months ago

Also, if the asterisks always come in strings of at least 3, I would use a string of at least 3 of them for the HM sel. That will make it more robust against false positives.

Then after the HM transform, dilate by the image that was used to make the HM sel. Then you have 2 choices to remove the asterisks from the input image: (1) do a small dilation on the result and then subtract it from the input image (2) use the result as a seed, fill with clipping to the input image, and then subtract that from the input image.

Elias-Est commented 4 months ago

Hi @DanBloomberg,

thanks for your response. I looked at the classes you mentioned and how they use the "pixGenerateSelBoundary" method. Using that information, I was able to create a Sel with which I can detect the asterisks quite reliably now. Maybe I'll do some fine-tuning for even better results, but for now, they are sufficient.

Originally, I tried to find the common parts of the asterisks by hand and created Sels with "selCreateFromString", which probably was a bit too optimistic.

Unfortunately, some documents contain single asterisks so using strings of them is not an option.

For removing the found asterisks from the document image, I tried your first suggestion (I'm using pixDilateBrick to dilate), and it works fine, too.

Thank you very much for your help and have a nice weekend.

DanBloomberg commented 4 months ago

Glad it worked out! I wrote those functions for creating HM sels from a bitmap because it's much easier than making them by hand, one hit or miss at a time.

Were you able to make a HM sel for a single asterisk that didn't produce many false positives? The asterisk has a distinctive shape, so my guess is that it is possible. Note that a horizontal line under an asterisk can cause trouble if your sel has a 'miss' too far below it, so when you generate the sel with `pixGenerateSelBoundary()`, use `botflag = 0` so that you don't add extra background below the template.

Elias-Est commented 4 months ago

To improve the detection rate while keeping the false positives to a minimum, I'm actually using two sels now, both created with pixGenerateSelBoundary using strict values. This works better than using a single looser one.

Regarding the horizontal line: Removing the asterisks above it works the way you describe. However, I've noticed that the line causes an even larger problem: It makes the numbers and the "EUR" above it unrecognizable for our OCR engine. I tried to remove it using the procedure described in http://www.leptonica.org/line-removal.html, but unfortunately, the lower parts of the characters are fused with the horizontal line so removing the line only leaves unrecognizable leftovers. Do you have any idea what I could do or is this a problem that can't be solved via image processing?

Without dilation

above-line

With dilation (2, 2)

above-line-dilate

DanBloomberg commented 4 months ago

Not easy.

However, there is something you can try for this particular problem. (1) Find the bounding boxes of connected components, using pixConnCompBB() (2) Extend each box down about 6 pixels to include the line below each, using boxaAdjustBoxSizes() (3) Use the boxa of expanded boxes to extract from the original image. One way to do this:

      pix1 = pixCreateTemplate(pixs);      /* pixs is the original image */
      pixMaskBoxa(pix1, pix1, boxa, L_SET_PIXELS);  /* set fg inside each box */
      pixAnd(pix1, pix1, pixs);            /* keep only pixs content in the boxes */

or, alternatively,

     pixa1 = pixClipRectangles(pixs, boxa);
     pixGetDimensions(pixs, &w, &h, NULL);
     pix1 = pixaDisplay(pixa1, w, h);
     pixaDestroy(&pixa1);

(Because this operation (3) isn't obvious, and it should be easy, I'll add a function that does it)

This unfortunately puts the line under the "R", changing it to a "B". It might also make some numbers unrecognizable, like "4" and "7".

Elias-Est commented 4 months ago

Sorry for the late reply; I was finally able to try out what you suggested.

As you already mentioned, this approach improves the readability of some characters but decreases it for others. Unfortunately, as I noticed, the latter seems to be more often the case than the former.

However, in most of the documents, the line is either clearly separated from the text above or the overlap is small enough to still have recognizable characters after removing the line. Therefore, I decided to just remove the line and put up with the occasional failures if too much of the characters above is cut off.

Still, it was worth a try; thank you again for your help and have a nice weekend.