Enhance image before sending to Tesseract for better OCR accuracy

0xbad1d3a5 commented 7 years ago

@Shimizoki graciously sent over some otsu thresholding code in Java that should clean up the image a little:

    @Override
    public void run() {
        // ...
        long startTimeBitmap = System.currentTimeMillis();
        Bitmap gray = ColorToGrayscale(mBitmap);
        mBitmap = OtsuThreshold(gray).copy(Bitmap.Config.ARGB_8888, true);
        Log.d(TAG, String.format("Bitmap processing took: %d", System.currentTimeMillis() - startTimeBitmap));
    }

    public static Bitmap ColorToGrayscale(Bitmap bm) {
        Bitmap grayScale = Bitmap.createBitmap(bm.getWidth(), bm.getHeight(), Bitmap.Config.RGB_565);

        ColorMatrix cm = new ColorMatrix();
        cm.setSaturation(0);

        Paint p = new Paint();
        p.setColorFilter(new ColorMatrixColorFilter(cm));

        new Canvas(grayScale).drawBitmap(bm, 0, 0, p);

        return grayScale;
    }

    public static Bitmap GrayscaleToBin(Bitmap bm, int threshold) {
        Bitmap bin = Bitmap.createBitmap(bm.getWidth(), bm.getHeight(), Bitmap.Config.RGB_565);

        ColorMatrix cm = new ColorMatrix(new float[] {
                85.f, 85.f, 85.f, 0.f, -255.f * threshold,
                85.f, 85.f, 85.f, 0.f, -255.f * threshold,
                85.f, 85.f, 85.f, 0.f, -255.f * threshold,
                0f, 0f, 0f, 1f, 0f
        });

        Paint p = new Paint();
        p.setColorFilter(new ColorMatrixColorFilter(cm));

        new Canvas(bin).drawBitmap(bm, 0, 0, p);

        return bin;
    }

    public static Bitmap OtsuThreshold (Bitmap bm) {

        // Get Histogram
        int[] histogram = new int[256];
        for(int i = 0; i < histogram.length; i++) histogram[i] = 0;

        for(int i = 0; i < bm.getWidth(); i++) {
            for(int j = 0; j < bm.getHeight(); j++) {
                histogram[(bm.getPixel(i, j) & 0xFF0000) >> 16]++;
            }
        }

        // Get binary threshold using Otsu's method

        int total = bm.getHeight() * bm.getWidth();

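        // Accumulate in double: on a full-screen capture these sums exceed
        // what a float can represent exactly.
        double sum = 0;
        for (int i = 0; i < 256; i++) sum += (double) i * histogram[i];

        double sumB = 0;
        int wB = 0; // number of pixels at or below the candidate threshold
        int wF = 0; // number of pixels above the candidate threshold

        double varMax = 0;
        int threshold = 0;

        for (int i = 0; i < 256; i++) {
            wB += histogram[i];
            if (wB == 0) continue;
            wF = total - wB;

            if (wF == 0) break;

            sumB += (double) i * histogram[i];
            double mB = sumB / wB;
            double mF = (sum - sumB) / wF;

            // Between-class variance, up to a constant factor
            double varBetween = (double) wB * wF * (mB - mF) * (mB - mF);

            if (varBetween > varMax) {
                varMax = varBetween;
                threshold = i;
            }
        }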

        return GrayscaleToBin(bm, threshold);
    }

I didn't have time to test for accuracy, but here's the rough performance (Axon 7; bitmap processing times are in milliseconds):

D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: X:37 Y:1287 (1340x155)
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: Image Dimensions: 1440x2560
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: /storage/emulated/0/Android/data/ca.fuwafuwa.kaku/files/screenshots/screen 865587004995899.png
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: Bitmap processing took: 221
I/art: Do partial code cache collection, code=112KB, data=100KB
I/art: After code cache collection, code=112KB, data=100KB
I/art: Increasing code cache capacity to 512KB
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: THREAD STARTING NEW LOOP
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: WAITING
E/ca.fuwafuwa.kaku.MainServiceHandler: `[01/10/2017]でれたのしいができる
                                       Screenshot Time: 44
                                       OcrTime: 3549

D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: X:26 Y:858 (1379x1121)
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: Image Dimensions: 1440x2560
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: /storage/emulated/0/Android/data/ca.fuwafuwa.kaku/files/screenshots/screen 865598380184636.png
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: Bitmap processing took: 1400
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: THREAD STARTING NEW LOOP
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: WAITING
E/ca.fuwafuwa.kaku.MainServiceHandler: フィリピンでは19、のマラでのおりがあります。いキリストのをせたがのをります。このにると、がるなどの「がある」とわれています。によると、のおりには150ぐらいがしました。のが、にるためにの〈へこうとしました。にることができないためヽのにいるにオルをげるもいました。げたオルでにってもらってヽそのオルをのなどにします。フィリピンのによるとヽこのおりで10O0ぐらいがけがをしたりヽさでが〈なったりしました。
                                       Screenshot Time: 50
                                       OcrTime: 17896

D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: X:26 Y:858 (1389x277)
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: Image Dimensions: 1440x2560
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: /storage/emulated/0/Android/data/ca.fuwafuwa.kaku/files/screenshots/screen 865653288799533.png
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: Bitmap processing took: 354
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: THREAD STARTING NEW LOOP
D/ca.fuwafuwa.kaku.Ocr.OcrRunnable: WAITING
E/ca.fuwafuwa.kaku.MainServiceHandler: フィリピンでは19、のマラでのおりがあります。いキリストのをせたがのをります。このにると、がるなどの「がある」とわれています。
                                       Screenshot Time: 40
                                       OcrTime: 4826

For now, though, I don't think I'll be committing this code: rather than a quick fix now, I'd rather wait until a good image processing pipeline can be designed and implemented. It seems Tesseract already uses the Otsu method for binarizing images, so I don't think this method would be the most effective way to process the capture either.

Leptonica, which is also in the tess-two library, seems to have some nice image processing functions we might be able to take advantage of (it is written in C, so it should be faster than a Java implementation).

https://groups.google.com/forum/#!topic/tesseract-ocr/JRwIz3xL45U

Shimizoki commented 7 years ago

Tesseract does seem to do otsu binarization (we are pretending this is a word) on the input image, but as far as I can tell that is all it does.

In simple tests on the sample image in #5, running the above code decreased the number of false positives on noisy backgrounds. This would imply that Tesseract either is not performing the pre-processing (perhaps it is accidentally turned off) or does not do it as well. The above code is a stopgap to increase efficiency for the data I was using, though ideally, as you said, a full pipeline should be implemented.

For cleaning up an image you often want to:

De-skew (to make the text horizontal)
Convert to Grayscale
De-noise (sometimes this is a Gaussian blur, other times a shrink and expand)
Convert to Binary (Adaptive / Local is the way to go)
Color Correctness (make sure the image is black on white for easier parsing)

Depending on what the purpose of the cleanup is... you may swap the order around, or run some steps multiple times. For something like Japanese which has many small dots and dashes as part of the language... it may be difficult to de-noise without first identifying potential characters and removing outliers.
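
As a minimal sketch of the last step in the list (making sure the image is black on white), assuming the image has already been binarized into a row-major array of 0 and 255 values: the class and method names here are hypothetical, and the "background covers more area than the text" rule is an assumption that will not hold for every capture.

```java
final class Polarity {
    /** Returns a copy of a binary image (0 or 255 per pixel) with dark text on a light background. */
    static int[] ensureBlackOnWhite(int[] binary) {
        int dark = 0;
        for (int p : binary) {
            if (p == 0) dark++;
        }
        // Assumption: the background covers more area than the text,
        // so a mostly dark image is treated as light text on a dark background.
        boolean invert = (long) dark * 2 > binary.length;
        int[] out = new int[binary.length];
        for (int i = 0; i < binary.length; i++) {
            out[i] = invert ? 255 - binary[i] : binary[i];
        }
        return out;
    }
}
```

A mostly black input is inverted and a mostly white input is returned unchanged, so the later steps always see the same polarity.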

0xbad1d3a5 commented 7 years ago

otsu binarization (we are pretending this is a word)

Ah, oops. Would the correct term be Otsu thresholding then? I'm not particularly familiar with image processing, unfortunately.

We could possibly even display the pre-processed image in the capture box (instead of having it be transparent) before OCR if pre-processing can be done in a fairly short time (<500ms?), then have some sort of simple interface to allow users to adjust the threshold values by a simple mechanism (i.e., tap & hold capture box, then drag to increase/decrease threshold). This sort of manual intervention could be advantageous seeing that there usually isn't a one-size-fits-all solution to these sorts of problems.

Shimizoki commented 7 years ago

I wasn't calling you out on anything. I just don't think Binarization is a word, but it got the concept across that I wanted it to. Thresholding does not imply a binary image... so I don't think that word is correct either. "Using the Otsu algorithm to create a binary image" is just a mouthful. So I am all for making up words that fit the situation.

I am going to roughly define some terms, so forgive me if you know them already:

Otsu thresholding takes a grayscale image and reads its histogram to determine the optimal threshold level for that image: the level that best separates the pixels into a dark group and a light group.

A histogram is a graph showing how much of the image is made up of each gray level. So if an image is mostly black with a few white dots, the graph will be taller on the "dark" side. If the image is split pretty evenly between almost white and almost black, the graph might look like a U.
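
A histogram in this sense can be sketched in a few lines, assuming the image is already available as an array of gray values in the range 0 to 255 (the class and method names are hypothetical; on Android the values would have to be extracted from the bitmap first):

```java
final class GrayHistogram {
    /** Counts how many pixels fall on each of the 256 gray levels. */
    static int[] histogram(int[] grayPixels) {
        int[] counts = new int[256]; // Java zero-initializes new arrays
        for (int g : grayPixels) {
            counts[g & 0xFF]++; // mask keeps the index inside 0..255
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical 10-pixel image: eight black pixels and two white ones.
        int[] pixels = {0, 0, 0, 0, 0, 0, 0, 0, 255, 255};
        int[] h = histogram(pixels);
        System.out.println("dark=" + h[0] + " light=" + h[255]);
    }
}
```

For the mostly black example image, the count at level 0 is much larger than the count at level 255, which is the "taller on the dark side" shape described above.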

What you do with the threshold is up to you, and technically has no bearing on the algorithm itself. We are choosing to create a binary image... but we could just as well turn everything above that threshold white and leave everything below it grayscale. Or apply it to the original color image to make everything below the threshold grayscale (which would create pops of color).
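
To illustrate that the threshold and its use are separate decisions, here is a small sketch of two of those options on an array of gray values (0 to 255). The names are hypothetical; the "strictly above the threshold is white" convention is chosen to match what GrayscaleToBin does for gray input.

```java
final class ThresholdUses {
    /** Binary image: pixels above the threshold become white (255), the rest black (0). */
    static int[] toBinary(int[] gray, int threshold) {
        int[] out = new int[gray.length];
        for (int i = 0; i < gray.length; i++) {
            out[i] = gray[i] > threshold ? 255 : 0;
        }
        return out;
    }

    /** Pixels above the threshold become white; everything else keeps its gray value. */
    static int[] whitenAbove(int[] gray, int threshold) {
        int[] out = new int[gray.length];
        for (int i = 0; i < gray.length; i++) {
            out[i] = gray[i] > threshold ? 255 : gray[i];
        }
        return out;
    }
}
```

Both methods take the same threshold, so whichever algorithm produced it (Otsu or a user-chosen value) does not need to know how it will be applied.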

What this means is that Otsu IS the algorithm for a one-size-fits-all approach, because it calculates the best values for you. (In the simplest sense)

Now of course it can be improved upon in various ways... One such way finds a different threshold for every pixel on the image based off surrounding pixels. This means that if the image is partially shadowed, it gets normalized to remove the shadow and then turned black/white. (more or less)
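
A rough sketch of the per-pixel idea, with the caveat that it swaps in a simpler technique: instead of running Otsu locally, each pixel is compared with the mean of a square window around it, minus an offset. The names, the window radius, and the offset are all assumptions made for illustration, and the image is assumed to be a row-major array of gray values (0 to 255).

```java
final class LocalThreshold {
    /**
     * Binarizes a grayscale image by comparing each pixel with the mean of the
     * (2 * radius + 1)-wide square window around it, minus an offset.
     */
    static int[] binarize(int[] gray, int width, int height, int radius, int offset) {
        // Integral image: sums[y][x] is the sum of all pixels above and to the left of (x, y).
        long[][] sums = new long[height + 1][width + 1];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                sums[y + 1][x + 1] = gray[y * width + x]
                        + sums[y][x + 1] + sums[y + 1][x] - sums[y][x];
            }
        }
        int[] out = new int[gray.length];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Clamp the window to the image borders.
                int x0 = Math.max(0, x - radius);
                int x1 = Math.min(width, x + radius + 1);
                int y0 = Math.max(0, y - radius);
                int y1 = Math.min(height, y + radius + 1);
                long windowSum = sums[y1][x1] - sums[y0][x1] - sums[y1][x0] + sums[y0][x0];
                long count = (long) (x1 - x0) * (y1 - y0);
                // Same test as "pixel > mean - offset", written without a division.
                boolean light = (long) gray[y * width + x] * count > windowSum - (long) offset * count;
                out[y * width + x] = light ? 255 : 0;
            }
        }
        return out;
    }
}
```

In a one-row test image such as {200, 120, 200, 200, 100, 100, 30, 100}, the lit "text" pixel (120) is brighter than the shadowed background (100), so no single global threshold separates text from background, while the local comparison with radius 1 and offset 40 marks only the two text pixels as black.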

You are correct, we absolutely could allow the user to change the threshold value, but in theory unless they are doing it per character something like a local Otsu would be more efficient. (Possibly... needs testing, but I like the idea for power users.) Another method would be to allow the user to "Paint" a section black that they know is going to have issues in the OCR. "Whoops... the threshold left some white dots over here which is going to mess things up. Let's just go ahead and remove those."

AllanHasegawa commented 7 years ago

Hey o/

I could try a couple of different techniques. However, to compare them it would be nice to have a set of test images. Is there such a set?

Or some examples where the current implementation fails?

0xbad1d3a5 commented 5 years ago

https://github.com/jasonlfunk/ocr-text-extraction/blob/master/extract_text

http://www.m.cs.osakafu-u.ac.jp/cbdar2007/proceedings/papers/O1-1.pdf

KonstantinDjairo commented 1 year ago

you could use manga_ocr

that would make Kaku very accurate, idk why but it's so inaccurate that I can't even use it

0xbad1d3a5 / Kaku

Enhance image before sending to Tesseract for better OCR accuracy #6