I have looked into using the QWEN model to detect the bounding boxes, but as the HPC is down, no further tests could be done. For now this does not seem feasible, so I will be investigating traditional methods such as edge detection.
Traditional methods were tested and show promise. Most of the images (80% or more) could be split with the current prototype. More fine-tuning is needed to work out why some images fail to split.
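For reference, a minimal sketch of one traditional splitting approach along these lines, assuming OpenCV is available and that the gutter lies near the horizontal centre of the scan; the filename and thresholds are illustrative, not the actual prototype:

# Minimal sketch of an edge-based split, assuming OpenCV (cv2) is available.
# The filename and the assumption that the gutter lies near the horizontal
# centre of the scan are illustrative, not the actual prototype.
import cv2
import numpy as np

img = cv2.imread("double_page_scan.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Sum edge pixels per column; the gutter between pages tends to have few edges.
column_profile = edges.sum(axis=0)

# Search only the middle third of the image for the quietest column.
width = img.shape[1]
lo, hi = width // 3, 2 * width // 3
split_x = lo + int(np.argmin(column_profile[lo:hi]))

left_page = img[:, :split_x]
right_page = img[:, split_x:]
cv2.imwrite("page_left.jpg", left_page)
cv2.imwrite("page_right.jpg", right_page)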
Just splitting two-page images into single-page images does not seem to improve accuracy. Because of the extra background noise, such as the book cover and the blank white spaces, the text is too small and blurry to be read reliably by the OCR.
Running a less powerful pre-trained model fine-tuned for multi-column detection might solve this problem. Currently looking into Tesseract OCR, although it may be inaccurate with the extra noise.
Further to this, reducing the image resolution to 900x1600 (while keeping the background noise) resolved the accuracy and transcription issues, so another pre-trained model is not needed.
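Roughly what that resolution step looks like, assuming Pillow and taking 900x1600 to mean a 900-pixel-wide by 1600-pixel-tall target; the filename is a placeholder:

# Sketch of the resize step before OCR, assuming Pillow; the filename and
# the width-by-height interpretation of 900x1600 are assumptions.
from PIL import Image

page = Image.open("split_page.jpg")
page = page.resize((900, 1600), Image.LANCZOS)
page.save("split_page_900x1600.jpg")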
@ipiriyan2002 this uses tesseract to construct a box around all detected text (so extracting on this basis would remove unnecessary parts of the image) - I guess it could be modified to allocate the blocks into pages / columns.
from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, PSM, RIL
with PyTessBaseAPI() as api:
    print(api.GetAvailableLanguages())

with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
    image = Image.open("The_Lightfoot_Herbarium_34.jpg")
    api.SetImage(image)
    api.Recognize()

    draw = ImageDraw.Draw(image)

    # x, y track the minimum top-left corner and w, h the maximum
    # bottom-right corner across all detected text blocks.
    x = None
    y = None
    w = None
    h = None

    boxes = api.GetComponentImages(RIL.BLOCK, True)
    for i, (im, box, _, _) in enumerate(boxes):
        # im is a PIL image object
        # box is a dict with x, y, w and h keys
        x = box['x'] if x is None else min(x, box['x'])
        y = box['y'] if y is None else min(y, box['y'])
        w = box['x'] + box['w'] if w is None else max(w, box['x'] + box['w'])
        h = box['y'] + box['h'] if h is None else max(h, box['y'] + box['h'])
        # Green box around each individual text block.
        draw.rectangle([box['x'], box['y'], box['x'] + box['w'], box['y'] + box['h']], outline='green', width=3)

    # Red box around the union of all text blocks (the region of interest).
    draw.rectangle([x, y, w, h], outline='red', width=3)
    image.save('output_with_blocks.jpg')
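One possible way to extend the snippet above along the lines suggested: allocate each detected block to a left or right page by comparing its centre with the midpoint of the overall text region. This assumes the gutter sits near that midpoint; the variable names are illustrative only.

# Possible extension (a sketch, not the actual method): assign each block
# to a page using the midpoint of the overall text region found above.
mid_x = (x + w) / 2
left_blocks, right_blocks = [], []
for _, box, _, _ in boxes:
    centre = box['x'] + box['w'] / 2
    (left_blocks if centre < mid_x else right_blocks).append(box)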
This issue is closed now, as the problem of splitting a double-page image into two separate single-page images was solved by the following method.
The images were first cleaned and the region of interest was captured using Tesseract OCR.
The region of interest was then split into two pages by finding the cut-off point using gradient thresholding and a Hough line transform.
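A rough sketch of what that gutter-finding step could look like, assuming OpenCV; the thresholds, filename and the restriction to near-vertical lines near the centre are illustrative rather than the exact values used:

# Sketch of gutter detection via gradient thresholding + Hough lines,
# assuming OpenCV (cv2); thresholds and filename are illustrative.
import cv2
import numpy as np

roi = cv2.imread("roi.jpg", cv2.IMREAD_GRAYSCALE)

# Gradient thresholding: keep only strong horizontal intensity changes.
grad_x = cv2.Sobel(roi, cv2.CV_64F, 1, 0, ksize=3)
grad = cv2.convertScaleAbs(grad_x)
_, binary = cv2.threshold(grad, 50, 255, cv2.THRESH_BINARY)

# Hough line transform: look for long, near-vertical lines (the gutter).
lines = cv2.HoughLinesP(binary, 1, np.pi / 180, threshold=100,
                        minLineLength=roi.shape[0] // 2, maxLineGap=20)

width = roi.shape[1]
candidates = []
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        # Keep lines that are roughly vertical and near the centre.
        if abs(x1 - x2) < 10 and width // 3 < x1 < 2 * width // 3:
            candidates.append((x1 + x2) // 2)

split_x = int(np.median(candidates)) if candidates else width // 2
left_page, right_page = roi[:, :split_x], roi[:, split_x:]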
Given that the catalogue images shared by Marie-Hellen contain two pages in a single image, the model will not be able to read and transcribe them accurately. Furthermore, batch inferencing would not be possible.
It is best to look into adding a feature in the pipeline that splits these images into two separate images.