KewBridge / LightfootCatalogue


Splitting two-page images into separate pages #5

Open ipiriyan2002 opened 4 days ago

ipiriyan2002 commented 4 days ago

Given that the catalogue images shared by Marie-Hellen contain two pages in a single image, the model will not be able to read/transcribe them accurately. Furthermore, batch inferencing would not be possible.

It is best to look into adding a pipeline feature that splits these images into two separate single-page images (see the sketch below).
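For concreteness, a minimal sketch of such a split step, assuming the gutter sits near the horizontal centre of the scan (the file names are hypothetical, and real spreads may need an offset or a detected gutter position instead of the midline):

from PIL import Image

def split_spread(path):
    """Split a two-page spread down the vertical midline (naive baseline)."""
    spread = Image.open(path)
    width, height = spread.size
    # Assumes the gutter is at the exact centre; real scans may need tuning.
    left = spread.crop((0, 0, width // 2, height))
    right = spread.crop((width // 2, 0, width, height))
    return left, right

left, right = split_spread("catalogue_spread.jpg")
left.save("catalogue_page_left.jpg")
right.save("catalogue_page_right.jpg")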

ipiriyan2002 commented 4 days ago

I have looked into using the QWEN model to detect the page bounding boxes, but as the HPC is down, no further tests could be done. It currently does not look feasible, so I will be investigating traditional methods such as edge detection.
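For what it's worth, one traditional baseline (a vertical projection profile rather than true edge detection) is to binarise the scan, sum the ink per column, and take the emptiest column near the centre as the gutter. A rough sketch, assuming OpenCV and NumPy; the middle-third search window and Otsu thresholding are illustrative choices, not something tested on these scans:

import cv2
import numpy as np

def find_gutter(path):
    """Estimate the gutter x-position via a vertical projection profile."""
    grey = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Binarise so text pixels are 1 and background is 0.
    _, binary = cv2.threshold(grey, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_column = binary.sum(axis=0)
    # Search only the middle third of the image for the emptiest column.
    w = grey.shape[1]
    window = ink_per_column[w // 3: 2 * w // 3]
    return w // 3 + int(np.argmin(window))

gutter_x = find_gutter("catalogue_spread.jpg")  # hypothetical file name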

ipiriyan2002 commented 4 days ago

Traditional methods were tested and show promise: the current prototype splits most of the images correctly (80% or more). More fine-tuning is needed to work out why some of the images fail to split.

ipiriyan2002 commented 3 days ago

Just splitting two-page images into one-page images does not seem to improve the accuracy. Due to the extra background noise, such as the book cover and the blank white spaces, the text is too small/blurry for the OCR to read easily.

Running a smaller pre-trained model fine-tuned for multi-column detection might solve this problem. Currently looking into Tesseract OCR, but it might be inaccurate given the extra noise.

ipiriyan2002 commented 3 days ago

Further to this, reducing the image resolution to 900x1600 (with the background noise still present) resolved the accuracy and transcription issues, so another pre-trained model is not needed.
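For reference, the resize step can be a one-liner with Pillow. Whether the pipeline forces the exact 900x1600 size or preserves the aspect ratio is not stated above, so this sketch simply forces it:

from PIL import Image

image = Image.open("catalogue_page.jpg")  # hypothetical file name
# Force the page down to 900x1600; LANCZOS keeps small text as legible as possible.
small = image.resize((900, 1600), Image.LANCZOS)
small.save("catalogue_page_900x1600.jpg")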

nickynicolson commented 23 hours ago

@ipiriyan2002 this uses tesseract to construct a box around all detected text (so extracting on this basis would remove unnecessary parts of the image). I guess it could be modified to allocate the blocks into pages / columns; there is a sketch of that after the code below.

from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, PSM, RIL

# Sanity check: list the language packs available to tesseract.
with PyTessBaseAPI() as api:
    print(api.GetAvailableLanguages())

with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
    image = Image.open("The_Lightfoot_Herbarium_34.jpg")
    api.SetImage(image)
    api.Recognize()

    draw = ImageDraw.Draw(image)

    # Track the bounding box enclosing all detected text blocks:
    # (min_x, min_y) is the top-left corner, (max_x, max_y) the bottom-right.
    min_x = None
    min_y = None
    max_x = None
    max_y = None
    boxes = api.GetComponentImages(RIL.BLOCK, True)
    for i, (im, box, _, _) in enumerate(boxes):
        # im is a PIL image object
        # box is a dict with x, y, w and h keys
        min_x = box['x'] if min_x is None else min(min_x, box['x'])
        min_y = box['y'] if min_y is None else min(min_y, box['y'])
        max_x = box['x'] + box['w'] if max_x is None else max(max_x, box['x'] + box['w'])
        max_y = box['y'] + box['h'] if max_y is None else max(max_y, box['y'] + box['h'])
        # Outline each individual text block in green.
        draw.rectangle([box['x'], box['y'], box['x'] + box['w'], box['y'] + box['h']], outline='green', width=3)

    # Outline the overall text region in red.
    draw.rectangle([min_x, min_y, max_x, max_y], outline='red', width=3)

    image.save('output_with_blocks.jpg')
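Following up on the modification suggested above, a minimal sketch (not tested on these scans) of allocating the detected blocks to pages by comparing each block's horizontal centre to the image midline; a detected gutter position would be more robust than the midline:

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, RIL

with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
    image = Image.open("The_Lightfoot_Herbarium_34.jpg")
    api.SetImage(image)
    api.Recognize()

    midline = image.width / 2
    pages = {'left': [], 'right': []}
    for _, box, _, _ in api.GetComponentImages(RIL.BLOCK, True):
        # Allocate each block to a page by its horizontal centre.
        centre = box['x'] + box['w'] / 2
        pages['left' if centre < midline else 'right'].append(box)

    # Crop each page to the extent of its allocated blocks.
    for side, blocks in pages.items():
        if not blocks:
            continue
        left = min(b['x'] for b in blocks)
        top = min(b['y'] for b in blocks)
        right = max(b['x'] + b['w'] for b in blocks)
        bottom = max(b['y'] + b['h'] for b in blocks)
        image.crop((left, top, right, bottom)).save(f'page_{side}.jpg')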