I have looked into using the QWEN model to detect the bounding boxes, but as the HPC is down, no further tests could be done. For now this does not seem feasible, so I will be investigating traditional methods such as edge detection.
Traditional methods were tested and show promise. Most of the images (80% or more) could be split with the current prototype. More fine-tuning is needed to work out why some images fail to split.
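For reference, a minimal sketch of one traditional splitting approach along these lines, assuming OpenCV is available and that the gutter lies near the horizontal centre of the scan; the filename and thresholds are illustrative, not the actual prototype:

# Minimal sketch of an edge-based split, assuming OpenCV (cv2) is available.
# The filename and the assumption that the gutter lies near the horizontal
# centre of the scan are illustrative, not the actual prototype.
import cv2
import numpy as np

img = cv2.imread("double_page_scan.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Sum edge pixels per column; the gutter between pages tends to have few edges.
column_profile = edges.sum(axis=0)

# Search only the middle third of the image for the quietest column.
width = img.shape[1]
lo, hi = width // 3, 2 * width // 3
split_x = lo + int(np.argmin(column_profile[lo:hi]))

left_page = img[:, :split_x]
right_page = img[:, split_x:]
cv2.imwrite("page_left.jpg", left_page)
cv2.imwrite("page_right.jpg", right_page)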
Just splitting two-page images into single-page images does not seem to improve accuracy. Because of the extra background noise, such as the book cover and the blank white spaces, the text is too small and blurry to be read reliably by the OCR.
Running a less powerful pre-trained model fine-tuned for multi-column detection might solve this problem. Currently looking into Tesseract OCR, although it may be inaccurate with the extra noise.
Further to this, reducing the image resolution to 900x1600 (while keeping the background noise) resolved the accuracy and transcription issues, so another pre-trained model is not needed.
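Roughly what that resolution step looks like, assuming Pillow and taking 900x1600 to mean a 900-pixel-wide by 1600-pixel-tall target; the filename is a placeholder:

# Sketch of the resize step before OCR, assuming Pillow; the filename and
# the width-by-height interpretation of 900x1600 are assumptions.
from PIL import Image

page = Image.open("split_page.jpg")
page = page.resize((900, 1600), Image.LANCZOS)
page.save("split_page_900x1600.jpg")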
@ipiriyan2002 this uses tesseract to construct a box around all detected text (so extracting on this basis would remove unnecessary parts of the image) - I guess it could be modified to allocate the blocks into pages / columns.
from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, PSM, RIL
with PyTessBaseAPI() as api:
    print(api.GetAvailableLanguages())

with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
    image = Image.open("The_Lightfoot_Herbarium_34.jpg")
    api.SetImage(image)
    api.Recognize()

    draw = ImageDraw.Draw(image)

    # x, y track the minimum top-left corner and w, h the maximum
    # bottom-right corner across all detected text blocks.
    x = None
    y = None
    w = None
    h = None

    boxes = api.GetComponentImages(RIL.BLOCK, True)
    for i, (im, box, _, _) in enumerate(boxes):
        # im is a PIL image object
        # box is a dict with x, y, w and h keys
        x = box['x'] if x is None else min(x, box['x'])
        y = box['y'] if y is None else min(y, box['y'])
        w = box['x'] + box['w'] if w is None else max(w, box['x'] + box['w'])
        h = box['y'] + box['h'] if h is None else max(h, box['y'] + box['h'])
        # Green box around each individual text block.
        draw.rectangle([box['x'], box['y'], box['x'] + box['w'], box['y'] + box['h']], outline='green', width=3)

    # Red box around the union of all text blocks (the region of interest).
    draw.rectangle([x, y, w, h], outline='red', width=3)
    image.save('output_with_blocks.jpg')
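One possible way to extend the snippet above along the lines suggested: allocate each detected block to a left or right page by comparing its centre with the midpoint of the overall text region. This assumes the gutter sits near that midpoint; the variable names are illustrative only.

# Possible extension (a sketch, not the actual method): assign each block
# to a page using the midpoint of the overall text region found above.
mid_x = (x + w) / 2
left_blocks, right_blocks = [], []
for _, box, _, _ in boxes:
    centre = box['x'] + box['w'] / 2
    (left_blocks if centre < mid_x else right_blocks).append(box)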
This issue is closed now, as the problem of splitting a double-page image into two separate single-page images was solved by the following method.
The images were first cleaned and the region of interest was captured using Tesseract OCR.
The region of interest was then split into two pages by finding the cut-off point using gradient thresholding and a Hough line transform.
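A rough sketch of what that gutter-finding step could look like, assuming OpenCV; the thresholds, filename and the restriction to near-vertical lines near the centre are illustrative rather than the exact values used:

# Sketch of gutter detection via gradient thresholding + Hough lines,
# assuming OpenCV (cv2); thresholds and filename are illustrative.
import cv2
import numpy as np

roi = cv2.imread("roi.jpg", cv2.IMREAD_GRAYSCALE)

# Gradient thresholding: keep only strong horizontal intensity changes.
grad_x = cv2.Sobel(roi, cv2.CV_64F, 1, 0, ksize=3)
grad = cv2.convertScaleAbs(grad_x)
_, binary = cv2.threshold(grad, 50, 255, cv2.THRESH_BINARY)

# Hough line transform: look for long, near-vertical lines (the gutter).
lines = cv2.HoughLinesP(binary, 1, np.pi / 180, threshold=100,
                        minLineLength=roi.shape[0] // 2, maxLineGap=20)

width = roi.shape[1]
candidates = []
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        # Keep lines that are roughly vertical and near the centre.
        if abs(x1 - x2) < 10 and width // 3 < x1 < 2 * width // 3:
            candidates.append((x1 + x2) // 2)

split_x = int(np.median(candidates)) if candidates else width // 2
left_page, right_page = roi[:, :split_x], roi[:, split_x:]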
Given that the catalogue images shared by Marie-Hellen contain two pages in a single image, the model will not be able to read and transcribe them accurately. Furthermore, batch inferencing would not be possible.
It is best to look into adding a feature in the pipeline that splits these images into two separate images.