Layout-Parser / layout-parser

A Unified Toolkit for Deep Learning Based Document Image Analysis
https://layout-parser.github.io/
Apache License 2.0

Layout Parser text boxes not properly aligned causing incorrect sorting of text boxes #50

Open farazk86 opened 3 years ago

farazk86 commented 3 years ago

Hi,

I'm using Layout Parser to run OCR on a research paper, but on almost every page of the PDF the detected text boxes are not properly aligned. For example, I input this page:

[image: input page]

perform detection using:

import layoutparser as lp

# PubLayNet Mask R-CNN model; detections below the 0.8 score threshold are dropped
model = lp.Detectron2LayoutModel('lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config',
                                 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
                                 label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"})
layout = model.detect(image)

# Show the detected layout of the input image
lp.draw_box(image, layout, box_width=3)

The detected image is shown below:

[image: detected layout]

As can be seen, the bottom-left box is not properly aligned, which causes problems for the sorting script given in the tutorial:

# sort the left and right blocks and assign id to each
w, h = image.size  # assumes a PIL image, whose .size is (width, height)

left_interval = lp.Interval(0, w/2*1.05, axis='x').put_on_canvas(image)

left_blocks = text_blocks.filter_by(left_interval, center=True)
left_blocks.sort(key = lambda b:b.coordinates[1])

right_blocks = [b for b in text_blocks if b not in left_blocks]
right_blocks.sort(key = lambda b:b.coordinates[1])

# And finally combine the two list and add the index
# according to the order
text_blocks = lp.Layout([b.set(id = idx) for idx, b in enumerate(left_blocks + right_blocks)])

# visualize the cleaned text blocks
lp.draw_box(image, text_blocks,
            box_width=3, 
            show_element_id=True)

[image: sorted text blocks with element ids]

The misaligned box is assigned index 0, which is not correct.

Is there any way to avoid this problem?

Thank you

lolipopshock commented 3 years ago

Thanks. This is more of an issue with the detection model: it is very hard for these models to produce perfect bounding boxes. I have a script that can fix this, but I cannot share it right now due to some copyright issues. It should be ready within the next few weeks, so please stay tuned.

avibagul commented 3 years ago

Hi there,

First, remove the 1.05 factor from the line below, i.e. don't multiply at all.

   left_interval = lp.Interval(0, w/2*1.05, axis='x').put_on_canvas(image)

If that does not work for you, write your own function that sorts the two lists by y1 and then appends them. This assumes your document has a two-column layout throughout, in which case two lists holding the left and right blocks should do the job.

text_blocks = lp.Layout([b.set(id = idx) for idx, b in enumerate(left_blocks + right_blocks)])

Replace left_blocks and right_blocks in that line with your own left and right lists, and ka-boom, it works.
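A rough sketch of that idea, in case it helps. To keep it runnable without layoutparser installed, blocks are plain (x1, y1, x2, y2) tuples here, which is an assumption for illustration; the function name order_two_columns is also made up. It splits at the page midline with no 1.05 factor, then sorts each column by y1.

```python
def order_two_columns(blocks, page_width):
    """Split blocks at the page midline, then read each column top to bottom.

    Each block is an (x1, y1, x2, y2) tuple (an assumption for this sketch).
    """
    mid = page_width / 2  # plain midline, no 1.05 factor
    left, right = [], []
    for b in blocks:
        center_x = (b[0] + b[2]) / 2
        (left if center_x < mid else right).append(b)
    left.sort(key=lambda b: b[1])   # sort by y1
    right.sort(key=lambda b: b[1])  # sort by y1
    return left + right


blocks = [
    (520, 100, 960, 400),  # right column, top
    (40, 500, 480, 900),   # left column, bottom
    (40, 100, 480, 450),   # left column, top
    (530, 450, 960, 900),  # right column, bottom
]
ordered = order_two_columns(blocks, page_width=1000)
for idx, b in enumerate(ordered):
    print(idx, b)
```

With real layoutparser blocks, the same comparison would use b.coordinates instead of the tuple itself.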

Happy coding :)

SAIVENKATARAJU commented 2 years ago

I have another approach to separating the layouts: to split the left and right columns, we can simply use the k-means clustering algorithm with the number of clusters set to 2.
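A minimal sketch of the clustering idea, under a few assumptions: blocks are plain (x1, y1, x2, y2) tuples, the clustered feature is each block's horizontal center, and the k-means is a tiny hand-written 1-D version (the names two_means_1d and split_columns are made up for this example). scikit-learn's KMeans with n_clusters=2 could be swapped in instead.

```python
def two_means_1d(values, iterations=20):
    """Tiny 1-D k-means with k=2. Returns one label per value: 0 = lower cluster."""
    if not values:
        return []
    c0, c1 = min(values), max(values)  # start the centroids at the extremes
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to the nearer centroid.
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        # Update step: move each centroid to the mean of its group.
        new_c0 = sum(g0) / len(g0) if g0 else c0
        new_c1 = sum(g1) / len(g1) if g1 else c1
        if (new_c0, new_c1) == (c0, c1):
            break  # converged
        c0, c1 = new_c0, new_c1
    return labels


def split_columns(blocks):
    """Cluster blocks by horizontal center, then sort each column by y1."""
    centers = [(b[0] + b[2]) / 2 for b in blocks]
    labels = two_means_1d(centers)
    left = sorted((b for b, lab in zip(blocks, labels) if lab == 0), key=lambda b: b[1])
    right = sorted((b for b, lab in zip(blocks, labels) if lab == 1), key=lambda b: b[1])
    return left, right


blocks = [
    (520, 100, 960, 400),
    (70, 500, 500, 900),   # slightly misaligned left-column block
    (40, 100, 480, 450),
    (530, 450, 960, 900),
]
left, right = split_columns(blocks)
print(left)
print(right)
```

Because the split point comes from the data rather than from a fixed fraction of the page width, a box that is shifted a little should still land in the nearer column. This only holds for pages that really have two columns.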

talhaanwarch commented 1 year ago

If there are two columns, find the median of the first coordinate (x1) of the right-column blocks and compare it with the first coordinate of each left-column block. If a left-column block's coordinate is greater than the median, remove that block from the left column and append it to the right column. Sort both columns again, and you are done.
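Something along these lines, as a sketch only. Blocks are assumed to be plain (x1, y1, x2, y2) tuples, and the function name fix_left_column and the tolerance parameter are my additions rather than part of the suggestion above. With tolerance=0 it is the strict "greater than the median" rule.

```python
from statistics import median


def fix_left_column(left, right, tolerance=0):
    """Move left-column blocks whose x1 exceeds the right column's median x1.

    tolerance is a hypothetical slack in pixels: a block is moved when
    x1 > median - tolerance. Both columns are re-sorted by y1 afterwards.
    """
    if not right:
        return sorted(left, key=lambda b: b[1]), []
    right_x1 = median(b[0] for b in right)
    keep, moved = [], []
    for b in left:
        (moved if b[0] > right_x1 - tolerance else keep).append(b)
    new_left = sorted(keep, key=lambda b: b[1])
    new_right = sorted(right + moved, key=lambda b: b[1])
    return new_left, new_right


left = [(40, 100, 480, 450), (40, 500, 480, 900), (515, 20, 960, 90)]
right = [(520, 100, 960, 400), (530, 450, 960, 900)]
new_left, new_right = fix_left_column(left, right, tolerance=20)
print(new_left)
print(new_right)
```

Here the block starting at x1 = 515 was wrongly placed in the left column; the right column's median x1 is 525, so with a 20-pixel tolerance it gets moved over.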