Academic-Hammer / SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition
https://arxiv.org/pdf/1908.04729.pdf
MIT License

How to match the chunk coordinates with the img? #1

Closed chixma closed 5 years ago

CZWin32768 commented 5 years ago

We matched the extracted chunks and the labeled cells by comparing their contents. Although we cannot perfectly match them all, we can still obtain a good alignment.

For example, if chunk1 and chunk2 are aligned to cell_a and cell_b, and the relation between cell_a and cell_b is vertical, then we know the relation between chunk1 and chunk2 is also vertical.
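The idea above can be sketched as follows. This is only an illustration, not the matching code used to build the dataset: it assumes chunks and cells are given as plain lists of strings, that "comparing contents" means an exact match after whitespace normalization, and that cell relations are (cell_a, cell_b, relation) tuples. The function names are hypothetical.

```python
from collections import defaultdict

def normalize(text):
    # Collapse runs of whitespace so "Acc " and "Acc" compare equal.
    return " ".join(text.split())

def align_chunks_to_cells(chunk_texts, cell_texts):
    """Map chunk index -> cell index by exact match on normalized text.

    Each cell is used at most once; chunks without a match are skipped.
    """
    unused = defaultdict(list)
    for cell_idx, text in enumerate(cell_texts):
        unused[normalize(text)].append(cell_idx)
    alignment = {}
    for chunk_idx, text in enumerate(chunk_texts):
        candidates = unused.get(normalize(text))
        if candidates:
            alignment[chunk_idx] = candidates.pop(0)
    return alignment

def transfer_relations(alignment, cell_relations):
    """Copy each cell-level relation to the chunks aligned to those cells."""
    cell_to_chunk = {cell: chunk for chunk, cell in alignment.items()}
    chunk_relations = []
    for cell_a, cell_b, relation in cell_relations:
        if cell_a in cell_to_chunk and cell_b in cell_to_chunk:
            chunk_relations.append(
                (cell_to_chunk[cell_a], cell_to_chunk[cell_b], relation))
    return chunk_relations
```

Relations that involve an unmatched cell are simply dropped, which is consistent with the alignment being good but not perfect.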

We will provide a more detailed description in README ASAP.

chixma commented 5 years ago

Thank you for your reply. I notice that the ‘pos’ coordinates in the chunk file correspond to the PDF file, not the image file. Is there a way to get the text bounding-box coordinates in the image file? It seems that you mentioned a matching method in the paper.

smilewsw commented 5 years ago

Have you solved this problem? I have the same issue.

DaDaMrX commented 5 years ago

We are sorry that we didn't preserve the alignment info between chunks and images, but you can align them with the following steps:

Step 1. Convert the PDF to PNG with the pdftocairo tool

In terminal:

pdftocairo -gray -png -singlefile <pdf_path> <png_path>
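If you want to run this step from Python over many files, a minimal sketch could look like the one below. It assumes pdftocairo is on your PATH. Note that with -png, pdftocairo treats the output argument as a root name and appends ".png" itself, so the sketch strips that extension before calling the tool. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path

def build_pdftocairo_cmd(pdf_path, png_path):
    """Build the same command line as above for one PDF file."""
    out_root = Path(png_path)
    # pdftocairo adds ".png" to the output root, so drop it if present.
    if out_root.suffix.lower() == ".png":
        out_root = out_root.with_suffix("")
    return ["pdftocairo", "-gray", "-png", "-singlefile",
            str(pdf_path), str(out_root)]

def pdf_to_png(pdf_path, png_path):
    # check=True raises CalledProcessError if pdftocairo fails.
    subprocess.run(build_pdftocairo_cmd(pdf_path, png_path), check=True)
```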

Step 2. Crop the Table Area in the PNG image

Python code:

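import cv2

def crop(img_path):
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    # Invert and binarize so that ink becomes non-zero. With THRESH_OTSU the
    # threshold is chosen automatically and the value 200 is ignored.
    _, binary = cv2.threshold(
        img, 200, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Bounding box of all ink pixels, padded by a 3-pixel margin.
    coords = cv2.findNonZero(binary)
    x, y, w, h = cv2.boundingRect(coords)
    left, top, right, bottom = x - 3, y - 3, x + w + 3, y + h + 3
    # Clamp to the image so a negative index cannot wrap around when slicing.
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, img.shape[1]), min(bottom, img.shape[0])
    rect = img[top:bottom, left:right]
    cv2.imwrite(img_path, rect)  # overwrites the full-page PNG with the crop
    return left, top, right, bottom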

The function crop returns the coordinates of the Table Area. left, top, right, and bottom correspond to l, t, r, and b in the figure below.

Step 3. Transform coords from PDF to Table Area

The size of an A4 page in PDF is 595 × 842 (points), and the origin of the coordinate system is at the bottom-left.

The size of the PNG image produced by pdftocairo (at its default resolution of 150 DPI) is 1241 × 1754 (pixels), and the origin is at the top-left.

So, we can convert the coordinates with the following function:

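# left and top are the crop offsets returned by the crop function
def convert(x, y, left, top):
    ratio = 1754 / 842  # pixels per PDF point
    new_x = x * ratio - left
    # Flip the y-axis (PDF origin is bottom-left, image origin is top-left),
    # then shift by the crop offset.
    new_y = 1754 - y * ratio - top
    return new_x, new_y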
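A chunk carries a box rather than a single point, so the same conversion has to be applied to two corners. Here is a minimal sketch, assuming ‘pos’ is stored as [x1, x2, y1, y2] in PDF points (this ordering is an assumption, so please check it against your chunk files) and that left and top come from crop. The helper name and its keyword defaults are hypothetical.

```python
def pdf_box_to_image_box(pos, left, top,
                         page_height_pt=842, page_height_px=1754):
    """Convert a PDF-space box to (x_min, y_min, x_max, y_max) in the crop."""
    x1, x2, y1, y2 = pos  # assumed order: [x1, x2, y1, y2]
    ratio = page_height_px / page_height_pt
    xs = [x * ratio - left for x in (x1, x2)]
    # Flipping the y-axis swaps which edge is the top, so sort with min/max.
    ys = [page_height_px - y * ratio - top for y in (y1, y2)]
    return min(xs), min(ys), max(xs), max(ys)
```

Because the y-axis is flipped, the larger PDF y value becomes the smaller image y value, which is why the result is reordered with min and max instead of being returned in input order.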

The following figure may help you understand the steps better.

huntzhan commented 4 years ago

Could you help explain the magic number 3 used in x - 3, y - 3, x + w + 3, y + h + 3 ?

CZWin32768 commented 4 years ago

We leave a margin of 3 pixels to prevent the image edges from overlapping the table edges.

huntzhan commented 4 years ago

thx