Changing the target language for OCR'ing

anhhaibkhn commented 2 years ago

Thank you for sharing the project. Is it possible to adapt your model to other languages?

I could extract the cell coordinates from the table, but I am having difficulty reconstructing the table, especially when it has merged cells. For example: numerous_merged_cells_table

Could you explain further the idea of how to reconstruct the table?

Thanks for your time.

MrZilinXiao commented 2 years ago

Hi @anhhaibkhn, you can find the reconstruction logic here: https://github.com/MrZilinXiao/Hyper-Table-OCR/blob/1432b1f7626cc96f868b329b7c6d1e58e49f813e/table/__init__.py#L193. Feel free to contact me if the code gives you trouble.

anhhaibkhn commented 2 years ago

Hi @MrZilinXiao , Thank you for your reply.

Is it possible to adapt your model to non-English languages (such as Japanese)? I did not grasp the intuition behind your reconstruction algorithm. Could you explain the input parameters (delta_y=10, delta_x=10, overlap_thr=0.3) and the final output in more detail?

Thanks in advance.

MrZilinXiao commented 2 years ago

In fact, the structure reconstruction model (a UNet in this project) is independent of the language of your input, since it only produces the structure of the given table. If you'd like to reproduce the demo in our GIF, all you need to do is replace the OCR module with one adapted to your target language. An OCRHandler class is here: https://github.com/MrZilinXiao/Hyper-Table-OCR/blob/main/ocr/__init__.py. You can simply subclass it, naming it JapaneseOCRHandler or something similar.
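A minimal sketch of the subclassing idea follows. The base class here is only a stand-in: the method name get_result, the (box, text, confidence) result format, and the injected backend callable are all assumptions made for illustration, not the project's actual interface, so check ocr/__init__.py for the real method to override.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Tuple

# Assumed result format: (xyxyxyxy box, recognized text, confidence).
OCRResult = Tuple[List[float], str, float]


class OCRHandler(ABC):
    """Hypothetical stand-in for the project's OCRHandler base class."""

    @abstractmethod
    def get_result(self, image) -> List[OCRResult]:
        """Return one (box, text, confidence) triple per detected text block."""


class JapaneseOCRHandler(OCRHandler):
    """Delegates recognition to any backend that can read Japanese."""

    def __init__(self, backend: Callable[[object], List[OCRResult]]):
        self.backend = backend

    def get_result(self, image) -> List[OCRResult]:
        # Drop blocks whose text is empty or whitespace only.
        return [r for r in self.backend(image) if r[1].strip()]


def fake_backend(image) -> List[OCRResult]:
    # Placeholder for a real Japanese OCR engine; returns fixed data.
    return [
        ([0, 0, 50, 0, 50, 20, 0, 20], "氏名", 0.98),
        ([60, 0, 90, 0, 90, 20, 60, 20], "  ", 0.40),
    ]


handler = JapaneseOCRHandler(fake_backend)
results = handler.get_result(image=None)
```

Because the table structure comes from the UNet, only this handler would need to change when switching languages; the real backend would replace fake_backend.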

Coming back to the code, here is the meaning of the default parameters:

delta_y: the distance in pixels between two horizontal lines beyond which the algorithm considers a new row to begin.
delta_x: the distance in pixels between two vertical lines beyond which the algorithm considers a new column to begin.
overlap_thr: the algorithm sometimes produces boxes that overlap with others, which results in an incorrect reconstructed structure. At the end of reconstruction, boxes whose overlap ratio with another box exceeds overlap_thr are therefore removed.

The output is self.rows: List[List[TableCell]], which is a 2D table reconstruction result.
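To make the parameters concrete, here is a simplified sketch, not the project's code. It assumes axis-aligned (x1, y1, x2, y2) boxes instead of xyxyxyxy polygons, and the helper names group_rows, overlap_ratio and drop_overlapping are hypothetical. It groups boxes into rows with delta_y and then removes boxes that overlap an already kept box in the same row by more than overlap_thr.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def group_rows(boxes: List[Box], delta_y: float = 10) -> List[List[Box]]:
    """Start a new row when a top edge is more than delta_y below the row's first box."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(box[1] - rows[-1][0][1]) <= delta_y:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order cells left to right inside each row.
    return [sorted(row, key=lambda b: b[0]) for row in rows]


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of box a."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / ((a[2] - a[0]) * (a[3] - a[1]))


def drop_overlapping(row: List[Box], overlap_thr: float = 0.3) -> List[Box]:
    """Keep a box only if it overlaps no already kept box by more than overlap_thr."""
    kept: List[Box] = []
    for box in row:
        if all(overlap_ratio(box, k) <= overlap_thr for k in kept):
            kept.append(box)
    return kept


boxes = [
    (0, 0, 100, 30), (100, 3, 200, 30),    # first row, slightly misaligned
    (0, 30, 100, 60), (100, 31, 200, 60),  # second row
    (10, 2, 100, 30),                      # spurious box inside the first cell
]
# Filtering only within each row is a simplification made for this sketch.
rows = [drop_overlapping(row) for row in group_rows(boxes)]
```

Here rows plays the role of the 2D result: a list of rows, each a left-to-right list of cells. The same grouping idea would apply along the x axis with delta_x.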

BTW, be aware of this issue: https://github.com/MrZilinXiao/Hyper-Table-OCR/issues/2. (Use Google Translate if the Chinese is a problem.)

anhhaibkhn commented 2 years ago

Hi @MrZilinXiao ! Thank you very much for your detailed and instant reply.

From what I understand (after running the linked issue through Google Translate), this project currently only supports cell merge detection in the row direction. I am sorry, but I haven't fully understood how the reconstruction algorithm works, for example self.rows: np.ndarray[List[TableCell]] = np.array(self.rows) and the attributes in the TableCell constructor. Could you perhaps give a simple table example so that I can quickly see how it works?

I will also try to adapt it to the Japanese language in the OCR settings module and let you know how it goes.

MrZilinXiao commented 2 years ago

Yes, this project currently only supports cell merge detection in the row direction, so it has difficulty with the example image you provided in the description of this issue, since that table has merged cells in both the row and column directions.

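from typing import List, Union

import numpy as np
from shapely.geometry import Polygon

class OCRBlock(object):
    def __init__(self, coord, content, conf=-1.0):
        self.coord: np.ndarray = coord  # xyxyxyxy
        self.conf = conf
        assert len(coord) == 8, "xyxyxyxy not fit for OCRBlock!"
        # Quadrilateral built from the four (x, y) corner points
        self.shape = Polygon([coord[0:2], coord[2:4], coord[4:6], coord[6:]])
        self.ocr_content: Union[List[str], str] = content

class TableCell(OCRBlock):
    def __init__(self, coord):
        super(TableCell, self).__init__(coord, [])
        self.matched = False
        self.row_range = [-1, -1]  # placeholder until reconstruction sets it
        self.col_range = [-1, -1]  # placeholder until reconstruction sets it

    @property
    def upper_y(self):
        # Mean y of the first and second corner points
        return self.coord[[1, 3]].mean()

    @property
    def left_x(self):
        # Mean x of the first and fourth corner points
        return self.coord[[0, 6]].mean()

    @property
    def right_x(self):
        # Mean x of the second and third corner points
        return self.coord[[2, 4]].mean()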

TableCell subclasses OCRBlock. Its attributes are as follows:

coord: coordinates of a cell, in the format of xyxyxyxy
conf: confidence
shape: a Polygon object supported by shapely.geometry, used to compute overlap area
ocr_content: the content that OCRHandler would fill in
matched: whether the TableCell has been matched to an OCRBlock

row_range and col_range might be difficult to understand, so I drew a picture by hand (all indexes are zero-based):

Consider a table containing only row-merged cells. The row_range of cells A, B and C is [0, 0], [0, 0] and [2, 2] respectively, and their col_range is [0, 0], [1, 2] and [1, 2]. The dotted line is drawn only for clarity; it does not exist in the table.
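The ranges can also be reproduced with a small sketch. The grid below (three columns, three rows) and the cell coordinates are one assumed layout that is consistent with the values quoted above, not necessarily the one in the picture. The helpers edges, snap and index_range are hypothetical and are not the project's reconstruction code. Corner points are assumed to run clockwise from the top-left.

```python
from typing import List, Sequence


def edges(coord: Sequence[float]):
    """Return (left_x, right_x, upper_y, lower_y) of an xyxyxyxy box."""
    xs, ys = coord[0::2], coord[1::2]
    left_x = (xs[0] + xs[3]) / 2   # same points as TableCell.left_x
    right_x = (xs[1] + xs[2]) / 2  # same points as TableCell.right_x
    upper_y = (ys[0] + ys[1]) / 2  # same points as TableCell.upper_y
    lower_y = (ys[2] + ys[3]) / 2
    return left_x, right_x, upper_y, lower_y


def snap(value: float, lines: Sequence[float], delta: float) -> int:
    """Index of the first grid line within delta pixels of value."""
    for i, line in enumerate(lines):
        if abs(value - line) <= delta:
            return i
    raise ValueError(f"{value} is not within {delta}px of any grid line")


def index_range(start: float, end: float, lines: Sequence[float],
                delta: float = 10) -> List[int]:
    """Inclusive [first, last] index of the grid slots between start and end."""
    return [snap(start, lines, delta), snap(end, lines, delta) - 1]


col_lines = [0, 100, 200, 300]  # assumed x positions of the vertical lines
row_lines = [0, 40, 80, 120]    # assumed y positions of the horizontal lines

cells = {
    "A": [2, 1, 99, 0, 101, 41, 0, 39],         # slightly skewed single cell
    "B": [100, 0, 300, 0, 300, 40, 100, 40],    # spans columns 1 and 2
    "C": [100, 80, 300, 80, 300, 120, 100, 120],
}

ranges = {}
for name, coord in cells.items():
    left_x, right_x, upper_y, lower_y = edges(coord)
    ranges[name] = {
        "row_range": index_range(upper_y, lower_y, row_lines),
        "col_range": index_range(left_x, right_x, col_lines),
    }
```

With this layout, B and C each cover two column slots, which is why their col_range is [1, 2] while A stays at [0, 0].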

Hope this solves your problem. Sorry that I don't have a running example with debug info, since I have already shifted my research interest away from Table OCR.

MrZilinXiao / Hyper-Table-OCR

Changing the target language for OCR'ing #12