Hyper-Table-OCR

A carefully-designed OCR pipeline for universal boarded table recognition and reconstruction.

This pipeline covers image preprocessing, table detection(optional), text OCR, table cell extraction, table reconstruction.

Are you seeking ideas for your own work? Visit my blog post on Hyper-Table-OCR to see more!

Update on 2021-08-20: Happy to see that Baidu has released their PP-Structure, which provides higher robustness due to its DL-driven structure prediction feature, instead of simple matching in our work.

Demo

gif demo

Demo Video (In English): YouTube

Demo Video (In Chinese): Bilibili

Features

Flexible modular architecture: by deriving from predefined abstract class, any module of this pipeline can be easily swapped to your preferred one. See the following Want to contribute? part!
A simple yet highly legible web interface.
A table reconstruction strategy based simply on coordinates of each cell, including identifying merged cell row & building table structure.
More to explore...

Getting Started

Clone this repo

git clone https://github.com/MrZilinXiao/Hyper-Table-Recognition
cd Hyper-Table-Recognition

Download weights

Download from here: GoogleDrive

MD5: (004fabb8f6112d6d43457c681b435631 models.zip)

Unzip it and make sure the directory layout matchs:

# ~/Hyper-Table-Recognition$ tree -L 1
.
├── models
├── app.py
├── config.yml
├── ...

Install Dependencies

This project is developed and tested on:

Ubuntu 18.04
RTX 3070 with Driver 455.45.01 & CUDA 11.1 & cuDNN 8.0.4
Python 3.8.3
PyTorch 1.7.0+cu110
Tensorflow 2.5.0
PaddlePaddle 2.0.0-rc1
mmdetection 2.7.0
onnxruntime-gpu 1.6.0

An NVIDIA GPU device is compulsory for reasonable inference duration, while GPU with less than 6GB VRAM may experience Out of Memory exception when loading multiple models. You may comment some models in web/__init__.py if experiencing such situation.

No version-specific framework feature is used in this project, so this means you could still enjoy it with lower versions of these frameworks. However, at this time(19th Dec, 2020), users with RTX 3000 Series device may have no access to compiled binary of Tensorflow, onnxruntime-gpu, mmdetection, PaddlePaddle via pip or conda.

Some building tutorials for Ubuntu are as follows:

Tensorflow: https://gist.github.com/kmhofmann/e368a2ebba05f807fa1a90b3bf9a1e03

PaddlePaddle: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/2.0-rc1/install/compile/linux-compile.html

mmdetection: https://mmdetection.readthedocs.io/en/latest/get_started.html#installation

onnxruntime-gpu: https://github.com/microsoft/onnxruntime/blob/master/BUILD.md

Confirm all deep learning frameworks installation via:

python -c "import tensorflow as tf; print(tf.__version__); import torch; print(torch.__version__); import paddle; print(paddle.__version__); import onnxruntime as rt; print(rt.__version__); import mmdet; print(mmdet.__version__)"

Then install other necessary libraries via:

pip install -r requirements.txt

Enjoy!

python app.py

Visit http://127.0.0.1:5000 to see the main page!

Performance

Inference time consumption is highly related with following factors:

Complexity of table structure
Number of OCR blocks
Resolution of selected image

A typical inference time consumption is shown in Demo Video.

Want to contribute?

Contribute a new cell extractor

In boardered/extractor.py, we define a TraditionalExtractor based on traditional computer vision techniques and a UNetExtractor based on UNet pixel-level sematic segmentation model. Feel free to derive from the following abstract class:

class CellExtractor(ABC):
    """
    A unified interface for boardered extractor.
    OpenCV & UNet Extractor can derive from this interface.
    """

    def __init__(self):
        pass

    def get_cells(self, ori_img, table_coords) -> List[np.ndarray]:
        """
        :param ori_img: original image
        :param table_coords: List[np.ndarray], xyxy coord of each table
        :return: List[np.ndarray], [[xyxyxyxy(cell1), xyxyxyxy(cell2)](table1), ...]
        """
        pass

Contribute a new OCR Module

Located in ocr/__init__.py, you should build a custom OCR handler deriving from OCRHandler.

class OCRHandler(metaclass=abc.ABCMeta):
    """
    Handler for OCR Support
    An abstract class, any OCR implementations may derive from it
    """

    def __init__(self, *kw, **kwargs):
        pass

    def get_result(self, ori_img):
        """
        Interface for OCR inference
        :param ori_img: np.ndarray
        :return: dict, in following format:
        {'sentences': [['麦格尔特杯表格OCR测试表格2', [[85.0, 10.0], [573.0, 30.0], [572.0, 54.0], [84.0, 33.0]], 0.9],...]}
        """
        pass

Contribute to the process pipeline

WebHandler.pipeline() in web/__init__.py

Future Plans

[ ] Speed up inference via async-processing on dual GPUs.

Congratulations! This project earns a GRAND PRIZE(2 out of 72 participators) of the aforementioned competition!

Acknowledgement

PaddleOCR: Multilingual, awesome, leading, and practical OCR tools supported by Baidu.
ChineseOCR_lite: Super light OCR inference tool kit.
CascadeTabNet: An automatic table recognition method for interpretation of tabular data in document images.
pytorch-hed: An unofficial implementation of Holistically-Nested Edge Detection using PyTorch.
table-detect: Excellent work providing us with the U-Net code and pretrained weight.

MrZilinXiao / Hyper-Table-OCR

readme