felixdittrich92/OnnxTR

Warning: Please note that this is a wrapper around the doctr library that provides an ONNX pipeline for docTR. For feature requests that are not directly related to the ONNX pipeline, please refer to the base project.

Optical Character Recognition made seamless & accessible to anyone, powered by ONNX

What you can expect from this repository:

efficient ways to parse textual information (localize and identify each word) from your documents
an ONNX pipeline for docTR, a wrapper around the doctr library with no PyTorch or TensorFlow dependencies
a more lightweight package with lower inference latency and fewer required resources
8-bit quantized models for faster inference on CPU


Installation

Prerequisites

Python 3.9 (or higher) and pip are required to install OnnxTR.

Latest release

NOTE:

For GPU support, please take a look at the ONNX Runtime documentation. The execution providers currently supported by default are CPU and CUDA.

Prerequisites: CUDA and cuDNN need to be installed beforehand; see the ONNX Runtime version table for compatible versions.

You can then install the latest release of the package from PyPI as follows:

pip install "onnxtr[cpu]"
# with gpu support
pip install "onnxtr[gpu]"
# with HTML support
pip install "onnxtr[html]"
# with support for visualization
pip install "onnxtr[viz]"
# with support for all dependencies
pip install "onnxtr[html, gpu, viz]"
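
After installing, it can help to confirm which execution providers ONNX Runtime actually reports on the machine. The following is a minimal sketch rather than part of OnnxTR: the helper name `available_providers` is hypothetical, and it relies only on `onnxruntime.get_available_providers()`. It returns an empty list when ONNX Runtime cannot be found.

```python
import importlib.util


def available_providers() -> list[str]:
    """Return the execution providers ONNX Runtime reports, or [] if it is not installed."""
    if importlib.util.find_spec("onnxruntime") is None:
        return []
    import onnxruntime as ort  # imported lazily so the check above can fail gracefully

    return list(ort.get_available_providers())


providers = available_providers()
if not providers:
    print("onnxruntime is not installed")
elif "CUDAExecutionProvider" in providers:
    print("GPU inference should be possible:", providers)
else:
    print("CPU-only providers found:", providers)
```

If `CUDAExecutionProvider` is missing after installing the GPU extra, the CUDA / cuDNN prerequisites are the first thing to check.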

Reading files

Documents can be read from PDFs, images, webpages, or sets of page images using the following code snippet:

from onnxtr.io import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Image
single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
# Webpage (requires `weasyprint` to be installed)
webpage_doc = DocumentFile.from_url("https://www.yoursite.com")
# Multiple page images
multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])
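
When inputs arrive as a mix of paths and URLs, the choice between the constructors above can be automated. This is a sketch under stated assumptions: `pick_loader`, `load_document`, and the `IMAGE_SUFFIXES` set are hypothetical helpers for illustration, and only the `DocumentFile.from_pdf`, `from_images`, and `from_url` constructors come from the snippet above.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def pick_loader(source: str) -> str:
    """Map a path or URL to the name of a DocumentFile constructor."""
    if urlparse(source).scheme in ("http", "https"):
        return "from_url"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "from_images"
    raise ValueError(f"Unsupported input: {source!r}")


def load_document(source: str):
    """Load a document with the constructor chosen by pick_loader."""
    from onnxtr.io import DocumentFile  # lazy import; requires onnxtr to be installed

    return getattr(DocumentFile, pick_loader(source))(source)


for candidate in ("path/to/your/doc.pdf", "path/to/your/img.jpg", "https://www.yoursite.com"):
    print(candidate, "->", pick_loader(candidate))
```

Only `pick_loader` runs without OnnxTR installed; `load_document` defers the import until it is called.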

Putting it together

Let's use the default ocr_predictor model for an example:

from onnxtr.io import DocumentFile
from onnxtr.models import ocr_predictor, EngineConfig

model = ocr_predictor(
    det_arch='fast_base',  # detection architecture
    reco_arch='vitstr_base',  # recognition architecture
    det_bs=2, # detection batch size
    reco_bs=512, # recognition batch size
    assume_straight_pages=True,  # set to `False` if the pages are not straight (rotation, perspective, etc.) (default: True)
    straighten_pages=False,  # set to `True` if the pages should be straightened before final processing (default: False)
    # Preprocessing related parameters
    preserve_aspect_ratio=True,  # set to `False` if the aspect ratio should not be preserved (default: True)
    symmetric_pad=True,  # set to `False` to disable symmetric padding (default: True)
    # Additional parameters - meta information
    detect_orientation=False,  # set to `True` if the orientation of the pages should be detected (default: False)
    detect_language=False, # set to `True` if the language of the pages should be detected (default: False)
    # DocumentBuilder specific parameters
    resolve_lines=True,  # whether words should be automatically grouped into lines (default: True)
    resolve_blocks=False,  # whether lines should be automatically grouped into blocks (default: False)
    paragraph_break=0.035,  # relative length of the minimum space separating paragraphs (default: 0.035)
    # OnnxTR specific parameters
    # NOTE: 8-Bit quantized models are not available for FAST detection models and can in general lead to poorer accuracy
    load_in_8_bit=False,  # set to `True` to load 8-bit quantized models instead of the full precision ones (default: False)
    # Advanced engine configuration options
    det_engine_cfg=EngineConfig(),  # detection model engine configuration (default: internal predefined configuration)
    reco_engine_cfg=EngineConfig(),  # recognition model engine configuration (default: internal predefined configuration)
    clf_engine_cfg=EngineConfig(),  # classification (orientation) model engine configuration (default: internal predefined configuration)
)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)
# Display the result (requires matplotlib & mplcursors to be installed)
result.show()

Visualization sample

Or even rebuild the original document from its predictions:

import matplotlib.pyplot as plt

synthetic_pages = result.synthesize()
plt.imshow(synthetic_pages[0])
plt.axis('off')
plt.show()

Synthesis sample

The ocr_predictor returns a Document object with a nested structure (with Page, Block, Line, Word, Artefact). To get a better understanding of the document model, check out the documentation.
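
The nesting can be walked with plain loops. The sketch below assumes the attribute names of the docTR document model (`pages`, `blocks`, `lines`, `words`, and `value` / `confidence` on each word); the helper `iter_words` is hypothetical, and a stand-in object replaces a real prediction so the example runs without a model.

```python
from types import SimpleNamespace


def iter_words(document):
    """Yield (page_index, word) for every word in a Document-like object."""
    for page_index, page in enumerate(document.pages):
        for block in page.blocks:
            for line in block.lines:
                for word in line.words:
                    yield page_index, word


# Stand-in for a real `result`; it only mimics the assumed attribute names.
words = [
    SimpleNamespace(value="Hello", confidence=0.99),
    SimpleNamespace(value="world", confidence=0.42),
]
line = SimpleNamespace(words=words)
block = SimpleNamespace(lines=[line])
demo = SimpleNamespace(pages=[SimpleNamespace(blocks=[block])])

for page_index, word in iter_words(demo):
    print(page_index, word.value, word.confidence)
```

With a real prediction, `iter_words(result)` would be called in the same way.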

You can also export the result as a nested dict (more appropriate for JSON), render it as plain text, or export it as XML (hOCR format):

json_output = result.export()  # nested dict
text_output = result.render()  # human-readable text
xml_output = result.export_as_xml()  # hocr format
for output in xml_output:
    xml_bytes_string = output[0]
    xml_element = output[1]
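
The nested dict is convenient for post-processing, for example flagging words the recognizer was unsure about. This is a sketch under assumptions: the key names (`pages`, `blocks`, `lines`, `words`, `value`, `confidence`) follow the docTR export layout as far as it is known here, the helper `low_confidence_words` is hypothetical, and the sample dict is hand-written rather than real model output.

```python
import json


def low_confidence_words(exported: dict, threshold: float = 0.5) -> list[str]:
    """Collect the values of all words whose confidence is below `threshold`."""
    flagged = []
    for page in exported.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    if word.get("confidence", 0.0) < threshold:
                        flagged.append(word["value"])
    return flagged


# Hand-written sample in the assumed shape of `result.export()`.
sample = {"pages": [{"blocks": [{"lines": [{"words": [
    {"value": "Invoice", "confidence": 0.98},
    {"value": "t0tal", "confidence": 0.31},
]}]}]}]}

print(low_confidence_words(sample))
print(json.dumps(sample, indent=2))  # the sample is plain data, so it serializes directly
```

The threshold of 0.5 is arbitrary; a sensible value depends on the documents and the recognition model.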

Advanced engine configuration options

You can also define advanced engine configurations for the models / predictors:

```python
from onnxruntime import SessionOptions

from onnxtr.models import ocr_predictor, EngineConfig

# For configuration options see: https://onnxruntime.ai/docs/api/python/api_summary.html#sessionoptions
general_options = SessionOptions()
general_options.enable_cpu_mem_arena = False

# NOTE: The following forces inference to run only on the GPU; if no GPU is available, it will raise an error
# List of strings e.g. ["CUDAExecutionProvider", "CPUExecutionProvider"] or a list of tuples with the provider and its options e.g.
# [("CUDAExecutionProvider", {"device_id": 0}), ("CPUExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"})]
# For available providers see: https://onnxruntime.ai/docs/execution-providers/
providers = [("CUDAExecutionProvider", {"device_id": 0, "cudnn_conv_algo_search": "DEFAULT"})]

engine_config = EngineConfig(
    session_options=general_options,
    providers=providers
)
# We use the default predictor with the custom engine configuration
# NOTE: You can define different engine configurations for detection, recognition and classification depending on your needs
predictor = ocr_predictor(
    det_engine_cfg=engine_config,
    reco_engine_cfg=engine_config,
    clf_engine_cfg=engine_config
)
```
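
If the same script has to run on machines with and without a GPU, the provider list can be built conditionally instead of hard-coding CUDA. This is a minimal sketch: `build_providers` is a hypothetical helper, and the provider names and option keys are the ones quoted in the comments of the configuration example.

```python
def build_providers(available: list[str], device_id: int = 0) -> list[tuple[str, dict]]:
    """Prefer CUDA when it is reported as available and always keep CPU as a fallback."""
    providers = []
    if "CUDAExecutionProvider" in available:
        providers.append(("CUDAExecutionProvider", {"device_id": device_id}))
    providers.append(("CPUExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"}))
    return providers


# In practice `available` would come from onnxruntime.get_available_providers().
print(build_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(build_providers(["CPUExecutionProvider"]))
```

The returned list can be passed as `providers=` to `EngineConfig`. Providers are tried in order, so CUDA is listed first.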

Loading custom exported models

You can also load custom models exported from docTR. For details on exporting, please take a look at the doctr documentation.

from onnxtr.models import ocr_predictor, linknet_resnet18, parseq

reco_model = parseq("path_to_custom_model.onnx", vocab="ABC")
det_model = linknet_resnet18("path_to_custom_model.onnx")
model = ocr_predictor(det_arch=det_model, reco_arch=reco_model)

Model architectures

Credits where it's due: this repository provides ONNX models for the following architectures, converted from the docTR models:

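Text Detection

DBNet (db_resnet34, db_resnet50, db_mobilenet_v3_large)
LinkNet (linknet_resnet18, linknet_resnet34, linknet_resnet50)
FAST (fast_tiny, fast_small, fast_base)

Text Recognition

CRNN (crnn_vgg16_bn, crnn_mobilenet_v3_small, crnn_mobilenet_v3_large)
SAR (sar_resnet31)
MASTER (master)
ViTSTR (vitstr_small, vitstr_base)
PARSeq (parseq)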

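from onnxtr.models import ocr_predictor

predictor = ocr_predictor()
# Returns a dict of the available detection and recognition architectures, for example:
predictor.list_archs()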
{
    'detection archs':
        [
            'db_resnet34',
            'db_resnet50',
            'db_mobilenet_v3_large',
            'linknet_resnet18',
            'linknet_resnet34',
            'linknet_resnet50',
            'fast_tiny',  # No 8-bit support
            'fast_small',  # No 8-bit support
            'fast_base'  # No 8-bit support
        ],
    'recognition archs':
        [
            'crnn_vgg16_bn',
            'crnn_mobilenet_v3_small',
            'crnn_mobilenet_v3_large',
            'sar_resnet31',
            'master',
            'vitstr_small',
            'vitstr_base',
            'parseq'
        ]
}
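
Because the FAST detection models have no 8-bit variant, it can be useful to derive `load_in_8_bit` from the chosen detection architecture. The sketch below hard-codes the detection list shown above; `can_load_in_8_bit` is a hypothetical helper, not an OnnxTR function.

```python
# Detection architectures copied from the listing above.
DETECTION_ARCHS = [
    "db_resnet34", "db_resnet50", "db_mobilenet_v3_large",
    "linknet_resnet18", "linknet_resnet34", "linknet_resnet50",
    "fast_tiny", "fast_small", "fast_base",
]
# Marked "No 8-bit support" in the listing.
NO_8BIT_SUPPORT = {"fast_tiny", "fast_small", "fast_base"}


def can_load_in_8_bit(det_arch: str) -> bool:
    """Return True if the detection architecture has an 8-bit quantized variant."""
    if det_arch not in DETECTION_ARCHS:
        raise ValueError(f"Unknown detection architecture: {det_arch!r}")
    return det_arch not in NO_8BIT_SUPPORT


for arch in ("db_resnet50", "fast_base"):
    print(arch, can_load_in_8_bit(arch))
```

A call such as `ocr_predictor(det_arch=arch, load_in_8_bit=can_load_in_8_bit(arch))` would then avoid requesting a quantized FAST model.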

Documentation

This repository is kept in sync with the doctr library, which provides a high-level API to perform OCR on documents, and stays up-to-date with the latest features and improvements from the base project. Please refer to the doctr documentation for more detailed information.

NOTE:

pretrained is the default in OnnxTR and is not available as a parameter.
docTR-specific environment variables need to use the `ONNXTR` prefix instead (e.g. DOCTR_CACHE_DIR -> ONNXTR_CACHE_DIR).
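
The prefix rule can be applied mechanically when porting a docTR setup. This is only a sketch: `to_onnxtr_env` is a hypothetical helper, the cache path is a placeholder, and whether a particular variable is honored should be checked against the OnnxTR source.

```python
import os


def to_onnxtr_env(name: str) -> str:
    """Rename a DOCTR_-prefixed environment variable to its ONNXTR_ counterpart."""
    prefix = "DOCTR_"
    if not name.startswith(prefix):
        raise ValueError(f"Not a docTR variable: {name!r}")
    return "ONNXTR_" + name[len(prefix):]


# Set the variable before importing onnxtr so that it is picked up on import.
# The path below is a placeholder, not a default used by the library.
os.environ[to_onnxtr_env("DOCTR_CACHE_DIR")] = "/tmp/onnxtr-cache"
print(to_onnxtr_env("DOCTR_CACHE_DIR"))
```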

Benchmarks

The CPU benchmarks were measured on an Intel i7-14700K CPU.

The GPU benchmarks were measured on an Nvidia RTX 4080 GPU.

Benchmarking was performed on the FUNSD and CORD datasets.

The docTR / OnnxTR models used for the benchmarks are fast_base (full precision) or db_resnet50 (8-bit variant) for detection and crnn_vgg16_bn for recognition.

For comparison, the smallest combination in OnnxTR (docTR), db_mobilenet_v3_large with crnn_mobilenet_v3_small, takes ~0.17s / Page on the FUNSD dataset and ~0.12s / Page on the CORD dataset in full precision.

CPU benchmarks:

Library	FUNSD (199 pages)	CORD (900 pages)
docTR (CPU) - v0.8.1	~1.29s / Page	~0.60s / Page
OnnxTR (CPU) - v0.1.2	~0.57s / Page	~0.25s / Page
OnnxTR (CPU) 8-bit - v0.1.2	~0.38s / Page	~0.14s / Page
EasyOCR (CPU) - v1.7.1	~1.96s / Page	~1.75s / Page
PyTesseract (CPU) - v0.3.10	~0.50s / Page	~0.52s / Page
Surya (line) (CPU) - v0.4.4	~48.76s / Page	~35.49s / Page
PaddleOCR (CPU) - no cls - v2.7.3	~1.27s / Page	~0.38s / Page
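
To put the CPU numbers in relative terms, the per-page times can be turned into speedup factors over docTR. The values below are copied from the CPU table. Since the table entries are approximate, the resulting factors are approximate as well.

```python
# Seconds per page as (FUNSD, CORD), copied from the CPU table.
cpu_seconds_per_page = {
    "docTR": (1.29, 0.60),
    "OnnxTR": (0.57, 0.25),
    "OnnxTR 8-bit": (0.38, 0.14),
}


def speedup(baseline: float, candidate: float) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


base_funsd, base_cord = cpu_seconds_per_page["docTR"]
for name, (funsd, cord) in cpu_seconds_per_page.items():
    if name == "docTR":
        continue
    print(f"{name}: {speedup(base_funsd, funsd):.2f}x on FUNSD, {speedup(base_cord, cord):.2f}x on CORD")
```

On these figures, full-precision OnnxTR is a bit more than twice as fast as docTR on CPU, and the 8-bit variant more than three times as fast.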

GPU benchmarks:

Library	FUNSD (199 pages)	CORD (900 pages)
docTR (GPU) - v0.8.1	~0.07s / Page	~0.05s / Page
docTR (GPU) float16 - v0.8.1	~0.06s / Page	~0.03s / Page
OnnxTR (GPU) - v0.1.2	~0.06s / Page	~0.04s / Page
EasyOCR (GPU) - v1.7.1	~0.31s / Page	~0.19s / Page
Surya (GPU) float16 - v0.4.4	~3.70s / Page	~2.81s / Page
PaddleOCR (GPU) - no cls - v2.7.3	~0.08s / Page	~0.03s / Page

Citation

If you wish to cite this work, please refer to the base project citation. Feel free to use this BibTeX reference:

@misc{doctr2021,
    title={docTR: Document Text Recognition},
    author={Mindee},
    year={2021},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/mindee/doctr}}
}

License

Distributed under the Apache 2.0 License. See LICENSE for more information.