:warning: Please note that this is a wrapper around the doctr library to provide a Onnx pipeline for docTR. For feature requests, which are not directly related to the Onnx pipeline, please refer to the base project.
Optical Character Recognition made seamless & accessible to anyone, powered by Onnx
What you can expect from this repository:
Python 3.9 (or higher) and pip are required to install OnnxTR.
You can then install the latest release of the package using pypi as follows:
NOTE:
For GPU support please take a look at: ONNX Runtime. Currently supported execution providers by default are: CPU, CUDA
pip install "onnxtr[cpu]"
# with gpu support
pip install "onnxtr[gpu]"
# with HTML support
pip install "onnxtr[html]"
# with support for visualization
pip install "onnxtr[viz]"
# with support for all dependencies
pip install "onnxtr[html, gpu, viz]"
Documents can be interpreted from PDF / Images / Webpages / Multiple page images using the following code snippet:
from onnxtr.io import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Image
single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
# Webpage (requires `weasyprint` to be installed)
webpage_doc = DocumentFile.from_url("https://www.yoursite.com")
# Multiple page images
multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])
Let's use the default ocr_predictor
model for an example:
from onnxtr.io import DocumentFile
from onnxtr.models import ocr_predictor, EngineConfig
model = ocr_predictor(
det_arch='fast_base', # detection architecture
reco_arch='vitstr_base', # recognition architecture
det_bs=2, # detection batch size
reco_bs=512, # recognition batch size
assume_straight_pages=True, # set to `False` if the pages are not straight (rotation, perspective, etc.) (default: True)
straighten_pages=False, # set to `True` if the pages should be straightened before final processing (default: False)
# Preprocessing related parameters
preserve_aspect_ratio=True, # set to `False` if the aspect ratio should not be preserved (default: True)
symmetric_pad=True, # set to `False` to disable symmetric padding (default: True)
# Additional parameters - meta information
detect_orientation=False, # set to `True` if the orientation of the pages should be detected (default: False)
detect_language=False, # set to `True` if the language of the pages should be detected (default: False)
# DocumentBuilder specific parameters
resolve_lines=True, # whether words should be automatically grouped into lines (default: True)
resolve_blocks=False, # whether lines should be automatically grouped into blocks (default: False)
paragraph_break=0.035, # relative length of the minimum space separating paragraphs (default: 0.035)
# OnnxTR specific parameters
# NOTE: 8-Bit quantized models are not available for FAST detection models and can in general lead to poorer accuracy
load_in_8_bit=False, # set to `True` to load 8-bit quantized models instead of the full precision onces (default: False)
# Advanced engine configuration options
det_engine_cfg=EngineConfig(), # detection model engine configuration (default: internal predefined configuration)
reco_engine_cfg=EngineConfig(), # recognition model engine configuration (default: internal predefined configuration)
clf_engine_cfg=EngineConfig(), # classification (orientation) model engine configuration (default: internal predefined configuration)
)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)
# Display the result (requires matplotlib & mplcursors to be installed)
result.show()
Or even rebuild the original document from its predictions:
import matplotlib.pyplot as plt
synthetic_pages = result.synthesize()
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()
The ocr_predictor
returns a Document
object with a nested structure (with Page
, Block
, Line
, Word
, Artefact
).
To get a better understanding of the document model, check out documentation:
You can also export them as a nested dict, more appropriate for JSON format / render it or export as XML (hocr format):
json_output = result.export() # nested dict
text_output = result.render() # human-readable text
xml_output = result.export_as_xml() # hocr format
for output in xml_output:
xml_bytes_string = output[0]
xml_element = output[1]
You can also load docTR custom exported models: For exporting please take a look at the doctr documentation.
from onnxtr.models import ocr_predictor, linknet_resnet18, parseq
reco_model = parseq("path_to_custom_model.onnx", vocab="ABC")
det_model = linknet_resnet18("path_to_custom_model.onnx")
model = ocr_predictor(det_arch=det_model, reco_arch=reco_model)
Credits where it's due: this repository provides ONNX models for the following architectures, converted from the docTR models:
predictor = ocr_predictor()
predictor.list_archs()
{
'detection archs':
[
'db_resnet34',
'db_resnet50',
'db_mobilenet_v3_large',
'linknet_resnet18',
'linknet_resnet34',
'linknet_resnet50',
'fast_tiny', # No 8-bit support
'fast_small', # No 8-bit support
'fast_base' # No 8-bit support
],
'recognition archs':
[
'crnn_vgg16_bn',
'crnn_mobilenet_v3_small',
'crnn_mobilenet_v3_large',
'sar_resnet31',
'master',
'vitstr_small',
'vitstr_base',
'parseq'
]
}
This repository is in sync with the doctr library, which provides a high-level API to perform OCR on documents. This repository stays up-to-date with the latest features and improvements from the base project. So we can refer to the doctr documentation for more detailed information.
NOTE:
pretrained
is the default in OnnxTR, and not available as a parameter.The CPU benchmarks was measured on a i7-14700K Intel CPU
.
The GPU benchmarks was measured on a RTX 4080 Nvidia GPU
.
Benchmarking performed on the FUNSD dataset and CORD dataset.
docTR / OnnxTR models used for the benchmarks are fast_base
(full precision) | db_resnet50
(8-bit variant) for detection and crnn_vgg16_bn
for recognition.
The smallest combination in OnnxTR (docTR) of db_mobilenet_v3_large
and crnn_mobilenet_v3_small
takes as comparison ~0.17s / Page
on the FUNSD dataset and ~0.12s / Page
on the CORD dataset in full precision.
Library | FUNSD (199 pages) | CORD (900 pages) |
---|---|---|
docTR (CPU) - v0.8.1 | ~1.29s / Page | ~0.60s / Page |
OnnxTR (CPU) - v0.1.2 | ~0.57s / Page | ~0.25s / Page |
OnnxTR (CPU) 8-bit - v0.1.2 | ~0.38s / Page | ~0.14s / Page |
EasyOCR (CPU) - v1.7.1 | ~1.96s / Page | ~1.75s / Page |
PyTesseract (CPU) - v0.3.10 | ~0.50s / Page | ~0.52s / Page |
Surya (line) (CPU) - v0.4.4 | ~48.76s / Page | ~35.49s / Page |
PaddleOCR (CPU) - no cls - v2.7.3 | ~1.27s / Page | ~0.38s / Page |
Library | FUNSD (199 pages) | CORD (900 pages) |
---|---|---|
docTR (GPU) - v0.8.1 | ~0.07s / Page | ~0.05s / Page |
docTR (GPU) float16 - v0.8.1 | ~0.06s / Page | ~0.03s / Page |
OnnxTR (GPU) - v0.1.2 | ~0.06s / Page | ~0.04s / Page |
EasyOCR (GPU) - v1.7.1 | ~0.31s / Page | ~0.19s / Page |
Surya (GPU) float16 - v0.4.4 | ~3.70s / Page | ~2.81s / Page |
PaddleOCR (GPU) - no cls - v2.7.3 | ~0.08s / Page | ~0.03s / Page |
If you wish to cite please refer to the base project citation, feel free to use this BibTeX reference:
@misc{doctr2021,
title={docTR: Document Text Recognition},
author={Mindee},
year={2021},
publisher = {GitHub},
howpublished = {\url{https://github.com/mindee/doctr}}
}
Distributed under the Apache 2.0 License. See LICENSE
for more information.