VikParuchuri / surya

OCR, layout analysis, reading order, line detection in 90+ languages
https://www.datalab.to
GNU General Public License v3.0

Surya

Surya is a document OCR toolkit that does OCR, layout analysis, reading order detection, and text line detection in 90+ languages.

It works on a range of documents (see usage and benchmarks for more details).

[Images: a New York Times article shown with detection, OCR (recognition), layout, and reading order output]

Surya is named for the Hindu sun god, who has universal vision.

Community

Discord is where we discuss future development.

Examples

Name Detection OCR Layout Order
Japanese Image Image Image Image
Chinese Image Image Image Image
Hindi Image Image Image Image
Arabic Image Image Image Image
Chinese + Hindi Image Image Image Image
Presentation Image Image Image Image
Scientific Paper Image Image Image Image
Scanned Document Image Image Image Image
New York Times Image Image Image Image
Scanned Form Image Image Image Image
Textbook Image Image Image Image

Hosted API

There is a hosted API for all surya models available here:

Commercial usage

I want surya to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Installation

You'll need python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See here for more details.

Install with:

pip install surya-ocr

Model weights will automatically download the first time you run surya. Note that surya does not yet work with transformers 4.37 or later, so you will need to keep version 4.36.2, which is installed with surya.
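If you are unsure which transformers version ended up in your environment, a quick check like the one below can help. This is a minimal sketch using only the standard library; the helper name `is_compatible` is hypothetical and not part of surya, and it simply treats anything at or above 4.37 as incompatible, following the note above.

```python
from importlib.metadata import PackageNotFoundError, version


def is_compatible(transformers_version: str) -> bool:
    """Return True if the version is below 4.37 (hypothetical helper, not part of surya)."""
    # Compare only the major and minor components, e.g. "4.36.2" -> (4, 36).
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    return (major, minor) < (4, 37)


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if is_compatible(installed) else "too new for surya, pin 4.36.2"
        print(f"transformers {installed}: {status}")
```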

Usage

Interactive App

I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:

pip install streamlit
surya_gui

Pass the --math command line argument to use the math text detection model instead of the default model. This will detect math better, but will be worse at everything else.

OCR (text recognition)

This command will write out a json file with the detected text and bboxes:

surya_ocr DATA_PATH --images --langs hi,en

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
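The top-level structure described above (filename keys, one dictionary per page) can be walked with plain `json`. This is a minimal sketch: the function names are hypothetical, the path to results.json depends on where your run wrote it, and no per-page field names are assumed.

```python
import json
from pathlib import Path


def summarize_results(results_path):
    """Return {input_name: number_of_pages} for a surya results.json file."""
    with Path(results_path).open(encoding="utf-8") as f:
        results = json.load(f)
    # Keys are input filenames without extensions; values are lists of page dicts.
    return {name: len(pages) for name, pages in results.items()}


def iter_pages(results_path):
    """Yield (input_name, page_index, page_dict) for every page in the file."""
    with Path(results_path).open(encoding="utf-8") as f:
        results = json.load(f)
    for name, pages in results.items():
        for index, page in enumerate(pages):
            yield name, index, page
```

Printing `sorted(page)` for each page yielded by `iter_pages` is an easy way to see which fields your version of surya writes.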

Performance tips

Setting the RECOGNITION_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 50MB of VRAM, so very high batch sizes are possible. The default batch size is 256, which will use about 12.8GB of VRAM. On CPU, raising the batch size may also help, depending on your core count; the default CPU batch size is 32.
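The arithmetic behind those numbers is just batch size times per-item memory. The sketch below makes it explicit so you can pick a batch size for your own card; both function names are hypothetical, and the estimate ignores model weights and framework overhead, so treat the result as an upper bound to tune down from.

```python
def estimate_vram_gb(batch_size: int, mb_per_item: float) -> float:
    """Approximate VRAM in GB for a batch, using 1 GB = 1000 MB as the text does."""
    return batch_size * mb_per_item / 1000


def max_batch_size(vram_budget_gb: float, mb_per_item: float) -> int:
    """Largest batch size whose estimated usage fits in the given budget."""
    return int(vram_budget_gb * 1000 // mb_per_item)
```

For example, `estimate_vram_gb(256, 50)` reproduces the 12.8GB figure for the default, and with roughly 8GB to spare `max_batch_size(8, 50)` gives 160, which could then be set as RECOGNITION_BATCH_SIZE=160 in the environment.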

From python

from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection import segformer
from surya.model.recognition.model import load_model
from surya.model.recognition.processor import load_processor

image = Image.open(IMAGE_PATH)
langs = ["en"] # Replace with your languages
det_processor, det_model = segformer.load_processor(), segformer.load_model()
rec_model, rec_processor = load_model(), load_processor()

predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)

Compilation

The OCR model can be compiled to get a ~15% speedup in total inference time, although the first run will be slow while it compiles. First set RECOGNITION_STATIC_CACHE=true, then:

import torch

rec_model.decoder.model.decoder = torch.compile(rec_model.decoder.model.decoder)

Text line detection

This command will write out a json file with the detected bboxes.

surya_detect DATA_PATH --images

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

Performance tips

Setting the DETECTOR_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 280MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which will use about 9GB of VRAM. On CPU, raising the batch size may also help, depending on your core count; the default CPU batch size is 2.

From python

from PIL import Image
from surya.detection import batch_text_detection
from surya.model.detection.segformer import load_model, load_processor

image = Image.open(IMAGE_PATH)
model, processor = load_model(), load_processor()

# predictions is a list of dicts, one per image
predictions = batch_text_detection([image], model, processor)

Layout analysis

This command will write out a json file with the detected layout.

surya_layout DATA_PATH --images

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

Performance tips

Setting the DETECTOR_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 280MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which will use about 9GB of VRAM. On CPU, raising the batch size may also help, depending on your core count; the default CPU batch size is 2.

From python

from PIL import Image
from surya.detection import batch_text_detection
from surya.layout import batch_layout_detection
from surya.model.detection.segformer import load_model, load_processor
from surya.settings import settings

image = Image.open(IMAGE_PATH)
model = load_model(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
processor = load_processor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
det_model = load_model()
det_processor = load_processor()

# Detect text lines first; the layout model takes them as an input
line_predictions = batch_text_detection([image], det_model, det_processor)
# layout_predictions is a list of dicts, one per image
layout_predictions = batch_layout_detection([image], model, processor, line_predictions)

Reading order

This command will write out a json file with the detected reading order and layout.

surya_order DATA_PATH --images

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

Performance tips

Setting the ORDER_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 360MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which will use about 11GB of VRAM. On CPU, raising the batch size may also help, depending on your core count; the default CPU batch size is 4.

From python

from PIL import Image
from surya.ordering import batch_ordering
from surya.model.ordering.processor import load_processor
from surya.model.ordering.model import load_model

image = Image.open(IMAGE_PATH)
# bboxes should be a list of lists with layout bboxes for the image in [x1,y1,x2,y2] format
# You can get this from the layout model, see above for usage
bboxes = [bbox1, bbox2, ...]

model = load_model()
processor = load_processor()

# order_predictions will be a list of dicts, one per image
order_predictions = batch_ordering([image], [bboxes], model, processor)

Limitations

Troubleshooting

If OCR isn't working properly:

Manual install

If you want to develop surya, you can install it manually:

Benchmarks

OCR

[Figure: OCR benchmark chart, surya vs. tesseract]

Model Time per page (s) Avg similarity (⬆)
surya 0.62 0.97
tesseract 0.45 0.88

Full language results

Tesseract is CPU-based, and surya is CPU or GPU. I tried to cost-match the resources used, so I used a 1xA6000 (48GB VRAM) for surya, and 28 CPU cores for Tesseract (same price on Lambda Labs/DigitalOcean).

Google Cloud Vision

I benchmarked OCR against Google Cloud Vision since it has similar language coverage to Surya.

[Figure: OCR benchmark chart, surya vs. Google Cloud Vision]

Full language results

Methodology

I measured normalized sentence similarity (0-1, higher is better) on a set of real-world and synthetic PDFs. I sampled PDFs from Common Crawl, then filtered out the ones with bad OCR. I couldn't find PDFs for some languages, so I also generated simple synthetic PDFs for those.
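To make "normalized similarity" concrete, here is a small stand-in. The text above does not name the exact metric, so this sketch assumes a character-level ratio from Python's `difflib`; it is not necessarily what the benchmark script computes, and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def line_similarity(predicted: str, reference: str) -> float:
    """Similarity in [0, 1] after collapsing whitespace (stand-in metric, see lead-in)."""
    a = " ".join(predicted.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b).ratio()


def average_similarity(pairs):
    """Mean similarity over (predicted, reference) line pairs; 0.0 for no pairs."""
    scores = [line_similarity(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging per-line scores like this gives one number per document or language, which is the shape of the "Avg similarity" column in the table.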

I used the reference line bboxes from the PDFs with both tesseract and surya, to just evaluate the OCR quality.

For Google Cloud, I aligned the output from Google Cloud with the ground truth. I had to skip RTL languages since they didn't align well.

Text line detection

[Figure: text line detection benchmark chart]

Model Time (s) Time per page (s) precision recall
surya 52.6892 0.205817 0.844426 0.937818
tesseract 74.4546 0.290838 0.631498 0.997694

Tesseract is CPU-based, and surya is CPU or GPU. I ran the benchmarks on a system with an A6000 GPU, and a 32 core CPU. This was the resource usage:

Methodology

Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.

I instead used coverage, which calculates:

First calculate coverage for each bbox, then add a small penalty for double coverage, since we want the detection to have non-overlapping bboxes. Anything with a coverage of 0.5 or higher is considered a match.

Then we calculate precision and recall for the whole dataset.
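The steps above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the benchmark's actual code: boxes are integer [x1, y1, x2, y2], the penalty weight of 0.1 is a made-up value, the same penalized coverage is used in both directions, and the function works on one page rather than the whole dataset.

```python
def coverage(box, others, double_penalty=0.1):
    """Fraction of `box` covered by `others`, minus a penalty for doubly covered pixels.

    Boxes are integer [x1, y1, x2, y2]. The penalty weight is an assumed value.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        return 0.0
    # counts[row][col] = how many boxes in `others` cover that pixel of `box`.
    counts = [[0] * width for _ in range(height)]
    for ox1, oy1, ox2, oy2 in others:
        for row in range(max(y1, oy1) - y1, min(y2, oy2) - y1):
            for col in range(max(x1, ox1) - x1, min(x2, ox2) - x1):
                counts[row][col] += 1
    flat = [c for row in counts for c in row]
    covered = sum(1 for c in flat if c >= 1)
    doubled = sum(1 for c in flat if c >= 2)
    return max(0.0, (covered - double_penalty * doubled) / len(flat))


def precision_recall(predicted, reference, threshold=0.5):
    """Precision: share of predicted boxes matched; recall: share of reference boxes matched."""
    matched_pred = sum(coverage(p, reference) >= threshold for p in predicted)
    matched_ref = sum(coverage(r, predicted) >= threshold for r in reference)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    return precision, recall
```

Because a match only needs 0.5 coverage, a line-level prediction can still match several merged word-level reference boxes, which is the reason given above for preferring coverage over IoU.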

Layout analysis

[Figure: layout analysis benchmark chart]

Layout Type precision recall
Image 0.95 0.99
Table 0.95 0.96
Text 0.89 0.95
Title 0.92 0.89

Time per image: 0.79 seconds on GPU (A6000).

Methodology

I benchmarked the layout analysis on PubLayNet, which was not in the training data. I had to align PubLayNet labels with the surya layout labels. I was then able to find coverage for each layout type:

Reading Order

75% mean accuracy, and 0.14 seconds per image on an A6000 GPU. See methodology for notes: this benchmark is not a perfect measure of accuracy, and is more useful as a sanity check.

Methodology

I benchmarked reading order on the layout dataset from here, which was not in the training data. Unfortunately, this dataset is fairly noisy, and not all the labels are correct. It was very hard to find a dataset annotated with both reading order and layout information. I wanted to avoid using a cloud service for the ground truth.

The accuracy is computed by finding if each pair of layout boxes is in the correct order, then taking the % that are correct.
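A minimal sketch of that pairwise check follows. The function name and the input convention (for box i, its position in the reading order) are assumptions made for illustration, and the score is for a single image; averaging across images would give a mean accuracy like the one reported.

```python
from itertools import combinations


def pairwise_order_accuracy(predicted_positions, true_positions):
    """Percent of box pairs whose relative order matches the ground truth.

    Both arguments give, for box i, its position in the reading order.
    """
    if len(predicted_positions) != len(true_positions):
        raise ValueError("both orderings must cover the same boxes")
    pairs = list(combinations(range(len(true_positions)), 2))
    if not pairs:
        return 100.0  # zero or one box: nothing can be out of order
    correct = sum(
        (predicted_positions[i] < predicted_positions[j])
        == (true_positions[i] < true_positions[j])
        for i, j in pairs
    )
    return 100.0 * correct / len(pairs)
```

Swapping two adjacent boxes out of three, as in `pairwise_order_accuracy([1, 0, 2], [0, 1, 2])`, gets one of the three pairs wrong, so a single local mistake costs less than a fully reversed page.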

Running your own benchmarks

You can benchmark the performance of surya on your machine.

Text line detection

This will evaluate tesseract and surya for text line detection across a randomly sampled set of images from DocLayNet.

python benchmark/detection.py --max 256

Text recognition

This will evaluate surya and optionally tesseract on multilingual PDFs from Common Crawl (with synthetic data for missing languages).

python benchmark/recognition.py --tesseract

Layout analysis

This will evaluate surya on the PubLayNet dataset.

python benchmark/layout.py

Reading Order

python benchmark/ordering.py

Training

Text detection was trained on 4x A6000s for 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.

Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model (GQA, MoE layer, UTF-16 decoding, layer config changes).

Thanks

This work would not have been possible without amazing open source AI work:

Thank you to everyone who makes open source AI possible.