JaidedAI / EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
https://www.jaided.ai
Apache License 2.0

[Feature Request] Hierarchical output (e.g. with paragraphs) #861

Open athewsey opened 2 years ago

athewsey commented 2 years ago

Me again with another suggestion 😄,

When trying out paragraph=True I saw that the results completely merge the detected text & boxes - and couldn't see any way to map back to the original words themselves.

Word-level bounding boxes can be really useful for applying layout-aware NLP models (e.g. LayoutLM, DocFormer, etc.) downstream. As far as I can tell today, a user would need to run EasyOCR twice if they wanted to produce both word-level and paragraph-level results, and even then wouldn't be able to map objects between the two?

It would be great to have some kind of hierarchical output option, to be able to group detections without losing the low-level information. For example I believe Tesseract does this today already, although I can't say I found their TSV records structure that easy to get started with either! Nested objects might be nicer to iterate through.
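
To make the idea concrete, here is a rough sketch of what a nested result could look like. The `Word` and `Paragraph` classes are hypothetical and not part of EasyOCR; the only thing taken from the library is the shape of a word-level detection (four corner points, text, confidence).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Word:
    # Hypothetical container mirroring one word-level detection:
    # four corner points, the recognized text, and a confidence score.
    box: List[Point]
    text: str
    confidence: float


@dataclass
class Paragraph:
    # Hypothetical parent object that keeps its child words instead of
    # merging them away.
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Paragraph text is derived from the words, in stored order.
        return " ".join(w.text for w in self.words)

    @property
    def box(self) -> List[Point]:
        # Axis-aligned union of all word boxes, as four corner points.
        xs = [x for w in self.words for x, _ in w.box]
        ys = [y for w in self.words for _, y in w.box]
        return [
            (min(xs), min(ys)),
            (max(xs), min(ys)),
            (max(xs), max(ys)),
            (min(xs), max(ys)),
        ]


para = Paragraph(words=[
    Word([(0, 0), (40, 0), (40, 10), (0, 10)], "Hello", 0.98),
    Word([(45, 0), (90, 0), (90, 10), (45, 10)], "world", 0.95),
])

for word in para.words:
    print(word.text, word.box, word.confidence)
```

With something like this, iterating from paragraph to word is a plain nested loop, and the paragraph-level text and box stay consistent with the words they were built from.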

Disclosure

I currently work at AWS (but helping our customers build solutions, not building AWS services themselves), and am also a regular user of Amazon Textract... So am not intending to unduly steer your design in whatever way, but might be biased by what I'm familiar with using! 🙇

rkcosmos commented 2 years ago

Sounds good. Thanks for the suggestion. We will add this feature.

emanuelevivoli commented 1 year ago

Hello, and compliments on the repo! Any progress on this feature?

tevfikaktay commented 7 months ago

@emanuelevivoli maybe the quickest solution for now is to add the get_paragraph() function (in utils.py) to your own pipeline after getting the word-level results. Please be sure to set its parameters accordingly if you changed them from the defaults.
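
Once both word-level and paragraph-level results are available, one way to map between them is to assign each word to the paragraph box that contains its center. The sketch below is only an illustration under assumptions: `attach_words` and `box_bounds` are hypothetical helpers, not EasyOCR functions, word entries are assumed to be (box, text, confidence) tuples, and paragraph entries are assumed to be (box, text) pairs. The sample data is made up.

```python
def box_bounds(box):
    """Return (min_x, min_y, max_x, max_y) for a list of [x, y] corner points."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def attach_words(word_results, paragraph_results):
    """Hypothetical helper: nest word detections under paragraph detections.

    A word is attached to the first paragraph whose axis-aligned bounds
    contain the word's center point. Words that fit no paragraph are
    returned separately so nothing is silently dropped.
    """
    nested = [{"box": p[0], "text": p[1], "words": []} for p in paragraph_results]
    unassigned = []
    for word in word_results:
        x0, y0, x1, y1 = box_bounds(word[0])
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for para in nested:
            px0, py0, px1, py1 = box_bounds(para["box"])
            if px0 <= cx <= px1 and py0 <= cy <= py1:
                para["words"].append(word)
                break
        else:
            # No paragraph box contained this word's center.
            unassigned.append(word)
    return nested, unassigned


# Made-up sample data in the assumed formats.
words = [
    ([[0, 0], [40, 0], [40, 10], [0, 10]], "Hello", 0.98),
    ([[45, 0], [90, 0], [90, 10], [45, 10]], "world", 0.95),
    ([[0, 100], [30, 100], [30, 110], [0, 110]], "Bye", 0.90),
]
paragraphs = [
    ([[0, 0], [90, 0], [90, 10], [0, 10]], "Hello world"),
]

nested, unassigned = attach_words(words, paragraphs)
```

Center containment is a deliberately simple rule. Overlapping paragraph boxes or rotated text would need something stricter, such as picking the paragraph with the largest overlap area.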