Open athewsey opened 2 years ago
Sounds good. Thanks for the suggestion. We will add this feature.
Hello, and compliments on the repo! Any progress on this feature?
@emanuelevivoli maybe the quickest solution for now is to add the get_paragraph() function (in utils.py) to your own pipeline after getting the results. Please be sure to set its parameters accordingly if you changed any of the defaults.
Me again with another suggestion!
When trying out `paragraph=True`, I saw that the results completely merge the detected text & boxes, and I couldn't see any way to map back to the original words themselves.
Word-level bounding boxes can be really useful for applying layout-aware NLP models (e.g. LayoutLM, DocFormer, etc.) downstream. As far as I can tell today, a user would need to run EasyOCR twice if they wanted to produce both word-level and paragraph-level results, and wouldn't be able to map objects between the two?
It would be great to have some kind of hierarchical output option, to be able to group detections without losing the low-level information. For example I believe Tesseract does this today already, although I can't say I found their TSV records structure that easy to get started with either! Nested objects might be nicer to iterate through.
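To make the idea concrete, here is a rough sketch of what nested output could look like. It is not EasyOCR's API: the `Word` and `Paragraph` classes and the `nest_words()` helper are hypothetical names made up for illustration. It assumes word-level results shaped like `(box, text, confidence)` and paragraph-level results shaped like `(box, text)`, where each box is four `[x, y]` corner points, and it attaches a word to a paragraph when the word's centre falls inside the paragraph's bounding rectangle.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    box: list          # four [x, y] corner points
    text: str
    confidence: float


@dataclass
class Paragraph:
    box: list          # four [x, y] corner points
    text: str
    words: list = field(default_factory=list)


def _bounds(box):
    """Return the axis-aligned rectangle (x0, y0, x1, y1) around a 4-point box."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def nest_words(word_results, paragraph_results):
    """Attach each word to the first paragraph whose rectangle contains its centre.

    Hypothetical helper: the input shapes are assumptions, see the text above.
    Words that match no paragraph are returned separately rather than dropped.
    """
    paragraphs = [Paragraph(box, text) for box, text in paragraph_results]
    unassigned = []
    for box, text, conf in word_results:
        x0, y0, x1, y1 = _bounds(box)
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        word = Word(box, text, conf)
        for para in paragraphs:
            px0, py0, px1, py1 = _bounds(para.box)
            if px0 <= cx <= px1 and py0 <= cy <= py1:
                para.words.append(word)
                break
        else:
            unassigned.append(word)
    return paragraphs, unassigned


# Made-up sample data, not real OCR output.
words = [
    ([[0, 0], [40, 0], [40, 10], [0, 10]], "Hello", 0.98),
    ([[45, 0], [90, 0], [90, 10], [45, 10]], "world", 0.95),
    ([[0, 50], [30, 50], [30, 60], [0, 60]], "Bye", 0.90),
]
paragraphs = [
    ([[0, 0], [90, 0], [90, 10], [0, 10]], "Hello world"),
    ([[0, 50], [30, 50], [30, 60], [0, 60]], "Bye"),
]

nested, leftover = nest_words(words, paragraphs)
for para in nested:
    print(para.text, [w.text for w in para.words])
```

Containment by centre point is only one possible matching rule, and it still needs both the word-level and paragraph-level results as input, so it works around the missing mapping rather than replacing a built-in hierarchical output.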
Disclosure
I currently work at AWS (but helping our customers build solutions, not building AWS services themselves), and am also a regular user of Amazon Textract... So I'm not intending to unduly steer your design in any way, but I might be biased by what I'm familiar with using!