biolab / orange3

Orange: Interactive data analysis
https://orangedatamining.com

OCR (optical character recognition) in images #6642

Closed by simonaubertbd 9 months ago

simonaubertbd commented 10 months ago

What's your use case? I would like to extract text from images and PDF documents.

What's your proposed solution? An OCR module in the image add-on to extract text. Some Python modules exist for that. There is some documentation here (sadly, in French: https://www.aranacorp.com/fr/reconnaissance-de-texte-avec-python/), and as I understand it, you could use pytesseract.
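To make the proposal concrete, here is a minimal sketch of what such an OCR step might look like outside Orange. It assumes pytesseract and the Tesseract binary are installed; the function name `ocr_folder`, the helper `_tesseract_ocr`, and the list of image suffixes are hypothetical choices for illustration, not part of Orange or any add-on. PDF files are not handled here; they would first have to be converted to images by some other tool.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Assumption: these are the image types worth scanning; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def _tesseract_ocr(image_path: str) -> str:
    """Default OCR backend: pytesseract accepts a file path directly."""
    # Imported lazily so this module loads even where pytesseract is missing.
    import pytesseract

    return pytesseract.image_to_string(image_path)


def ocr_folder(
    folder: str, ocr: Optional[Callable[[str], str]] = None
) -> Dict[str, str]:
    """Return a {file name: extracted text} mapping for the images in `folder`.

    `ocr` is any callable that takes a file path and returns text; it
    defaults to the pytesseract-based helper above.
    """
    ocr = ocr or _tesseract_ocr
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip subfolders and anything that is not a known image type.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = ocr(str(path)).strip()
    return results
```

Calling `ocr_folder("receipts")` on a folder of scanned images (the folder name is only an example) would give one text string per image. Passing the OCR function as a parameter keeps the folder-walking logic independent of Tesseract, so a different engine could be swapped in.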

Are there any alternative solutions? No

janezd commented 9 months ago

Would this be really useful? How does one get a large collection of QR codes that he would like to text-mine?

We discussed this at the meeting on Friday. At first glance, it could belong to the image analytics add-on, but it is rather about texts, so it could be in the text add-on. However, our resources are limited, and the QR widget had no advocates in the group. If you wish to have it and you consider it worth your effort, go for it. At first, your code would be accepted into the prototype add-on; if it's useful, it can then move to a more appropriate permanent place.

simonaubertbd commented 9 months ago

Hello @janezd, thanks for your constructive feedback. The common use case is folders full of PDF documents (generally scanned cashier's receipts, sent in by employees, which the accountant has to check). Another use case could be libraries and research in fields like literature. I saw it more as an image analytics add-on, a way to get information out of a picture, than as a text add-on. By the way, it's on my to-do list to learn how to develop widgets for Orange Data Mining next year. Best regards, Simon