Shared-Reality-Lab / IMAGE-server

IMAGE project server components

Implement text OCR preprocessor #114

Closed jeffbl closed 2 years ago

jeffbl commented 3 years ago

As we've seen with the charts preprocessor, text in images can be crucial for understanding, e.g., text on top of / next to wedges in a pie chart. However, the charts model does not currently extract this information, so although the user can know the percentage of different wedges, they have no idea what the wedges represent.

Task is to explore options for extracting text from images, and aligning/linking it with information from other preprocessors that extract locations of objects from graphics, like object detection or the charts preprocessor. Is this even feasible?

Note that for exploration with a haply, just adding the ability to read out text areas when hitting them may be enough in some cases, since proximity to other layers (like the wedges of a pie chart) may be enough to extract relevant meaning.
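As a rough illustration of that proximity idea, here is a minimal sketch of hit-testing a pointer position against OCR text regions; the region format ("text" plus an axis-aligned "bbox") is hypothetical, not an existing preprocessor schema:

```python
# Minimal sketch of reading out text regions when the haply pointer hits them.
# Assumes OCR output as a list of dicts with hypothetical keys "text" and "bbox"
# (bbox = [x_min, y_min, x_max, y_max] in pixel coordinates).

def text_under_pointer(pointer_x, pointer_y, ocr_regions, margin=5):
    """Return the text of any OCR region the pointer is inside (with a small margin)."""
    hits = []
    for region in ocr_regions:
        x_min, y_min, x_max, y_max = region["bbox"]
        if (x_min - margin) <= pointer_x <= (x_max + margin) and \
           (y_min - margin) <= pointer_y <= (y_max + margin):
            hits.append(region["text"])
    return hits

# Example: a pointer resting on a pie-chart label
regions = [{"text": "Renewables 34%", "bbox": [120, 40, 260, 60]}]
print(text_under_pointer(150, 50, regions))  # -> ['Renewables 34%']
```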

This capability may also be key for things like diagrams from textbooks, like the solar system graphic from the NFB demos.

Cybernide commented 3 years ago

Something that we're hearing again and again on the UX side is that people want to know if there's text or labels on images. It's also becoming clear that all the audio work on charts isn't going to be compelling unless we are able to get labels for data. I want to push this issue to the forefront because it's been something that users consistently mention.

jeffbl commented 3 years ago

@Cybernide Note that specifically for plots and charts, we'll have to discuss the priority of this vs. #129, which would only work for some charts, but wouldn't have the same accuracy issues we're likely to encounter with trying to extract it from the graphics. But of course it also will not be general beyond charts/plots.

Cybernide commented 3 years ago

Sure - I think that having accurate chart data alone would make a strong case for adopting the extension. However, at this moment, feedback is increasingly suggesting that users would eagerly adopt our extension if it provided the words in images. As to what to prioritize, we'll try to figure it out.

jeffbl commented 2 years ago

After discussion this morning, investigation should include:

After investigation, and depending on findings, generate new separate work items for creating an initial preprocessor, then build up from there.

Once we have chosen a solution (Azure or something else that looks more promising):

Once we can find text in regions, new functionality is possible, e.g.:

@gp1702 am I forgetting anything from this morning?

@Cybernide @Sabrina-Knappe Anything to add in terms of key issues for investigation as Ben and Aidan go forward on OCR?

jeffbl commented 2 years ago

Moving out of Dec14, but would like to see an update here on initial investigation/progress on this before the new year?

Cybernide commented 2 years ago

I'm going to add issues as I think of them here:

  1. The problem of information density: I can foresee, for example, a graphic of multiple street signs in an area such as Times Square in New York having a LOT of detected text, or graphics of protests with the same slogan on the t-shirts of multiple individuals in a crowd.

florian-grond commented 2 years ago

Having a LOT of detected text is the same as having a lot of the same kind of objects in a picture: it means you need to get parameters from the ML side that allow you to prioritize, order, or play back louder or softer, etc.
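One possible shape for those parameters, sketched with made-up fields ("text", "confidence", "bbox"): group duplicate strings (the same slogan on many shirts) and compute a salience score the renderer could use to order or scale playback.

```python
from collections import defaultdict

# Rough sketch: collapse duplicate strings and rank the remaining regions so
# renderers can choose order or volume. Field names are hypothetical.

def rank_text_regions(ocr_regions):
    grouped = defaultdict(list)
    for region in ocr_regions:
        grouped[region["text"].strip().lower()].append(region)

    ranked = []
    for text, instances in grouped.items():
        best = max(instances, key=lambda r: r.get("confidence", 0.0))
        x_min, y_min, x_max, y_max = best["bbox"]
        area = (x_max - x_min) * (y_max - y_min)
        # Salience combines OCR confidence, on-screen size, and repetition count.
        salience = best.get("confidence", 0.0) * area * len(instances)
        ranked.append({"text": text, "count": len(instances), "salience": salience})
    return sorted(ranked, key=lambda r: r["salience"], reverse=True)
```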

BenMacnaughton commented 2 years ago

Useful docs:

https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts-sdk/client-library?tabs=visual-studio&pivots=programming-language-python

https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/vision-api-how-to-topics/call-read-api
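For reference, a minimal sketch of calling the asynchronous Read API from Python, following the linked quickstart; the environment variable names and the sample image URL are placeholders:

```python
# Sketch based on the linked Python quickstart; assumes the
# azure-cognitiveservices-vision-computervision package and valid credentials.
import os
import time

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials

client = ComputerVisionClient(
    os.environ["AZURE_CV_ENDPOINT"],
    CognitiveServicesCredentials(os.environ["AZURE_CV_KEY"]),
)

# Submit the image (here by URL) to the asynchronous Read API...
read_response = client.read("https://example.com/chart.png", raw=True)
operation_id = read_response.headers["Operation-Location"].split("/")[-1]

# ...then poll until the operation finishes.
while True:
    result = client.get_read_result(operation_id)
    if result.status not in ("notStarted", "running"):
        break
    time.sleep(1)

if result.status == OperationStatusCodes.succeeded:
    for page in result.analyze_result.read_results:
        for line in page.lines:
            # bounding_box is eight numbers: the four corner points of the text line.
            print(line.text, line.bounding_box)
```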

BenMacnaughton commented 2 years ago

The Azure Read API provides bounding boxes, which should allow us to identify text contained within objects identified by other preprocessors.
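A rough sketch of how that linking could work, assuming the other preprocessors expose axis-aligned rectangles (the object format here is hypothetical) and using the Read API's eight-number bounding boxes:

```python
# Sketch of linking Read API text lines to object regions from another preprocessor.
# Assumes objects are dicts with an "id" and a "rect" = [x_min, y_min, x_max, y_max]
# (hypothetical format); each Read API text line has an eight-number quadrilateral.

def quad_to_rect(quad):
    """Collapse the Read API's 4-corner quadrilateral into an axis-aligned box."""
    xs, ys = quad[0::2], quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)

def overlap_fraction(text_rect, obj_rect):
    """Fraction of the text box's area that falls inside the object box."""
    tx0, ty0, tx1, ty1 = text_rect
    ox0, oy0, ox1, oy1 = obj_rect
    ix = max(0, min(tx1, ox1) - max(tx0, ox0))
    iy = max(0, min(ty1, oy1) - max(ty0, oy0))
    text_area = max(1e-9, (tx1 - tx0) * (ty1 - ty0))
    return (ix * iy) / text_area

def link_text_to_objects(text_lines, objects, threshold=0.5):
    """Attach each text line to every object that contains most of it."""
    links = []
    for line in text_lines:
        rect = quad_to_rect(line["bounding_box"])
        for obj in objects:
            if overlap_fraction(rect, obj["rect"]) >= threshold:
                links.append({"text": line["text"], "object_id": obj["id"]})
    return links
```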

jeffbl commented 2 years ago

@BenMacnaughton @aidanwilliams09 I'm moving this to Jan31, but also changing the title to "implement OCR preprocessor" rather than logging a new issue. Acceptable?

BenMacnaughton commented 2 years ago

Sounds good @jeffbl