Shared-Reality-Lab / IMAGE-server

IMAGE project server components

Provide more information inside object detection bounding boxes #210

Open jeffbl opened 2 years ago

jeffbl commented 2 years ago

Based on discussion with @dreamseed87 @johnnyvenom @florian-grond @cyan.

For a device like the Dot Pad, we can provide raised dots in rectangles representing different objects, and indicate what they are via audio, or perhaps via the braille line at the bottom of the Dot Pad. This work item is to figure out what other information could be provided within each bounding box to give more detail. For example:

  1. contour finding around the detected object.
  2. thresholding via posterization to find patterns in light/dark pixel areas, providing more shape detail (and maybe contour information, depending on the image); see the sketch after this list.
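
A minimal sketch of the second idea, assuming the preprocessor has OpenCV available; the function name, the `levels` parameter, and the `(x, y, w, h)` box format are illustrative, not from the IMAGE-server code:

```python
import cv2

def box_contours(image, box, levels=4):
    """Posterize a bounding-box crop and return contours of its regions."""
    x, y, w, h = box
    crop = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    # Posterize: quantize gray levels to expose light/dark areas.
    step = 256 // levels
    poster = (crop // step) * step
    # Threshold the posterized crop, then trace the region outlines.
    _, mask = cv2.threshold(poster, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours
```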

Examples:

  1. A bounding box might find a horse, but it can't represent which way the horse is facing, or the fact that it has four very long legs.
  2. A bounding box around a long item on a diagonal would not reveal its orientation.

We can imagine many practical difficulties with either approach, but we wanted to understand the likely difficulty of this task. The current thinking is that the basic bounding box approach is a baseline the haptics team can implement against, but having more detail/shape within the bounding boxes would take things further.

Note that semantic segmentation should already be returning a structured outline/contour once #130 is completed, but as of now, it only operates on photographs classified as "outdoor".

Assigning to @dreamseed87 for clarification/more details based on the conversation, and to @gp1702 for evaluation as a new preprocessor enhancement work item.

dreamseed87 commented 2 years ago

As @jeffbl mentioned, what the haptics team wants to present is "how it appears." Bounding boxes don't provide much shape detail for tactile graphics; they just indicate "the item is here." If the contour of an object, or an OBB (oriented bounding box), can be detected by the preprocessor, it would be useful for haptic rendering. Given the resolution of haptic rendering, a "rough" contour is fine (smoothing may work; see the sketch below). Would saliency detection inside the segment also be helpful? @gp1702, whenever you are available, we could chat briefly to clarify this.
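
As a hedged illustration of the "rough contour" idea: a contour could be simplified to match the haptic display's resolution with something like `cv2.approxPolyDP`. The default tolerance below is a guess, not a tuned number:

```python
import cv2

def rough_contour(contour, tolerance=0.02):
    """Simplify a contour so it renders cleanly at low haptic resolution."""
    # Tolerance is a fraction of the contour's perimeter (arc length);
    # larger values give a coarser, smoother outline.
    epsilon = tolerance * cv2.arcLength(contour, True)
    return cv2.approxPolyDP(contour, epsilon, True)
```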

jeffbl commented 2 years ago

@dreamseed87 Is this still valid as-is, or has our thinking evolved since this was logged last year? Should we pull this from the backlog as part of IGNITE, and if so, can we get some more specific examples to implement against?

dreamseed87 commented 2 years ago

Oh, this totally slipped my mind... I believe the tacton approach we built partially covers this topic (with a nice tacton design from @johnnyvenom). Anyhow, it would be good to pull some elements of this into IGNITE. As specific examples, we can split this problem into three parts:

  1. The object's orientation can distort the perceived object size (especially for long, diagonal objects, e.g., chopsticks, umbrellas).
  2. Perspective matters, e.g., a dog's front view vs. its side view; the "blind men and an elephant" problem.
  3. An extremely small object may not be presented correctly on the Dot Pad with a bounding box (e.g., a backpack only 20 pixels large): only a single pixel or a few pixels get raised. (BTW, contour drawing from semantic segmentation works well on the Dot Pad.)

Possible solutions would be:

  1. Using/implementing an oriented bounding box. However, this needs a schema change or additional preprocessor work to figure out the OBB's direction and size; see the sketch below.
  2. Probably the most difficult one for the preprocessor team: it would need to extract a high-level description of objects with perspective, e.g., "a dog's face, facing the user," or "a side view of a running dog."
  3. (Where tactons already work well.) Probably we would need to optimize and diversify the tactons, and give users some intuition for them.

My guess at the expected level of difficulty: 3 (easiest) <= 1 << 2 (hardest).
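
A minimal sketch of solution 1, assuming OpenCV is available and that the preprocessor already has a binary mask for the object; the returned field names are hypothetical, not the IMAGE schema:

```python
import cv2

def oriented_box(mask):
    """Fit an oriented (rotated) bounding box to a binary object mask."""
    points = cv2.findNonZero(mask)  # Nx1x2 array of object pixel coordinates
    (cx, cy), (w, h), angle = cv2.minAreaRect(points)
    corners = cv2.boxPoints(((cx, cy), (w, h), angle))  # 4 corner points
    return {
        "centre": (cx, cy),
        "size": (w, h),
        "angle": angle,          # rotation in degrees, per OpenCV convention
        "corners": corners.tolist(),
    }
```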

Cybernide commented 2 years ago

Some questions: what do you mean by "oriented bounding box" in solution 1? For solution 3, are you proposing documentation of some kind? A link to "how to interpret this," like we currently have with audio?

dreamseed87 commented 2 years ago

For 1: see the example picture in https://stackoverflow.com/questions/40404031/drawing-bounding-box-for-a-rotated-object , or search images for "oriented bounding box." It means a bounding box with rotation (not axis-aligned). For 3, it's really an open question, but I believe there are several points we can improve. For example: 1) we could add more tactons to cover more items (each must remain identifiable); 2) probably most important, in @johnnyvenom's implementation we used an alias between each tacton and the actual detected object (e.g., "earth" mapped to a circle tacton, "traffic lights" mapped to an hourglass tacton). An audio description of this alias mapping may help the users; see the sketch below.
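
A hypothetical sketch of that alias idea, using the two mappings mentioned above; the table contents and function name are illustrative only:

```python
# Map detected object labels to the tacton shape that represents them.
TACTON_ALIASES = {
    "earth": "circle",
    "traffic lights": "hourglass",
}

def alias_audio_hint(label):
    """Build a short spoken description of the tacton-to-object alias."""
    tacton = TACTON_ALIASES.get(label)
    if tacton is None:
        return f"{label}: no tacton assigned."
    return f"A {tacton} tacton represents the {label}."
```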

Cybernide commented 2 years ago

Right, but 2 is likely the hardest to provide more detail for, since we're bound by the limitations of ML. I say we work on either 1 or 3, as it looks like we have a much leaner ML team this time around.

dreamseed87 commented 2 years ago

Yes, that's what I am saying as well :) 2 is overkill for IGNITE. 1 or 3 should be fine.