OpenAdaptAI / OpenAdapt

AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
https://www.OpenAdapt.AI
MIT License

Handle similar segments #679

Closed · abrichr closed this issue 3 weeks ago

abrichr commented 1 month ago

Related: https://github.com/OpenAdaptAI/OpenAdapt/issues/692

Work in progress

[Screenshots: segmentation results on an Excel spreadsheet]

nms.py:

[Screenshot: output of nms.py]

Using accessibility data (only available for Excel on Windows):

[Screenshot: segmentation using accessibility data]

KrishPatel13 commented 1 month ago

@abrichr , I would like to know more about the fix in this PR.

I understand that the VisualReplayStrategy produces duplicate segmentations, and that we want to fix this. Correct me if I'm mistaken.

Also, I need more clarification about the 2 TODOs. What do we mean by sliding window and NMS (does that mean non-maximum suppression)?

Thank you.

abrichr commented 1 month ago

Thanks for your interest @KrishPatel13 !

> I understand that the VisualReplayStrategy produces duplicate segmentations, and that we want to fix this. Correct me if I'm mistaken.

This is actually a separate issue tracked in https://github.com/OpenAdaptAI/OpenAdapt/issues/614.

> Also, I need more clarification about the 2 TODOs. What do we mean by sliding window and NMS (does that mean non-maximum suppression)?

There are two separate but related issues here:

  1. The VisualReplayStrategy works by first having the model describe the available segments, then modifying the recorded actions based on the user's instructions and those segment descriptions. A problem arises when there are multiple visually similar segments (e.g. cells in a spreadsheet). Initially I was thinking we could prompt the model to describe the differences between similar segments. A more deterministic alternative is to include "grid landmarks" in each segment description: simply include in an element's description the descriptions of the other elements aligned with it horizontally and vertically. This implicitly solves the problem of distinguishing between cells whose only difference is their relative position (e.g. cell A1 vs. A2); see the first sketch after this list.

  2. A single run of any segmentation model will not pick up every segment. Sliding window + non-maximum suppression is a possible solution (used in https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit; see the second sketch below). I have created a new issue for this here: https://github.com/OpenAdaptAI/OpenAdapt/issues/695.
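
To make the grid-landmark idea in point 1 concrete, here is a minimal Python sketch. Everything in it is illustrative rather than OpenAdapt's actual API: the `Segment` dataclass, the `tol` alignment tolerance, and the description format are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    x: float  # bounding box center x
    y: float  # bounding box center y
    description: str

def describe_with_landmarks(segments: list[Segment], tol: float = 5.0) -> list[str]:
    """Augment each segment's description with the descriptions of the
    elements aligned with it horizontally and vertically, so visually
    identical cells become distinguishable by their neighbors."""
    out = []
    for seg in segments:
        row = [s.description for s in segments
               if s is not seg and abs(s.y - seg.y) <= tol]
        col = [s.description for s in segments
               if s is not seg and abs(s.x - seg.x) <= tol]
        out.append(
            f"{seg.description}"
            f" | aligned horizontally with: {', '.join(row) or 'nothing'}"
            f" | aligned vertically with: {', '.join(col) or 'nothing'}"
        )
    return out
```

With this, two visually identical cells in the same column differ in their "aligned horizontally with" lists, giving the model a deterministic handle on relative position without relying on prompting alone.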
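For point 2, sketched below is a generic illustration of sliding window + non-maximum suppression, not OpenAdapt's implementation: `run_model` is a hypothetical detector callback returning (box, score) pairs with boxes as (x1, y1, x2, y2) arrays, and the window size, stride, and IoU threshold are arbitrary defaults.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes: list[np.ndarray], scores: list[float], thresh: float = 0.5) -> list[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping a kept one."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

def sliding_window_detect(image: np.ndarray, run_model, win: int = 512, stride: int = 256):
    """Run the detector on overlapping crops, map boxes back to full-image
    coordinates, then merge duplicates across windows with NMS."""
    boxes, scores = [], []
    h, w = image.shape[:2]
    for top in range(0, max(h - win, 0) + 1, stride):
        for left in range(0, max(w - win, 0) + 1, stride):
            crop = image[top:top + win, left:left + win]
            for box, score in run_model(crop):
                boxes.append(np.asarray(box) + np.array([left, top, left, top]))
                scores.append(score)
    keep = nms(boxes, scores)
    return [boxes[i] for i in keep], [scores[i] for i in keep]
```

The overlap between windows is what lets a second pass catch segments the first window clipped; NMS then collapses the duplicate detections that overlap inevitably produces.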

Which of the three issues seems most interesting to you?