OpenAdaptAI / OpenAdapt

AI-First Process Automation with Large Language (LLMs), Large Action (LAMs), Large Multimodal (LMMs), and Visual Language (VLMs) Models
https://www.OpenAdapt.AI
MIT License

Avoid unnecessary segmentation + description in `VisualReplayStrategy` #614

Open · abrichr opened this issue 2 months ago

abrichr commented 2 months ago

Feature request

https://github.com/OpenAdaptAI/OpenAdapt/pull/610 introduced the VisualReplayStrategy which works by segmenting the active window for every mouse event.

This is wasteful because some or all of the active window may not change between mouse events.

We would like to implement the following optimization:

1. Store the segmentation retrieved in https://github.com/OpenAdaptAI/OpenAdapt/pull/610/files#diff-4123d48b6e604812e5bbba6507183956b05038539947eedfd02a7e475344cbc5R313 (i.e. the Segmentation object) in the database. Implemented in https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/models.py#L178.

2. During replay, in the VisualReplayStrategy, find the active window screenshot that is most similar to the current active window, e.g. using https://github.com/JohannesBuchner/imagehash. (Retrieve all Screenshots for the recording, and extract the active window with https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/models.py#L315.) Implemented in https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py#L409

3. Extract the portion of the active window that differs (i.e. by more than some tolerance) from the window retrieved in step 2, and segment + describe only this portion rather than the full window. Then recombine the new segments with the unchanged segments from the window retrieved in step 2.

Note: in the calculator example, the only difference between windows will be the text containing the number at the top of the window. This region will be removed in vision.refine_masks, which means there will be nothing more to describe, and we can re-use the previous Segmentation and descriptions. Therefore, this optimization is working correctly if, during the calculator example, we only need to fetch descriptions once, for the first action.
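The diff-and-recombine step can be sketched as follows. This is a minimal pure-Python illustration operating on grayscale images represented as lists of rows; in OpenAdapt the inputs would be PIL crops of the active window, and `diff_bbox`, `crop`, and `tolerance` are hypothetical names, not existing OpenAdapt APIs.

```python
def diff_bbox(prev, curr, tolerance=10):
    """Return the bounding box (left, top, right, bottom) of pixels whose
    grayscale values differ by more than `tolerance`, or None if unchanged."""
    rows = [y for y, (pr, cr) in enumerate(zip(prev, curr))
            if any(abs(a - b) > tolerance for a, b in zip(pr, cr))]
    if not rows:
        return None
    cols = [x for x in range(len(prev[0]))
            if any(abs(prev[y][x] - curr[y][x]) > tolerance for y in rows)]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)

def crop(img, box):
    """Extract the changed region; only this crop would be segmented + described."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]

# Simulate a calculator-style change: only a small display area differs.
prev = [[0] * 8 for _ in range(8)]
curr = [row[:] for row in prev]
curr[2][3] = curr[2][4] = 255
box = diff_bbox(prev, curr)   # → (3, 2, 5, 3)
changed = crop(curr, box)     # segments outside `box` are reused unchanged
```

If `diff_bbox` returns None (or the changed region is eliminated by vision.refine_masks), the previous Segmentation and descriptions can be reused wholesale.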

Motivation

VisualReplayStrategy is very slow.

abrichr commented 2 months ago

ChatGPT:

For the task of finding similar UI images, here is a comparison of three libraries:

FAISS (Facebook AI Similarity Search):

Image-Similarity-Measures:

ImageHash:

In summary:

For your specific use case of finding similar UI images, if you're dealing with a large database of images and you need the performance, FAISS is a strong candidate. If the dataset is smaller and the task is more about detecting near-duplicates based on structural similarity, ImageHash is a more appropriate choice. Image-Similarity-Measures could be a supplementary tool for providing additional verification but is less suited for database operations.
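For context, ImageHash's average hash boils down to downsampling the image to a tiny grid, thresholding each cell against the mean, and comparing hashes by Hamming distance. A pure-Python sketch of that idea (the real library operates on PIL images and returns 64-bit hashes; `most_similar` is a hypothetical helper, not an ImageHash API):

```python
def average_hash(img, size=8):
    """Downsample a grayscale image (list of rows) to size x size by block
    averaging, then threshold each cell against the global mean."""
    h, w = len(img), len(img[0])
    cells = []
    for by in range(size):
        for bx in range(size):
            block = [img[y][x]
                     for y in range(by * h // size, (by + 1) * h // size)
                     for x in range(bx * w // size, (bx + 1) * w // size)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if c > mean else 0 for c in cells]

def hamming(h1, h2):
    """Number of differing hash bits; small distance means near-duplicate."""
    return sum(a != b for a, b in zip(h1, h2))

def most_similar(query, stored):
    """Index of the stored active-window screenshot closest to `query`."""
    qh = average_hash(query)
    return min(range(len(stored)),
               key=lambda i: hamming(qh, average_hash(stored[i])))
```

Because small pixel-level changes (e.g. a calculator's display updating) barely move the block averages, hashes of successive windows tend to collide, which is exactly the near-duplicate behavior this issue wants to exploit.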

Edit: Structural Similarity Index (SSIM) implemented in https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py#L409
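For reference, SSIM compares luminance, contrast, and structure between two images. A minimal global (single-window) sketch follows; production implementations such as scikit-image's `structural_similarity` use a sliding window instead, and the default constants here assume an 8-bit dynamic range (C1 = (0.01·255)², C2 = (0.03·255)²):

```python
def ssim(x, y, c1=6.5025, c2=58.5225):
    """Global SSIM over two equal-size grayscale images (lists of rows)."""
    xs = [p for row in x for p in row]
    ys = [p for row in y for p in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n                       # means
    vx = sum((p - mx) ** 2 for p in xs) / n                 # variances
    vy = sum((p - my) ** 2 for p in ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

SSIM scores near 1.0 indicate the current active window closely matches a stored screenshot, making it a reasonable similarity metric for the lookup in step 2 when the number of screenshots per recording is small.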