OpenAdaptAI / OpenAdapt

AI-First Process Automation with Large Language (LLMs), Large Action (LAMs), Large Multimodal (LMMs), and Visual Language (VLMs) Models
https://www.OpenAdapt.AI
MIT License

Avoid unnecessary segmentation + description in `VisualReplayStrategy` #614

Open · abrichr opened this issue 2 months ago

abrichr commented 2 months ago

Feature request

https://github.com/OpenAdaptAI/OpenAdapt/pull/610 introduced the VisualReplayStrategy which works by segmenting the active window for every mouse event.

This is wasteful because some or all of the active window may not change between mouse events.

We would like to implement the following optimization:

1. Store the segmentation retrieved in https://github.com/OpenAdaptAI/OpenAdapt/pull/610/files#diff-4123d48b6e604812e5bbba6507183956b05038539947eedfd02a7e475344cbc5R313 (i.e. the Segmentation object) in the database. Implemented in https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/models.py#L178.

2. During replay, in the VisualReplayStrategy, find the active window screenshot that is most similar to the current active window, e.g. using https://github.com/JohannesBuchner/imagehash. (Retrieve all Screenshots for the recording, and extract the active window with https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/models.py#L315.) Implemented in https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py#L409

3. Extract the portion of the active window that differs (i.e. by more than some tolerance) from the window retrieved in step 2, and segment + describe only this portion rather than the full window. Then recombine the new segments with the unchanged segments from the window retrieved in step 2.

Note: in the calculator example, the only difference between windows will be the text containing the number at the top of the window. This region will be removed in vision.refine_masks, which means there will be nothing more to describe, and we can re-use the previous Segmentation and descriptions. Therefore, this optimization is working correctly if, during the calculator example, we only need to fetch descriptions once, for the first action.
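The diff-and-recombine step can be sketched as follows. This is a minimal pure-Python illustration operating on grayscale images represented as lists of rows; in OpenAdapt the inputs would be PIL crops of the active window, and `diff_bbox`, `crop`, and `tolerance` are hypothetical names, not existing OpenAdapt APIs.

```python
def diff_bbox(prev, curr, tolerance=10):
    """Return the bounding box (left, top, right, bottom) of pixels whose
    grayscale values differ by more than `tolerance`, or None if unchanged."""
    rows = [y for y, (pr, cr) in enumerate(zip(prev, curr))
            if any(abs(a - b) > tolerance for a, b in zip(pr, cr))]
    if not rows:
        return None
    cols = [x for x in range(len(prev[0]))
            if any(abs(prev[y][x] - curr[y][x]) > tolerance for y in rows)]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)

def crop(img, box):
    """Extract the changed region; only this crop would be segmented + described."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]

# Simulate a calculator-style change: only a small display area differs.
prev = [[0] * 8 for _ in range(8)]
curr = [row[:] for row in prev]
curr[2][3] = curr[2][4] = 255
box = diff_bbox(prev, curr)   # → (3, 2, 5, 3)
changed = crop(curr, box)     # segments outside `box` are reused unchanged
```

If `diff_bbox` returns None (or the changed region is eliminated by vision.refine_masks), the previous Segmentation and descriptions can be reused wholesale.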

Motivation

VisualReplayStrategy is very slow.

abrichr commented 2 months ago

ChatGPT:

For the task of finding similar UI images, here is a comparison of three libraries:

FAISS (Facebook AI Similarity Search):

Image-Similarity-Measures:

ImageHash:

In summary:

For your specific use case of finding similar UI images, if you're dealing with a large database of images and you need the performance, FAISS is a strong candidate. If the dataset is smaller and the task is more about detecting near-duplicates based on structural similarity, ImageHash is a more appropriate choice. Image-Similarity-Measures could be a supplementary tool for providing additional verification but is less suited for database operations.
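For context, ImageHash's average hash boils down to downsampling the image to a tiny grid, thresholding each cell against the mean, and comparing hashes by Hamming distance. A pure-Python sketch of that idea (the real library operates on PIL images and returns 64-bit hashes; `most_similar` is a hypothetical helper, not an ImageHash API):

```python
def average_hash(img, size=8):
    """Downsample a grayscale image (list of rows) to size x size by block
    averaging, then threshold each cell against the global mean."""
    h, w = len(img), len(img[0])
    cells = []
    for by in range(size):
        for bx in range(size):
            block = [img[y][x]
                     for y in range(by * h // size, (by + 1) * h // size)
                     for x in range(bx * w // size, (bx + 1) * w // size)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if c > mean else 0 for c in cells]

def hamming(h1, h2):
    """Number of differing hash bits; small distance means near-duplicate."""
    return sum(a != b for a, b in zip(h1, h2))

def most_similar(query, stored):
    """Index of the stored active-window screenshot closest to `query`."""
    qh = average_hash(query)
    return min(range(len(stored)),
               key=lambda i: hamming(qh, average_hash(stored[i])))
```

Because small pixel-level changes (e.g. a calculator's display updating) barely move the block averages, hashes of successive windows tend to collide, which is exactly the near-duplicate behavior this issue wants to exploit.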

Edit: Structural Similarity Index (SSIM) implemented in https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py#L409
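For reference, SSIM compares luminance, contrast, and structure between two images. A minimal global (single-window) sketch follows; production implementations such as scikit-image's `structural_similarity` use a sliding window instead, and the default constants here assume an 8-bit dynamic range (C1 = (0.01·255)², C2 = (0.03·255)²):

```python
def ssim(x, y, c1=6.5025, c2=58.5225):
    """Global SSIM over two equal-size grayscale images (lists of rows)."""
    xs = [p for row in x for p in row]
    ys = [p for row in y for p in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n                       # means
    vx = sum((p - mx) ** 2 for p in xs) / n                 # variances
    vy = sum((p - my) ** 2 for p in ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

SSIM scores near 1.0 indicate the current active window closely matches a stored screenshot, making it a reasonable similarity metric for the lookup in step 2 when the number of screenshots per recording is small.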