OpenAdaptAI / OpenAdapt

AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
https://www.OpenAdapt.AI
MIT License

Modify `WindowEvent.to_prompt_dict` to include computed centroid #725

Open abrichr opened 3 weeks ago


Feature request

From https://arxiv.org/pdf/2406.03679:

> When predicting an action that involves a target element, the model should output sufficient details to locate the target UI element, either an index or its geometric information such as its bounding rectangle. To reduce the complexity of parsing the model output, we prompt an LLM to output its action selection in a predefined JSON format. In the case of element click, for instance, the model outputs a prediction in the following format: `{"action_type":"click","x":,"y":}`, where the target element is identified by its center coordinates. We found LLMs work equally well with predicting element centers or element indices, but as the former approach is compatible with click actions that are not restricted to specific UI elements, our implementation always outputs the center of the target UI element. The same applies to all actions that take an element as input.
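As a concrete illustration of the predefined format described above, a click action identifying the target element by its center coordinates could be emitted as follows. This is a hedged sketch: the field names come from the quote, but the bounding-box parameters (`left`, `top`, `width`, `height`) are assumptions, not the paper's or OpenAdapt's actual schema.

```python
import json

def click_action(left: float, top: float, width: float, height: float) -> str:
    """Emit a click action in the paper's predefined JSON format,
    targeting the center of the given bounding rectangle.
    (Parameter names are illustrative assumptions.)"""
    return json.dumps({
        "action_type": "click",
        "x": left + width / 2,
        "y": top + height / 2,
    })

print(click_action(100, 40, 80, 30))
# -> {"action_type": "click", "x": 140.0, "y": 55.0}
```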

We have not tried modifying the accessibility tree to include centroids. We just use the bounding box as provided by the API.

This task involves modifying `WindowEvent.to_prompt_dict` to replace bounding boxes with computed centroids.
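A minimal sketch of the transformation, assuming bounding boxes appear in the window state as `left`/`top`/`width`/`height` keys (OpenAdapt's actual accessibility-tree schema may differ; this is not the real `to_prompt_dict` implementation):

```python
BBOX_KEYS = {"left", "top", "width", "height"}

def replace_bboxes_with_centroids(node):
    """Recursively walk a window-state structure, replacing each
    bounding box (assumed left/top/width/height keys) with its centroid."""
    if isinstance(node, dict):
        if BBOX_KEYS <= node.keys():
            # Keep all non-bbox fields, swap the box for its center point.
            rest = {
                k: replace_bboxes_with_centroids(v)
                for k, v in node.items()
                if k not in BBOX_KEYS
            }
            rest["centroid"] = {
                "x": node["left"] + node["width"] / 2,
                "y": node["top"] + node["height"] / 2,
            }
            return rest
        return {k: replace_bboxes_with_centroids(v) for k, v in node.items()}
    if isinstance(node, list):
        return [replace_bboxes_with_centroids(v) for v in node]
    return node

print(replace_bboxes_with_centroids(
    {"role": "button", "left": 0, "top": 0, "width": 10, "height": 20}
))
# -> {'role': 'button', 'centroid': {'x': 5.0, 'y': 10.0}}
```

This keeps the prompt dictionary smaller (two numbers per element instead of four) while still letting the model output the `{"action_type":"click","x":,"y":}` format from the paper.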
