When predicting an action that involves a target element, the model should output sufficient detail to
locate the target UI element: either an index or geometric information such as its bounding rectangle.
To reduce the complexity of parsing the model output, we prompt the LLM to output its action selection
in a predefined JSON format. In the case of an element click, for instance, the model outputs a prediction
in the following format: {"action_type": "click", "x": <x>, "y": <y>}, where the
target element is identified by its center coordinates. We found that LLMs work equally well when
predicting element centers or element indices, but since the former approach is also compatible with click
actions that are not restricted to specific UI elements, our implementation always outputs the center
of the target UI element. The same applies to all actions that take an element as input.
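As a minimal sketch of the format described above (the helper names are illustrative, not code from the paper), the centroid-based click action can be derived directly from an element's bounding rectangle:

```python
import json

def bbox_center(left: float, top: float, width: float, height: float) -> tuple[float, float]:
    """Return the center coordinates of a bounding rectangle."""
    return left + width / 2, top + height / 2

def click_action(left: float, top: float, width: float, height: float) -> str:
    """Serialize a click on the element's center in the predefined JSON format."""
    x, y = bbox_center(left, top, width, height)
    return json.dumps({"action_type": "click", "x": x, "y": y})

# e.g. a 100x40 button whose top-left corner is at (10, 20)
print(click_action(10, 20, 100, 40))  # {"action_type": "click", "x": 60.0, "y": 40.0}
```

Because the model emits coordinates rather than an element index, the same format also covers clicks on locations that do not correspond to any listed element.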
We have not tried modifying the accessibility tree to include centroids; we just use the bounding box as provided by the API. This task involves modifying WindowEvent.to_prompt_dict to replace bounding boxes with centroids.
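A sketch of what such a change might look like. This is an assumption, not the actual WindowEvent.to_prompt_dict implementation: the field names ("left", "top", "width", "height") and the recursive dict shape of the serialized accessibility tree are hypothetical.

```python
# Hypothetical sketch: before the window state is serialized into the prompt,
# swap each element's bounding-box fields for a single (x, y) centroid.
# Field names below are assumptions, not the actual schema.

def replace_bboxes_with_centroids(node: dict) -> dict:
    """Recursively replace bounding-box fields with the element's centroid."""
    out = {}
    for key, value in node.items():
        if isinstance(value, dict):
            out[key] = replace_bboxes_with_centroids(value)
        elif isinstance(value, list):
            out[key] = [
                replace_bboxes_with_centroids(v) if isinstance(v, dict) else v
                for v in value
            ]
        else:
            out[key] = value
    # If this node carries a bounding box, collapse it to its center point.
    if all(k in out for k in ("left", "top", "width", "height")):
        out["x"] = out.pop("left") + out.pop("width") / 2
        out["y"] = out.pop("top") + out.pop("height") / 2
    return out
```

Emitting two numbers instead of four also shortens the serialized tree, which reduces prompt token usage.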
Feature request

(The passage quoted above is from https://arxiv.org/pdf/2406.03679.)