OSU-NLP-Group / SeeAct

[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).
https://osu-nlp-group.github.io/SeeAct/
Other
571 stars 69 forks

som branch, bounding boxes on out-of-reach elements #41

Open mlin12321 opened 2 months ago

mlin12321 commented 2 months ago

On the som branch, some elements are present in the HTML, and therefore visible to SeeAct, but not visible in the browser window (for instance, product categories obscured by search results). Bounding boxes are still attached to these obscured elements, so SeeAct believes it can click on elements that are not actually reachable. This can cause timeouts, especially when SeeAct mistakes the bounding box of an obscured element for that of the element on top of it (e.g. a search result covers a category tab, and SeeAct takes the category tab's bounding box to be the search result's).

duz-sg commented 1 month ago

I did some searching and found this issue: https://github.com/microsoft/playwright/issues/9923. Indeed, it seems that when one element blocks another, there is no way for us to tell which of the two is clickable. From Playwright's point of view, they are both normal elements.
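To make the overlap problem concrete, here is a minimal sketch of one possible filter, not something taken from the som branch or from Playwright. It assumes each candidate element comes with a bounding box and a paint order (the `Box` class, its `z` field, and the `reachable` helper are all hypothetical names), and it keeps a box only if that element is the topmost one at the box's center. In a live page the equivalent hit test would presumably be the DOM's `document.elementFromPoint`, called from the browser side. That is an assumption and has not been tried against SeeAct.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical record for one annotated element."""
    name: str
    x: float
    y: float
    width: float
    height: float
    z: int  # assumed paint order: a higher value is drawn on top

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)

    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def topmost_at(boxes: List[Box], px: float, py: float) -> Optional[Box]:
    """Return the box painted on top at point (px, py), if any."""
    hits = [b for b in boxes if b.contains(px, py)]
    return max(hits, key=lambda b: b.z) if hits else None


def reachable(boxes: List[Box]) -> List[Box]:
    """Keep only boxes whose center point is not covered by another box."""
    return [b for b in boxes if topmost_at(boxes, *b.center()) is b]


# The scenario from the report: a search result covers a category tab.
boxes = [
    Box("category_tab", x=100, y=200, width=120, height=40, z=1),
    Box("search_result", x=80, y=180, width=300, height=100, z=5),
    Box("footer_link", x=0, y=600, width=50, height=20, z=1),
]
print([b.name for b in reachable(boxes)])  # ['search_result', 'footer_link']
```

Checking only the center point is a simplification: an element whose center is covered but whose edge is still exposed would be dropped, so a real filter might sample several points per box.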