OSU-NLP-Group / SeeAct

[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).
https://osu-nlp-group.github.io/SeeAct/

<div> as an interactive element #38

Closed: MeLoveCarbs closed this issue 2 months ago

MeLoveCarbs commented 2 months ago

Hi, first of all, thank you so much for developing this amazing web agent. After looking through the code, I have a question. It seems that the <div> tag is filtered out of the interactive elements, and occasionally some <div> buttons aren't pressed because they aren't retrieved as interactive elements. I understand that modern webpages contain a huge number of <div> tags; could this be the reason why they are left out?

boyugou commented 2 months ago

Overall, it's not easy to correctly and efficiently filter every interactive element. That's why we maintain a list of possible interactive element selectors.

I did try including every element before, but it turned out that GPT often chose to click on pure text elements.

If you have a better strategy for retrieving interactive elements, feel free to tell us or open a PR. Thanks for your understanding.
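
For readers following the thread, here is a minimal sketch of one possible middle ground between an allow-list of tags and "include every element": keep natively interactive tags, and admit a <div> (or any other tag) only when it carries an explicit hint such as a `role`, an `onclick` attribute, or a non-negative `tabindex`. This is not SeeAct's actual selector list or code. The names `INTERACTIVE_TAGS`, `INTERACTIVE_ROLES`, `looks_interactive`, and `find_candidates` are hypothetical, and the sketch parses static HTML with the standard library rather than querying a live page.

```python
from html.parser import HTMLParser

# Hypothetical allow-lists for illustration; NOT SeeAct's real selector list.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
INTERACTIVE_ROLES = {"button", "link", "checkbox", "menuitem", "tab", "option"}


def looks_interactive(tag, attrs):
    """Return True for natively interactive tags, or for any tag that
    carries an explicit hint of interactivity in its attributes."""
    if tag in INTERACTIVE_TAGS:
        return True
    if attrs.get("role") in INTERACTIVE_ROLES:
        return True
    if "onclick" in attrs:
        return True
    # A non-negative tabindex makes an element keyboard-focusable.
    try:
        return int(attrs.get("tabindex", "")) >= 0
    except (TypeError, ValueError):
        return False


class CandidateCollector(HTMLParser):
    """Collects (tag, attributes) pairs for elements that look interactive."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if looks_interactive(tag, attr_map):
            self.candidates.append((tag, attr_map))


def find_candidates(html):
    """Parse an HTML string and return the candidate interactive elements."""
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates
```

With this filter, a plain `<div>plain text</div>` is skipped while `<div role="button">Buy</div>` is kept, which may reduce clicks on pure text elements. It still misses <div> buttons whose click handlers are attached from JavaScript without any attribute hint; in a live browser, a computed style such as `cursor: pointer` could serve as an additional signal.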

MeLoveCarbs commented 2 months ago

That makes sense. I will help out and look into it too. Closing this issue for now.