OSU-NLP-Group / SeeAct

[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).
https://osu-nlp-group.github.io/SeeAct/

No browser event when setting experiment_split="element_attribute" in generate_prompt in seeact.py #14

Closed cc13qq closed 6 months ago

cc13qq commented 7 months ago

I tried to use element_attribute for prompt generation when running the SeeAct demo. The prompt generation is correct but no browser event is executed.

boyugou commented 7 months ago

Yeah, you are right. SeeAct v0.1.0 only supports the text-choice grounding strategy.

We did not include the other two grounding strategies in SeeAct v0.1.0 because they did not work well in our offline experiments. We will support them, along with OSS models, in later versions. (We are still busy with some other things, sorry about that.)
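Until then, one way to avoid the silent "prompt is generated but nothing happens" behavior would be to reject unsupported strategies up front. The snippet below is only a minimal sketch under assumptions, not the actual SeeAct code: `check_grounding_strategy`, `KNOWN_GROUNDING`, and `SUPPORTED_GROUNDING` are hypothetical names, and only the three strategy names and the `experiment_split` parameter come from this thread.

```python
# Hypothetical guard, not part of SeeAct: fail loudly on unsupported strategies.
KNOWN_GROUNDING = {"text_choice", "element_attribute", "image_annotation"}
# Per this thread, v0.1.0 only executes browser events for text_choice.
SUPPORTED_GROUNDING = {"text_choice"}


def check_grounding_strategy(experiment_split: str) -> str:
    """Return the strategy name if it can be executed, otherwise raise."""
    if experiment_split not in KNOWN_GROUNDING:
        raise ValueError(
            f"Unknown grounding strategy {experiment_split!r}; "
            f"expected one of {sorted(KNOWN_GROUNDING)}"
        )
    if experiment_split not in SUPPORTED_GROUNDING:
        raise NotImplementedError(
            f"Grounding strategy {experiment_split!r} can generate prompts, "
            "but no browser action is wired up for it yet"
        )
    return experiment_split
```

Calling such a check before prompt generation would turn the case reported above into an explicit `NotImplementedError` instead of a run with no browser events.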

By the way, SeeAct is easy to extend, so you can also try things like combining different grounding strategies and input information. (We will add some extensions like this in later updates.) We ran many ablation studies at the start of the project, such as combining text_choice and image_annotation.

I've seen people doing such things in recent papers. For example

We will definitely add more extensions beyond these to build better web agents. Stay tuned.