OSU-NLP-Group / SeeAct

[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).
https://osu-nlp-group.github.io/SeeAct/

possibly dumb action space ideas #18

Open keeganpoppen opened 4 months ago

keeganpoppen commented 4 months ago

hey-- first off, really, really cool project!

first question: are all actions keyed off a specific element in the candidate list, or is there some way to perform actions that have no target (because they are, e.g., keyboard shortcuts)? the motivating example here is gmail (yes, i know logging into gmail and opening & reading emails is a semi-off-brand usage, but bear with me...). for whatever reason, google, in their infinite wisdom, refuses to have individual emails in the inbox table register as interactable components in any easily-divinable way. i assume this is in no small part to stymie scraper agents such as my own, and it may have as much to do with user-agent / scraper detection as anything else (gmail at least acts like the chromium browser is some sort of screen-reader device, though as far as i can tell it's still completely useless to that end; i don't have a screen reader handy to verify). but even with keyboard shortcuts enabled, and the model being instructed to use the keyboard-shortcut interface (and properly attempting to), the actions never materialize, presumably because actions are things done to specific elements, not to the page in general? (a sketch of the kind of target-free action i mean is below.)
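to make that concrete, here's a minimal playwright sketch of a target-free keyboard action (my own illustration, not seeact's code; it assumes an already-authenticated gmail session with keyboard shortcuts enabled):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # assumes an already-authenticated gmail session (e.g., a persistent profile)
    page.goto("https://mail.google.com/")
    # gmail's "j" shortcut moves the selection down the inbox list; note that
    # page.keyboard.press() takes no element handle at all -- the keystroke
    # goes to whatever currently has focus on the page.
    page.keyboard.press("j")
    # "o" opens the currently selected conversation.
    page.keyboard.press("o")
    browser.close()
```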

i ask about this because i did think of a cheeky way to get around this issue, a way to collect essentially every user-interactable element on a website whether the website's creator wants you to or not: cursor: pointer. yes, you would have to run a js snippet in the browser to call getComputedStyle() on every element, but it feels pleasingly subversive in that it duck-types what an interactable element is: one whose styling tells you that you can click on it! no amount of ridiculous div-ing and javascript-ing and fake-link-ing and background-image-ing can really get around that. i think you could treat this list as strictly additive and de-dupe it against the existing candidate list: if we already know something is a link (anchor tag), we don't also need to throw it in the "clickable" bucket. but it would really open up opportunities for telling the system to "click on the subtitle for the second email in the list" or whatever, especially with tree-of-thoughts-y staging and/or other prompting techniques (something like the sketch below). ANYWAY, do you think this idea makes any sense? i realize that architecturally you currently only parse the xml, but that feels a bit fraught in its own right compared to evaluating things in the live browser, given that you're already running one.
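a rough sketch of what i mean, again via playwright's page.evaluate (the filtering and field names here are my own guesses at what would be useful, not anything seeact does today):

```python
# collect every element whose computed style advertises clickability
clickables = page.evaluate(
    """
    () => Array.from(document.querySelectorAll('*'))
        .filter(el => getComputedStyle(el).cursor === 'pointer')
        // drop elements the existing tag-based list already covers,
        // per the "strictly additive" idea above
        .filter(el => !['A', 'BUTTON', 'INPUT', 'SELECT', 'TEXTAREA']
            .includes(el.tagName))
        .map(el => ({
            tag: el.tagName,
            text: (el.innerText || '').slice(0, 80),
            rect: el.getBoundingClientRect().toJSON(),
        }))
    """
)
# caveat: cursor is an inherited property, so children of a clickable parent
# match too; you'd still want to de-dupe nested hits against their ancestors.
```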

i have also been playing around with ways to "multi-modal-ize" this project a bit, since gpt-4v and brethren don't seem to be as good at general reasoning as their not-*v counterparts (which intuitively makes sense), but the process of accomplishing this seems a bit fraught. i feel like you might be able to have seeact do some sort of summarization of the current screen (possibly in comparison to the previous screen as well) and send that to a separate agent that decides which action to take next. the catch is that for this to have any chance of working, it seems like you'd need a more detailed page segmentation / labeling step, and that problem is at least as difficult as the one i'm trying to use it to solve xD... turtles all the way down. (a very rough sketch of the two-model split is below.)

if you have any advice on how this might work, or on why the dream of a much wider action space (click anywhere, press any key, etc.) doesn't actually work in practice (at the very least it might be extremely, extremely slow), i'd love to hear it. it feels like if one were to somehow combine a model that describes the semantics of certain classes of webpages and how they tend to work (tables are sometimes clickable, settings is usually in the top right, email programs have tables of emails called "inboxes", ...) + image segmentation + vision + actuation a la Playwright + reasoning + ... magic?, maybe all the added complexity would eventually yield some multiplicatively useful result?
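for what it's worth, the split i'm imagining looks roughly like this (model names and prompts are placeholders; this is a sketch of the idea, not working seeact code):

```python
import base64
from openai import OpenAI

client = OpenAI()

def summarize_screen(png_bytes: bytes, prev_summary: str | None = None) -> str:
    """ask the vision model only to describe the screen, not to pick actions."""
    b64 = base64.b64encode(png_bytes).decode()
    prompt = "describe this page: its layout, interactable regions, and state."
    if prev_summary:
        prompt += f" for comparison, the previous screen was: {prev_summary}"
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def choose_action(task: str, screen_summary: str) -> str:
    """hand the textual summary to a (stronger) text-only reasoner."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder: whatever non-*v model reasons best
        messages=[{
            "role": "user",
            "content": f"task: {task}\ncurrent screen: {screen_summary}\n"
                       "what single action should be taken next?",
        }],
    )
    return resp.choices[0].message.content
```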