OpenAdaptAI / OpenAdapt

Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
https://www.OpenAdapt.AI
MIT License
883 stars · 115 forks

Implement vanilla baseline #700

Closed · abrichr closed this 4 months ago

abrichr commented 4 months ago

Feature request

And I’m guessing you’ve tried the vanilla baseline: Send in a series of screenshots to GPT-4 and then ask GPT-4 to describe what happened. Then give it the sequence of actions (in concrete coordinates and keyboard inputs), as well as your proposed modification in natural language. Ask it to output the new action sequence --@LunjunZhang
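
For concreteness, here is a minimal sketch of what that vanilla baseline could look like, assuming the OpenAI Python client. The helper names, prompt wording, and model choice are illustrative only, not OpenAdapt's actual API:

```python
import base64
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Base64-encode a screenshot so it can be sent as an image_url part."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def vanilla_baseline(screenshot_paths, actions, modification, model="gpt-4o"):
    """Send screenshots + recorded actions + a natural-language modification,
    and ask the model to output a new action sequence."""
    image_parts = [
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(p)}"},
        }
        for p in screenshot_paths
    ]

    # 1) Ask the model to describe what happened in the recording.
    describe = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": "Describe what happened in this recording."}]
            + image_parts,
        }],
    )
    description = describe.choices[0].message.content

    # 2) Give it the concrete actions plus the proposed modification,
    #    and ask for the new action sequence.
    prompt = (
        f"Recording description:\n{description}\n\n"
        f"Recorded actions (coordinates and keyboard input):\n{json.dumps(actions, indent=2)}\n\n"
        f"Proposed modification:\n{modification}\n\n"
        "Output the new action sequence as a JSON list of actions."
    )
    revise = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": prompt}] + image_parts,
        }],
    )
    return revise.choices[0].message.content
```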

Motivation

I’m just making sure that before we delve into improving segmentation and constructing graphs (those tasks can be deep rabbit holes), we have a baseline set up --@LunjunZhang

abrichr commented 4 months ago

We can also use Gemini.

Edit: We have not attempted this since large video context windows became available.
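
If we do try Gemini, a rough sketch with the google-generativeai Python SDK might look like the following; the model name, prompt wording, and helper shape are assumptions, not something we have run:

```python
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumes a valid key


def gemini_baseline(screenshot_paths, actions, modification):
    """Same vanilla baseline, but leaning on Gemini's large context window
    to pass many frames in a single request."""
    model = genai.GenerativeModel("gemini-1.5-pro")
    frames = [Image.open(p) for p in screenshot_paths]
    prompt = (
        "Describe what happened in these screenshots. Then, given the recorded "
        f"actions {actions} and this proposed modification: {modification}, "
        "output the new action sequence as a JSON list."
    )
    response = model.generate_content([prompt, *frames])
    return response.text
```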

abrichr commented 4 months ago

I feel that it’d be good to:

  1. have one baseline that is almost entirely about prompting a multi-modal API to do the task. This might get a 0% success rate, but having this script will let us evaluate any new model that comes out. (The point is that if AGI or GPT-6 happens, this script should suddenly be able to do the work.)
  2. Then add additional techniques like Set-of-Mark prompting (if vision is the bottleneck); this is still almost like 1, with no explicit structure added. Maybe this doesn't work at all either (see the sketch below).
  3. Then we diagnose the main bottleneck (is it vision? planning? RAG? outcome determination?) and add more structure strategically --@LunjunZhang
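
A rough sketch of the Set-of-Mark step in 2: overlay numbered marks on candidate UI regions before prompting, so the model can answer in terms of mark indices instead of raw coordinates. Where the regions come from (a segmentation model, the accessibility tree, etc.) is deliberately left open here; this is a sketch, not a proposed implementation:

```python
from PIL import Image, ImageDraw, ImageFont


def draw_set_of_marks(screenshot_path, regions, out_path="marked.png"):
    """Draw a numbered label on each candidate region (Set-of-Mark prompting).

    `regions` is a list of (left, top, right, bottom) boxes; their source
    (segmentation, accessibility tree, detector) is a placeholder here.
    """
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()

    for idx, (left, top, right, bottom) in enumerate(regions, start=1):
        draw.rectangle((left, top, right, bottom), outline="red", width=2)
        draw.text((left + 2, top + 2), str(idx), fill="red", font=font)

    image.save(out_path)
    return out_path
```

The marked image would then go through the same vanilla-baseline prompt as in 1, asking the model to respond with mark numbers ("click mark 3") rather than pixel coordinates, which we can map back to concrete actions ourselves.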