[Feature Request] Support desktop agents

jmsdao commented 3 months ago

Executive summary

Inspect currently does not support desktop agents.

1) Many threat-model-relevant tasks require a GUI OS to complete
2) Benchmarks for desktop agents already exist
3) Desktop agents are a realistic threat model
4) Desktop agents require a virtual machine (e.g. VirtualBox)

Quick example of desktop agents: see the video + graphical abstract of OSWorld.

Main arguments

1) Many relevant tasks require a GUI OS

You can't pass many forms of CAPTCHA without a GUI
Many websites block headless browsing, e.g. I think you can't create a Facebook account from a headless browser
Tasks like graphic design require a GUI (relevant for building an online storefront and for certain freelancing jobs)
Many desktop apps do not have a CLI version

2) Benchmarks for desktop agents already exist

OSWorld benchmarked GPT, Gemini and Claude models against 369 open-ended computer tasks:
- The best model scaffold tested achieved 12.24% completion on the benchmark
- Example of a successfully completed task instruction: "I downloaded an episode of Friends to practice listening, but I don't know how to remove the subtitles. Please help me remove the subtitles from the video and export it as "subtitles.srt" and store it in the same directory as the video." Taken from Fig 9 of the paper
CRAB benchmarked GPT, Gemini and Claude models against 100 cross-environment tasks (Ubuntu desktop and Android smartphone):
- The best model scaffold tested achieved 14% completion on the benchmark
- Example of a successfully completed task instruction: "Please open the X app on my phone, search for CAMEL-AI.org, check the latest post, summarize them, and then send the summary to Tianqi Xu on Slack from my PC." Taken from a demo video on the CRAB website

3) Desktop agents are a realistic threat model

Desktop agents will plausibly see faster adoption, as the median consumer will find them much more intuitive and useful than, say, pure CLI agents
The "observation space" of a desktop agent is a superset of a pure CLI or browser agent
Big companies are working on desktop agents (or similar):
- Salesforce Research was involved with OSWorld
- Microsoft has UFO: A UI-Focused Agent for Windows OS Interaction

4) Desktop agents require a virtual machine

OSWorld's codebase currently supports local VMs via VirtualBox and VMware, and cloud VMs via AWS, GCP and Azure
CRAB's codebase currently supports local VMs via Linux KVM, and cloud VMs via GCP. The Android environment is provided by Android Studio

What's missing from Inspect

Virtual machines as a sandbox option
Gracefully waiting for a sandbox (whether Docker or VM) to be ready before beginning the eval
Built-in solvers, tools, agents and scorers for desktop agent setups
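
To make the "gracefully waiting for a sandbox" item concrete, here is a minimal sketch of a readiness poll with a timeout and capped exponential backoff. It is not Inspect's API: `wait_until_ready`, its parameters, and the `is_ready` callback are all hypothetical names chosen for illustration, and the caller is assumed to supply whatever check suits their sandbox (for a VM, perhaps a probe of a remote-desktop or control port).

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    max_interval_s: float = 30.0,
) -> float:
    """Poll is_ready() until it returns True and return the seconds waited.

    Hypothetical helper, not part of Inspect. Raises TimeoutError if the
    sandbox is still not ready once timeout_s has elapsed.
    """
    start = time.monotonic()
    delay = interval_s
    while True:
        if is_ready():
            return time.monotonic() - start
        remaining = timeout_s - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError(f"sandbox not ready after {timeout_s}s")
        # Never sleep past the deadline; double the delay up to the cap.
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_interval_s)
```

An eval runner could call something like this once per sandbox before dispatching the first sample, so that a slow-booting VM shows up as a clear timeout rather than as a failure inside the agent's first action.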

Next steps

My team and I could potentially implement this feature if there's interest and excitement to do so! We’re excited to see evals go in the direction of desktop environments as we think this is highly consequential for the field of AI safety.

jjallaire commented 1 month ago

@jmsdao As we've discussed, we aren't going to make this a full-on Inspect feature at this point. That said, I know you are working on an extension, and it might be nice to post a link to that work here for those reading this issue.

jmsdao commented 1 month ago

I'll be sure to do so once it's ready!

UKGovernmentBEIS / inspect_ai