OthersideAI / self-operating-computer

A framework to enable multimodal models to operate a computer.
https://www.hyperwriteai.com/self-operating-computer
MIT License
8.68k stars 1.15k forks source link

[FEATURE] Decouple screenshot capture and action prediction #156

Open mjspeck opened 7 months ago

mjspeck commented 7 months ago

Is your feature request related to a problem? Please describe.

All the code here is tightly coupled, making it difficult to modify and test. Mainly, it's not easy to separate out the screen capture from the LLM call.

Describe the solution you'd like

My solution would be to refactor the code so that there is an Agent class that can call at get_next_action method which takes as input both a screenshot and the prompt/messages. This conforms to the basic MDP model that all RL follow and would allow for easier testing by, for example, allowing a developer to pass a dataset of screenshot, prompt pairs and seeing whether the agent generates the correct next action.

This would require a fair bit of modification to the def main and def operate functions at the very least.