Is your feature request related to a problem? Please describe.
All the code here is tightly coupled, making it difficult to modify and test. Mainly, it's not easy to separate out the screen capture from the LLM call.
Describe the solution you'd like
My solution would be to refactor the code so that there is an `Agent` class with a `get_next_action` method that takes as input both a screenshot and the prompt/messages. This conforms to the basic MDP model that all RL agents follow, and it would allow for easier testing by, for example, letting a developer pass in a dataset of (screenshot, prompt) pairs and check whether the agent generates the correct next action.
This would require a fair bit of modification to the `def main` and `def operate` functions at the very least.
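A minimal sketch of what this could look like (all names here, including `Agent`, `get_next_action`, and the `ScriptedAgent` test double, are illustrative suggestions, not existing code in this repo):

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Isolates the LLM call from screen capture so each can be tested alone."""
    messages: list = field(default_factory=list)

    def call_llm(self, screenshot: bytes, messages: list) -> str:
        # In the real refactor this would perform the API call;
        # tests override it instead of hitting a live model.
        raise NotImplementedError

    def get_next_action(self, screenshot: bytes, prompt: str) -> str:
        # The MDP step: observation (screenshot) + prompt in, action out.
        self.messages.append({"role": "user", "content": prompt})
        action = self.call_llm(screenshot, self.messages)
        self.messages.append({"role": "assistant", "content": action})
        return action

class ScriptedAgent(Agent):
    """Test double that returns canned actions instead of calling an LLM."""
    def __init__(self, actions):
        super().__init__()
        self._actions = iter(actions)

    def call_llm(self, screenshot, messages):
        return next(self._actions)

# Evaluating an agent offline on (screenshot, prompt, expected_action) triples:
dataset = [
    (b"fake-png-1", "open the browser", "CLICK 120,40"),
    (b"fake-png-2", "type the URL", "TYPE example.com"),
]
agent = ScriptedAgent([a for _, _, a in dataset])
correct = sum(agent.get_next_action(s, p) == a for s, p, a in dataset)
print(f"{correct}/{len(dataset)} actions correct")
```

With this shape, `main` and `operate` would own screen capture and action execution, while the `Agent` owns only the decision step, so the two concerns can evolve and be tested independently.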