All-Hands-AI / OpenHands

🙌 OpenHands: Code Less, Make More
https://all-hands.dev
MIT License

[Bug]: Does the Browsing Agent need run_ipython action? #4355

Open enyst opened 2 weeks ago

enyst commented 2 weeks ago

Is there an existing issue for the same bug?

Describe the bug and reproduction steps

Running the Browsing Agent with Deepseek, I got a syntax error, and it turns out that what the LLM was trying to do is ... not necessarily "wrong", but we're not ready for it. Maybe we can address this by teaching our browsing agent the run_ipython action? Or can we just send it like this to browsergym?

LLM response:

To summarize the quality and cost information of various language models from the blog article, I need to extract the relevant details and add them to the file. I'll start by summarizing the key points from the article.

```python
send_msg_to_user('Summarizing the quality and cost information of various language models from the blog article...')

# Summarize the key points
summary = """
Key takeaways from the evaluation of LLMs as coding agents on SWE-Bench:

1. Clauder 3.5 Sonnet is the best by a fair amount, achieving a 27% resolve rate with the default agent in OpenHands.
2. GPT-4o lags behind, and o1-mini performed somewhat worse than GPT-4o. Analysis suggests o1 was sometimes "overthinking" tasks, performing extra environment configuration.
3. The strongest open models were Llama 3.1 405B and deepseek-v2.5, which performed reasonably well, even besting some closed models.
4. The price/accuracy tradeoff is an important consideration. A chart of this tradeoff is provided in the article.
"""

# Add the summary to the file
send_msg_to_user(summary)
```

The raw response also included the opening "```python" fence.

Result with our current implementation, which doesn't expect variables in send_msg_to_user:

```
Traceback (most recent call last):
....
  File "/Users/enyst/repos/odie/openhands/agenthub/browsing_agent/response_parser.py", line 42, in parse_action
    return self.default_parser.parse(action_str)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/enyst/repos/odie/openhands/agenthub/browsing_agent/response_parser.py", line 108, in parse
    msg_content = args[0].value
                  ^^^^^^^^^^^^^
AttributeError: 'Name' object has no attribute 'value'
```
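The failure mode is that the parser assumes the argument to `send_msg_to_user` is a string literal (an `ast.Constant`), but here it is a variable (an `ast.Name`), which has no `.value`. For illustration, here is a hedged, standalone sketch (not the actual `response_parser.py` code) of a fallback: take literals directly, and resolve variable arguments against simple string assignments earlier in the same snippet.

```python
import ast


def extract_messages(code: str) -> list[str]:
    """Collect the payloads of send_msg_to_user(...) calls.

    Hypothetical fallback, not the real parser: string-literal arguments
    (ast.Constant) are taken directly; variable arguments (ast.Name) are
    resolved against simple `name = "..."` assignments made earlier in the
    same flat snippet, instead of assuming `.value` exists on any argument.
    """
    bindings: dict[str, str] = {}  # variable name -> literal string value
    messages: list[str] = []
    # ast.walk visits top-level statements before their children, so for a
    # flat snippet like the one above, assignments are recorded before the
    # calls that reference them are processed.
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Assign):
            target = node.targets[0]
            if isinstance(target, ast.Name) and isinstance(node.value, ast.Constant):
                bindings[target.id] = node.value.value
        elif isinstance(node, ast.Call):
            if (
                isinstance(node.func, ast.Name)
                and node.func.id == 'send_msg_to_user'
                and node.args
            ):
                arg = node.args[0]
                if isinstance(arg, ast.Constant):
                    messages.append(arg.value)
                elif isinstance(arg, ast.Name) and arg.id in bindings:
                    messages.append(bindings[arg.id])
    return messages
```

Run against the LLM response above, this would yield both the literal "Summarizing..." message and the resolved contents of `summary`.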

OpenHands Installation

Development workflow

OpenHands Version

No response

Operating System

MacOS

Logs, Errors, Screenshots, and Additional Context

No response

enyst commented 2 weeks ago

Curiously, the LLM seems to be actually right: it has been told in the prompt that its action space includes:

```
send_msg_to_user(text: str)
    Examples:
        send_msg_to_user('Based on the results of my search, the city was built in 1751.')
```

and then:

Multiple actions can be provided at once, but will be executed sequentially without any feedback from the page.
More than 2-3 actions usually leads to failure or unexpected behavior. Example:
```
fill('a12', 'example with "quotes"')
click('a51')
click('48', button='middle', modifiers=['Shift'])
```

So... it seems fair that the LLM may decide to send two `send_msg_to_user` calls? 😅
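For what it's worth, splitting such a response into its top-level action calls is straightforward; a hedged sketch (not the actual parser) that would naturally accept two `send_msg_to_user` calls in sequence:

```python
import ast


def split_actions(response: str) -> list[str]:
    """Split an LLM code response into individual top-level action calls,
    so they can be executed sequentially as the prompt permits.

    Illustrative sketch only: assumes the response parses as Python and
    that actions appear as bare top-level calls like click('a51').
    """
    calls: list[str] = []
    for stmt in ast.parse(response).body:
        # Bare expression statements whose value is a call, e.g. click('a51')
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            calls.append(ast.unparse(stmt.value))
    return calls
```

Each returned string could then be handed to browsergym one at a time, preserving the "executed sequentially" semantics the prompt describes.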

enyst commented 2 weeks ago

Example of an interesting response, just for thought:

To summarize the quality and cost information of various language models, I need to extract the relevant details from the text and the chart. I will start by extracting the text information and then focus on the chart.

```python
# Extract text information
text_info = [
    "Clauder 3.5 Sonnet is the best by a fair amount, achieving a 27% resolve rate with the default agent in OpenHands.",
    "GPT-4o lags behind, and o1-mini actually performed somewhat worse than GPT-4o. We went in and analyzed the results a little, and briefly it seemed like o1 was sometimes 'overthinking' things, performing extra environment configuration tasks when it could just go ahead and finish the task.",
    "Finally, the strongest open models were Llama 3.1 405 B and deepseek-v2.5, and they performed reasonably, even besting some of the closed models."
]

# Extract chart information
# Since the chart is an image, I will need to click on it to get more details
click('112'
```

Note: The task was to summarize the quality and cost of LLMs. The agent wants both the text, which it summarized in text_info, and then, at the next step, it wants to see the chart. But the way the browsing agent currently works, it will lose text_info at the next step, because it carries nothing forward from step to step except the commands (e.g. only click('112')). Cc: @ketan1741

Thus in my test, this led to a lot of time spent trying and losing context, over and over, and the agent ended up stuck in a loop.

ketan1741 commented 2 weeks ago

> But the way the browsing agent currently works, it will lose text_info at the next step, because it carries nothing forward from step to step except the commands (e.g. only click('112')).

Yes, that's exactly how it works right now. We should look into ways to improve it. We could include at least the previous one or two steps (observations, thoughts + actions) in the prompt for the next step.
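A minimal sketch of that idea (all names hypothetical, not the actual OpenHands API): keep a small sliding window of recent steps and render it into the next prompt, so intermediate results like text_info survive across steps.

```python
from collections import deque


class StepHistory:
    """Sliding window of recent agent steps, to be prepended to the next
    prompt. Illustrative sketch only; the real agent state differs."""

    def __init__(self, max_steps: int = 2):
        # deque with maxlen silently drops the oldest step when full
        self.steps = deque(maxlen=max_steps)

    def record(self, thought: str, action: str, observation: str) -> None:
        self.steps.append((thought, action, observation))

    def as_prompt(self) -> str:
        """Render the retained steps as context for the next LLM call."""
        chunks = []
        for i, (thought, action, observation) in enumerate(self.steps, 1):
            chunks.append(
                f"Previous step {i}:\n"
                f"Thought: {thought}\n"
                f"Action: {action}\n"
                f"Observation: {observation}"
            )
        return "\n\n".join(chunks)
```

With `max_steps=2`, the variable assignments and reasoning from the step that built text_info would still be visible when the agent issues the follow-up click.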