Code to evaluate WebArena

InternLM / Agent-FLAN

[ACL2024 Findings] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

https://internlm.github.io/Agent-FLAN/

Apache License 2.0


Code to evaluate WebArena #13

Open shuyanzhou opened 3 months ago

shuyanzhou commented 3 months ago

Hi,

Thanks for the great work. I am wondering if you have plans to release the code to run WebArena?

zehuichen123 commented 3 months ago

Hi, we directly adopt the evaluation code from AgentTuning :)

shuyanzhou commented 3 months ago

Thank you for the response. I am also wondering whether you perform multi-turn prompting to get one action?

zehuichen123 commented 3 months ago

During inference, we directly use JSON-format output, or whatever format the system prompt requests. The chat-format data is used for training only.
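
For illustration only (this is not the authors' evaluation code), a single-turn setup like the one described could pull one action out of one model reply roughly as follows. The key names `action_type` and `element_id` and the helper `extract_json_action` are hypothetical; the actual schema is whatever the system prompt asks for.

```python
import json
import re


def extract_json_action(model_output: str) -> dict:
    """Return the JSON object embedded in a single model reply.

    Assumes the reply contains exactly one JSON object, possibly
    surrounded by free text. The schema is not fixed here.
    """
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", model_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


# Hypothetical reply; the keys are made up for this example.
reply = 'Sure. {"action_type": "click", "element_id": 1582}'
action = extract_json_action(reply)
print(action["action_type"], action["element_id"])
```

One reply yields one action here, so no follow-up prompting turn is needed to obtain it.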

shuyanzhou commented 3 months ago

Thank you very much for the info. We attempted to reproduce the result with the default prompt, but the success rate (SR) is only 0.61%. Would you mind sharing the recorded trajectories so that we can check what may be going wrong on our end?

wang-qiuchen commented 3 months ago

Hello, our project was evaluated in January 2024, so you might need to switch to an earlier official version: https://github.com/web-arena-x/webarena/commit/14f91d90e60d79e829396d6429fc5e24de6c3fda. The website Docker images we used were downloaded from the official address: https://github.com/web-arena-x/webarena/tree/main/environment_docker#wikipedia-website. Unfortunately, our task machines were recycled after the project was completed, so the log files were lost.
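
Pinning a local checkout to that commit could look like the sketch below. Only the repository URL and the commit hash come from the comment above; the function names, the `webarena` destination directory, and the dry-run default are illustrative assumptions.

```python
import subprocess

WEBARENA_REPO = "https://github.com/web-arena-x/webarena"
PINNED_COMMIT = "14f91d90e60d79e829396d6429fc5e24de6c3fda"


def checkout_commands(dest: str = "webarena") -> list:
    """Build the git commands that clone the repo and pin it to the commit."""
    return [
        ["git", "clone", WEBARENA_REPO, dest],
        # "-C dest" runs the checkout inside the freshly cloned directory.
        ["git", "-C", dest, "checkout", PINNED_COMMIT],
    ]


def pin_webarena(dest: str = "webarena", dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping on failure."""
    for cmd in checkout_commands(dest):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Dry run: only prints the two git commands, nothing is downloaded.
pin_webarena()
```

Calling `pin_webarena(dry_run=False)` would actually run the clone and checkout, which leaves the working tree in a detached-HEAD state at the pinned commit.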