Open csmith49 opened 5 days ago
This PR is currently blocked by #4848, reproducible as follows:
1. Install `mlebench` (instructions here) and build the `mlebench-env` image (instructions here).
2. Run `mlebench prepare -c spaceship-titanic`.
3. Extend the `mlebench-env` image with OpenHands by navigating to `evaluation/mle-bench` and running `docker build --platform=linux/amd64 -t openhands agents/openhands/`.
4. Run `python run_infer.py --agent-id openhands --competition-set experiments/splits/spaceship-titanic.txt`.

Checking `agent.log` for the run shows:
18:53:07 - openhands:INFO: runtime_build.py:176 - Building image: ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr
18:54:13 - openhands:ERROR: docker.py:122 - Image build failed:
Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
18:54:13 - openhands:ERROR: docker.py:123 - Command output:
Runtime created.
================ DOCKER BUILD STARTED ================
ERROR:root: File "/home/agent/start.py", line 188, in <module>
asyncio.run(run(instructions))
File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/agent/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/agent/start.py", line 70, in run
await runtime.connect()
File "/home/agent/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 225, in connect
self.runtime_container_image = build_runtime_image(
^^^^^^^^^^^^^^^^^^^^
File "/home/agent/openhands/runtime/utils/runtime_build.py", line 134, in build_runtime_image
result = build_runtime_image_in_folder(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/agent/openhands/runtime/utils/runtime_build.py", line 225, in build_runtime_image_in_folder
_build_sandbox_image(
File "/home/agent/openhands/runtime/utils/runtime_build.py", line 352, in _build_sandbox_image
image_name = runtime_builder.build(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/agent/openhands/runtime/builder/docker.py", line 114, in build
raise subprocess.CalledProcessError(
ERROR:root:<class 'subprocess.CalledProcessError'>: Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
ERROR conda.cli.main_run:execute(125): `conda run python start.py --agent CodeActAgent --model gpt-4o --max_time_in_hours 24 --max_steps 500 --shm_size 100G` failed. (See above for error)
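
When triaging a failure like the one above, it can help to reduce `agent.log` to its structured `openhands` records and read the `ERROR` entries first. The following is a minimal sketch, assuming records follow the `HH:MM:SS - logger:LEVEL: file:line - message` shape seen in the excerpt and that the logger may wrap parts of a line in ANSI colour codes. The names `parse_log` and `errors` are hypothetical helpers for illustration, not part of OpenHands or MLE-bench.

```python
import re

# ANSI colour sequences such as ESC[92m. The ESC byte is optional here because
# copied or extracted logs sometimes lose it (an assumption about the input).
ANSI_RE = re.compile(r"\x1b?\[\d+(?:;\d+)*m")

# Shape assumed from the excerpt: "18:53:07 - openhands:INFO: runtime_build.py:176 - msg"
LOG_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) - (?P<logger>[\w.]+):(?P<level>[A-Z]+): "
    r"(?P<source>[\w.]+):(?P<line>\d+) - (?P<message>.*)$"
)


def parse_log(text):
    """Return one dict per structured log record; other lines are skipped."""
    records = []
    for raw in text.splitlines():
        clean = ANSI_RE.sub("", raw)
        match = LOG_RE.match(clean)
        if match:
            record = match.groupdict()
            record["line"] = int(record["line"])
            records.append(record)
    return records


def errors(text):
    """Return only the records logged at ERROR level."""
    return [r for r in parse_log(text) if r["level"] == "ERROR"]


# Three header lines taken from the excerpt, with colour codes restored on two
# of them, plus two lines that are not structured records.
SAMPLE = (
    "\x1b[92m18:53:07 - openhands:INFO\x1b[0m: runtime_build.py:176 - Building image: "
    "ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr\n"
    "\x1b[92m18:54:13 - openhands:ERROR\x1b[0m: docker.py:122 - Image build failed:\n"
    "18:54:13 - openhands:ERROR: docker.py:123 - Command output:\n"
    "Runtime created.\n"
    "================ DOCKER BUILD STARTED ================\n"
)

if __name__ == "__main__":
    for rec in errors(SAMPLE):
        print(rec["time"], rec["source"], rec["line"], rec["message"])
```

Traceback lines and raw command output do not match the record pattern and are skipped, so this only narrows down where to look; the surrounding lines in `agent.log` still carry the detail.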
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.
The MLE-bench implementation provides:
`agent` definition format.

The goal of this PR is to re-use as much existing infrastructure as possible by providing a suitable OpenHands `agent` definition. However, only the scripts from 1. are exposed as a Python package, so we assume the tester has OpenAI's implementation installed elsewhere to manage the base image and test instances, and we re-implement some minor scaffolding around `agent` definitions to allow for benchmarking from this repo.

Link of any specific issues this addresses
#4328
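
Because the summary above assumes the tester already has OpenAI's MLE-bench implementation installed, a quick precondition check before starting a run may save a failed attempt. This is a minimal sketch: the CLI name `mlebench` comes from the reproduction steps, while the importable module name `mlebench` and the helper `check_mlebench` are assumptions made for illustration.

```python
import importlib.util
import shutil


def check_mlebench(cli_name="mlebench", module_name="mlebench"):
    """Report whether the CLI is on PATH and the package is importable.

    Both default names are assumptions; pass different ones if your
    installation exposes the tool under another name.
    """
    return {
        # shutil.which returns None when no executable of that name is on PATH.
        "cli": shutil.which(cli_name) is not None,
        # find_spec returns None for a top-level module that cannot be found.
        "module": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    status = check_mlebench()
    missing = [name for name, ok in status.items() if not ok]
    if missing:
        print("MLE-bench prerequisites not found:", ", ".join(missing))
    else:
        print("MLE-bench CLI and package are available.")
```

The check only confirms that the tool is present; it does not verify that the `mlebench-env` image has been built or that a competition has been prepared.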