End-user friendly description of the problem this fixes or functionality that this introduces

[ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.

The MLE-bench implementation provides:

1. A set of scripts to manage test instances, run benchmarks, and score results.
2. A base Docker image in which agents should be run.
3. An agent definition format.

The goal of this PR is to reuse as much existing infrastructure as possible by providing a suitable OpenHands agent definition. However, only the scripts (item 1) are exposed as a Python package. We therefore assume the tester has OpenAI's implementation installed elsewhere to manage the base image and test instances, and we re-implement some minor scaffolding around agent definitions so that benchmarks can be run from this repo.
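Because the harness relies on an mlebench installation that lives outside this repo, a quick precondition check can save a failed run. The following is a minimal sketch, not part of the PR: it assumes the package is importable as `mlebench` and that its CLI is on `PATH` under the same name (both inferred from the `mlebench prepare` command used below), and `check_mlebench_available` is a hypothetical helper name.

```python
from __future__ import annotations

import importlib.util
import shutil


def check_mlebench_available() -> dict[str, bool]:
    """Report whether the mlebench package and CLI can be found.

    Assumption: both the import name and the executable are called "mlebench".
    """
    return {
        # True if `import mlebench` would find a module in this environment.
        "package": importlib.util.find_spec("mlebench") is not None,
        # True if an executable named `mlebench` is on PATH.
        "cli": shutil.which("mlebench") is not None,
    }


if __name__ == "__main__":
    for name, found in check_mlebench_available().items():
        print(f"mlebench {name}: {'found' if found else 'missing'}")
```

The check only confirms that the package and CLI are discoverable; it does not verify that the mlebench-env image has been built.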

Link of any specific issues this addresses

#4328

This PR is currently blocked by #4848, reproducible as follows:

1. Install mlebench (instructions here) and build the mlebench-env image (instructions here).
2. Grab some data (instructions here) by running mlebench prepare -c spaceship-titanic.
3. Extend the mlebench-env image with OpenHands by navigating to evaluation/mle-bench and running docker build --platform=linux/amd64 -t openhands agents/openhands/.
4. Run python run_infer.py --agent-id openhands --competition-set experiments/splits/spaceship-titanic.txt.
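The command-line part of the steps above can be sketched as a small script. This is an illustration rather than something shipped in the PR: the commands are copied from the steps, but `reproduction_steps` and `run_steps` are hypothetical names, and the script assumes it is launched from the repository root, that mlebench and the mlebench-env image are already set up, and that `run_infer.py` is run from `evaluation/mle-bench` like the `docker build` step.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

COMPETITION = "spaceship-titanic"
# Assumption: paths are relative to the repository root.
EVAL_DIR = Path("evaluation/mle-bench")


def reproduction_steps(competition: str = COMPETITION) -> list[tuple[list[str], Path | None]]:
    """Return (command, working directory) pairs for the reproduction."""
    return [
        # Download and prepare the competition data.
        (["mlebench", "prepare", "-c", competition], None),
        # Extend the mlebench-env image with OpenHands.
        (["docker", "build", "--platform=linux/amd64", "-t", "openhands",
          "agents/openhands/"], EVAL_DIR),
        # Run the agent on the chosen competition set.
        (["python", "run_infer.py", "--agent-id", "openhands",
          "--competition-set", f"experiments/splits/{competition}.txt"], EVAL_DIR),
    ]


def run_steps(steps: list[tuple[list[str], Path | None]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd, cwd in steps:
        print(f"[{cwd or '.'}] {' '.join(cmd)}")
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    run_steps(reproduction_steps(), dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it safe to inspect before running anything that touches Docker.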

Checking agent.log for the run shows:

18:53:07 - openhands:INFO: runtime_build.py:176 - Building image: ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr
18:54:13 - openhands:ERROR: docker.py:122 - Image build failed:
Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
18:54:13 - openhands:ERROR: docker.py:123 - Command output:

Runtime created.
================ DOCKER BUILD STARTED ================
ERROR:root:  File "/home/agent/start.py", line 188, in <module>
    asyncio.run(run(instructions))
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/agent/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/agent/start.py", line 70, in run
    await runtime.connect()
  File "/home/agent/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 225, in connect
    self.runtime_container_image = build_runtime_image(
                                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 134, in build_runtime_image
    result = build_runtime_image_in_folder(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 225, in build_runtime_image_in_folder
    _build_sandbox_image(
  File "/home/agent/openhands/runtime/utils/runtime_build.py", line 352, in _build_sandbox_image
    image_name = runtime_builder.build(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/agent/openhands/runtime/builder/docker.py", line 114, in build
    raise subprocess.CalledProcessError(

ERROR:root:<class 'subprocess.CalledProcessError'>: Command '['docker', 'buildx', 'build', '--progress=plain', '--build-arg=OPENHANDS_RUNTIME_VERSION=0.14.1', '--build-arg=OPENHANDS_RUNTIME_BUILD_TIME=2024-11-21T18:53:09.520454', '--tag=ghcr.io/all-hands-ai/runtime:oh_v0.14.1_d66eiz7humvbba2v_gph61dpe1atpybgr', '--load', '/tmp/tmpmsoitid7']' returned non-zero exit status 1.
ERROR conda.cli.main_run:execute(125): `conda run python start.py --agent CodeActAgent --model gpt-4o --max_time_in_hours 24 --max_steps 500 --shm_size 100G` failed. (See above for error)
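The log prints nothing after "Command output:", so the underlying Docker error is not visible here. One way to dig further is to pull the failing command out of the log and rerun it by hand inside the agent container. The sketch below is a hypothetical helper, not part of OpenHands: it assumes the log line keeps the `Command '[...]' returned non-zero exit status N.` shape shown above, and the temporary build directory in the command may no longer exist by the time it is rerun.

```python
from __future__ import annotations

import ast
import re
import shlex

# Matches the message format seen in the log above; the list between the
# quotes is the Python repr of the argv passed to subprocess.
FAILED_CMD_RE = re.compile(
    r"Command '(\[.*?\])' returned non-zero exit status (\d+)"
)


def extract_failed_command(log_text: str) -> tuple[list[str], int] | None:
    """Return (argv, exit_status) for the first failed command, or None."""
    match = FAILED_CMD_RE.search(log_text)
    if match is None:
        return None
    # literal_eval safely turns the printed list back into a Python list.
    argv = ast.literal_eval(match.group(1))
    return argv, int(match.group(2))


if __name__ == "__main__":
    with open("agent.log", encoding="utf-8") as fh:
        result = extract_failed_command(fh.read())
    if result is not None:
        argv, status = result
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(f"exit status {status}: {shlex.join(argv)}")
```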

Feat/mle bench evaluation #5148