Right now the `evaluation/` directory structure is very flat, and it is hard to tell which subdirectories are utilities for implementing benchmarks or running basic tests for OpenHands (`utils`, `integration_tests`, `regression`, `static`), and which are actual benchmarks from the ML literature (everything else).
To make this clearer, we can move all benchmarks to live under the `evaluation/benchmarks/` directory. In addition, all other files related to evaluation (including documentation, GitHub workflows, etc.) will need to be checked and updated to stay consistent.
While we are at it, we can also add the benchmarks that are currently missing from the `evaluation/README.md` documentation.
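A minimal shell sketch of the proposed move, assuming it runs at the repository root. Benchmark names such as `swe_bench` are purely illustrative here, and in a real git checkout `git mv` should be used instead of `mv` so that file history is preserved:

```shell
#!/bin/sh
# Sketch only: move every non-utility subdirectory of evaluation/
# under evaluation/benchmarks/. The four utility folders stay put.
set -eu

reorganize_eval() {
  root="${1:-evaluation}"
  mkdir -p "$root/benchmarks"
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    case "$name" in
      utils|integration_tests|regression|static|benchmarks)
        ;;  # utility/test folders remain at the top level
      *)
        # In an actual checkout, prefer: git mv "$d" "$root/benchmarks/$name"
        mv "$d" "$root/benchmarks/$name"
        ;;
    esac
  done
}
```

After a move like this, any path that mentions a benchmark directory (docs, GitHub workflow files, scripts) would need a matching update, which is why the consistency pass above is part of the same change.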