We should define a suite of benchmarks, similar to those behind common leaderboards, whose weighted average tracks our definition of a good Solidity LLM.
The initial benchmark discussed was an LLM judge that compares the functionality implemented by the reference code from the test split with the code generated by the target LLM. Determining whether the LLM-generated code implements the specification in the prompt is challenging: it is not as simple as grading multiple-choice answers.
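A minimal sketch of what such a judge call could look like, assuming an OpenAI-compatible chat endpoint; the judge model, prompt wording, and 1-5 scale are placeholders rather than a settled design:

```python
# Hypothetical LLM-judge comparison; nothing here is final.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key

JUDGE_PROMPT = """You are reviewing Solidity code.
Specification:
{spec}

Reference implementation (from the test split):
{reference}

Candidate implementation (generated by the model under evaluation):
{candidate}

Does the candidate implement the same functionality as the reference,
according to the specification? Answer with a single integer from 1 (no)
to 5 (fully equivalent), followed by a one-sentence justification."""


def judge_equivalence(spec: str, reference: str, candidate: str,
                      judge_model: str = "gpt-4o") -> int:
    """Ask a judge LLM to score functional equivalence on a 1-5 scale."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            spec=spec, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    return int(text.split()[0])  # naive parse of the leading score
```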
Secondary benchmarks that could be used to define the leaderboard include:
Some of these secondary benchmarks are not strong signals on their own, but together they can define a good score. For instance, code length is a weak estimator by itself, but it could break ties between LLMs with similar performance on the other benchmarks, since the training target is to generate code as close as possible to the original given the prompt.
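A rough sketch of how such a weighted aggregation could work; the benchmark names and weights below are illustrative only, with the low weight on length similarity reflecting its tie-breaker role:

```python
# Illustrative weights, not a proposal.
WEIGHTS = {
    "llm_judge": 0.6,         # functional equivalence vs. reference code
    "compiles": 0.3,          # fraction of generations that compile
    "length_similarity": 0.1, # closeness to the reference code length
}


def leaderboard_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores normalised to [0, 1]."""
    total_weight = sum(WEIGHTS[name] for name in scores if name in WEIGHTS)
    return sum(WEIGHTS[name] * value
               for name, value in scores.items()
               if name in WEIGHTS) / total_weight


# Two models that tie on the primary benchmarks are separated by the
# length-similarity component.
model_a = {"llm_judge": 0.80, "compiles": 0.95, "length_similarity": 0.90}
model_b = {"llm_judge": 0.80, "compiles": 0.95, "length_similarity": 0.60}
print(leaderboard_score(model_a), leaderboard_score(model_b))
```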
After talking with @not-lain, we agree it's best to stay as close as possible to current standards. To that end, we can port standard benchmarks such as HumanEval and ClassEval to Solidity (or add Solidity support to them). Afterwards, we may add our own evaluations if necessary.
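For a HumanEval-style port, the metric would likely stay the standard unbiased pass@k estimator, where a "pass" for Solidity would mean the generated contract compiles and its unit tests pass (the compile/test harness itself is not shown here):

```python
# Standard pass@k estimator from the HumanEval paper, applied to
# Solidity generations; n samples per task, c of them correct.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 Solidity generations per task, 3 compile and pass tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.30
```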
Context:
Evaluating the performance of EVMind LLMs automatically is not straightforward for our use case. Analyze how other leaderboards evaluate LLMs and whether those methods make sense for our use case.