Analyze different ways to evaluate our LLMs autonomously #2971

Closed brunneis closed 3 months ago

brunneis commented 3 months ago

Evaluating the performance of EVMind LLMs autonomously is not straightforward for our use case. Analyze how other leaderboards evaluate LLMs and whether those methods make sense for us.

brunneis commented 3 months ago

We should define an array of benchmarks, similar to common leaderboards, whose weighted average could be correlated with our definition of a good Solidity LLM.

The initial benchmark discussed was an LLM judge that compares the functionality implemented by the reference source code from the test split with the code generated by the target LLM. Determining whether the LLM-generated code implements the specification in the prompt is challenging, as it is not as simple as grading a multiple-choice test.
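
A rough sketch of what that judge could look like, just to make the idea concrete. The prompt wording, the 0–10 scale, and the `complete` callable are placeholders, not a settled design:

```python
# Hypothetical sketch of the LLM-judge benchmark; the judge model, prompt,
# and scoring scale are assumptions, not a final design.
from dataclasses import dataclass
from typing import Callable

JUDGE_PROMPT = """You are reviewing Solidity code.

Specification:
{spec}

Reference implementation (from the test split):
{reference}

Candidate implementation (generated by the model under evaluation):
{candidate}

Does the candidate implement the functionality described in the specification,
as the reference does? Answer with a single integer from 0 (not at all) to 10
(functionally equivalent), and nothing else."""


@dataclass
class JudgeResult:
    raw_answer: str
    score: float  # normalized to [0, 1]; -1.0 if the answer could not be parsed


def judge_sample(
    complete: Callable[[str], str],  # any completion client wrapped as prompt -> text
    spec: str,
    reference: str,
    candidate: str,
) -> JudgeResult:
    """Ask the judge LLM to rate functional equivalence of one sample."""
    prompt = JUDGE_PROMPT.format(spec=spec, reference=reference, candidate=candidate)
    answer = complete(prompt).strip()
    try:
        score = max(0.0, min(10.0, float(answer))) / 10.0
    except ValueError:
        score = -1.0  # judge did not follow the output format; handle upstream
    return JudgeResult(raw_answer=answer, score=score)
```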

Secondary benchmarks that could be used to define the leaderboard include:

  1. Code similarity: How semantically similar is the output to the reference (using embeddings)? (See the sketch after this list.)
  2. Code length: How similar is the generated code's length to the reference's?
  3. Solidity understanding: How well does it understand Solidity? (This could use a new dataset of code snippets and explanations, transformed into a boolean test to avoid LLM judges.)
  4. Vulnerability identification: How well does it identify vulnerabilities? (This could use a new guided dataset, e.g., "Does this snippet contain re-entrancy: yes/no?")
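
A concrete sketch of benchmarks 1 and 2 (embedding-based similarity plus a simple length ratio). The embedding model named here is only a placeholder; a code-specific embedding model would likely be a better fit:

```python
# Sketch of the code-similarity and code-length benchmarks; both return
# scores in [0, 1] so they can be combined with the other benchmarks.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model


def code_similarity(reference: str, candidate: str) -> float:
    """Cosine similarity between embeddings of the reference and generated code."""
    ref_vec, cand_vec = _model.encode([reference, candidate])
    cos = float(np.dot(ref_vec, cand_vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(cand_vec)))
    return (cos + 1.0) / 2.0  # map [-1, 1] to [0, 1]


def length_similarity(reference: str, candidate: str) -> float:
    """Ratio of the shorter to the longer code length, in [0, 1]."""
    a, b = len(reference), len(candidate)
    return min(a, b) / max(a, b) if max(a, b) > 0 else 1.0
```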

Some of the alternative benchmarks are not great on their own, but together they can define a good score. For instance, code length is a weak estimator by itself, but it could break ties between LLMs with similar performance on the other benchmarks, since the learned target is to generate code as similar as possible to the original for a given prompt.
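
To make the weighted-average idea concrete, something like the sketch below could fold the per-benchmark scores (each normalized to [0, 1]) into a single leaderboard score. The weights are purely illustrative; code length is kept small so it mostly acts as a tie-breaker:

```python
# Illustrative aggregation of per-benchmark scores into one leaderboard score.
# The benchmark names and weights are assumptions, not agreed values.
WEIGHTS = {
    "llm_judge": 0.50,
    "code_similarity": 0.20,
    "solidity_understanding": 0.15,
    "vulnerability_identification": 0.10,
    "code_length": 0.05,
}


def leaderboard_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores, each expected in [0, 1]."""
    total_weight = sum(WEIGHTS[name] for name in scores if name in WEIGHTS)
    if total_weight == 0:
        return 0.0
    weighted = sum(WEIGHTS[name] * value for name, value in scores.items() if name in WEIGHTS)
    return weighted / total_weight
```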

brunneis commented 3 months ago

After talking with @not-lain, we agree it's best to stay as close as possible to current standards. To that end, we can port standard benchmarks such as HumanEval and ClassEval to Solidity (i.e., add Solidity support). Later, we can add our own original evaluations if necessary.
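
Once HumanEval/ClassEval-style tasks exist for Solidity (a prompt plus unit tests per task, with generated contracts compiled and run against those tests), the usual pass@k metric applies unchanged. A minimal sketch, assuming a separate harness (not shown) reports how many of the n samples per task pass:

```python
# Standard pass@k metric used by HumanEval-style leaderboards. The Solidity
# compile-and-test harness that produces the (n_samples, n_passing) pairs is
# assumed and not shown here.
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each result is (n_samples, n_passing)."""
    return mean(pass_at_k(n, c, k) for n, c in results)


# e.g. 3 tasks, 20 samples each, with 5, 0, and 12 passing samples:
# benchmark_pass_at_k([(20, 5), (20, 0), (20, 12)], k=1)
```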

Context: