
Benchmark 5 real smart contracts with EVMind #2916

Open · brunneis opened 1 month ago

brunneis commented 1 month ago

Pick 5 real smart contracts with fewer than 4096 tokens each and reverse an "average" prompt out of each one, using a prompt similar to the one used to generate the dataset:

Create a prompt for an average user with some technical knowledge. Include a general description of the desired functionality and some high-level technical details on how the smart contract should be structured.
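A minimal sketch of how the selection step could be automated, assuming `tiktoken` for token counting (the `contracts/` directory and `.sol` glob are placeholders, not paths from this repo):

```python
# Sketch: keep only contracts under the 4096-token budget, then pick 5.
# Assumes `pip install tiktoken`; paths are placeholders.
from pathlib import Path

import tiktoken

MAX_TOKENS = 4096
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-class tokenizer

def small_enough(path: Path) -> bool:
    return len(enc.encode(path.read_text())) < MAX_TOKENS

candidates = [p for p in Path("contracts").glob("*.sol") if small_enough(p)]
selected = candidates[:5]  # the 5 real contracts used for the benchmark
```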

For each example, generate two prompts: one with GPT-4o and one with Sonnet 3.5.
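The issue does not pin down the exact API calls; a sketch using the official `openai` and `anthropic` Python SDKs might look like this, with the model identifiers as assumptions:

```python
# Sketch: reverse an "average" user prompt from a contract's source
# with both GPT-4o and Claude 3.5 Sonnet. Model ids are assumptions.
import anthropic
from openai import OpenAI

REVERSE_INSTRUCTION = (
    "Create a prompt for an average user with some technical knowledge. "
    "Include a general description of the desired functionality and some "
    "high-level technical details on how the smart contract should be structured."
)

def reverse_with_gpt4o(source: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{REVERSE_INSTRUCTION}\n\n{source}"}],
    )
    return resp.choices[0].message.content

def reverse_with_sonnet(source: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{REVERSE_INSTRUCTION}\n\n{source}"}],
    )
    return resp.content[0].text
```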

With the obtained prompts, generate the code with the two best fine-tunes so far (Granite and Llama 3), as well as with GPT-4o and Sonnet 3.5.
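For the two fine-tunes, generation with Hugging Face `transformers` would look roughly as below; the checkpoint id is a hypothetical placeholder, not the actual EVMind fine-tune name:

```python
# Sketch: generate Solidity from a reversed prompt with a local 8B fine-tune.
# "evmind/granite-8b-solidity" is a hypothetical checkpoint id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "evmind/granite-8b-solidity"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_contract(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    # Strip the prompt tokens and decode only the generated continuation.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```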

Compare the generated results with the original code and evaluate whether the fine-tunes are effective.
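The comparison metric is not specified in the issue; one simple baseline is a character-level similarity ratio from the Python standard library, which measures textual closeness but not functional equivalence (the paths and outputs below are hypothetical):

```python
# Sketch: score each model's output against the original contract source.
# A higher ratio means textually closer; it does not prove the generated
# code compiles or behaves the same.
from difflib import SequenceMatcher
from pathlib import Path

def similarity(original: str, generated: str) -> float:
    return SequenceMatcher(None, original, generated).ratio()

original_source = Path("contracts/Example.sol").read_text()  # hypothetical path
generated = {
    "granite-ft": "...",   # outputs collected in the previous step
    "llama3-ft": "...",
    "gpt-4o": "...",
    "sonnet-3.5": "...",
}
for name, code in generated.items():
    print(f"{name}: {similarity(original_source, code):.2f}")
```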

danielbrdz commented 1 month ago

All the required steps were carried out, and the following results were obtained:

GPT-4o and Claude 3.5: Both models show very similar performance in generating Solidity code. In some cases Claude 3.5 outperforms GPT-4o, while in others the opposite holds. Overall, it is practically a tie.

Granite ft: Although it produces good, complete results that are coherent with what was requested, it does not surpass GPT-4o or Claude 3.5. Its responses are shorter and more limited. While its performance is remarkable for an 8B model, it is not comparable to the best models on the market.

Llama 3 ft: Its results are somewhat inferior to Granite ft's, though its overall performance is similar.

Conclusion: It is not possible to compete with the best models on the market using models as small as 8B. Although the fine-tuned models deliver good results for their size and might be the best LLMs in their category for Solidity, GPT-4o and Claude 3.5 both generate better code.

Results: https://github.com/EveripediaNetwork/iq-code-evmind/tree/master/Benchmark%205