
Benchmark 5 real smart contracts with EVMind #2916

Open · brunneis opened 1 month ago

brunneis commented 1 month ago

Pick 5 real smart contracts with fewer than 4096 tokens each and reverse an "average" prompt out of each one, using a prompt similar to the one used to generate the dataset:

Create a prompt for an average user with some technical knowledge. Include a general description of the desired functionality and some high-level technical details on how the smart contract should be structured.
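A minimal sketch of how the selection step could be automated, assuming `tiktoken` for token counting (the `contracts/` directory and `.sol` glob are placeholders, not paths from this repo):

```python
# Sketch: keep only contracts under the 4096-token budget, then pick 5.
# Assumes `pip install tiktoken`; paths are placeholders.
from pathlib import Path

import tiktoken

MAX_TOKENS = 4096
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-class tokenizer

def small_enough(path: Path) -> bool:
    return len(enc.encode(path.read_text())) < MAX_TOKENS

candidates = [p for p in Path("contracts").glob("*.sol") if small_enough(p)]
selected = candidates[:5]  # the 5 real contracts used for the benchmark
```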

For each example, generate two prompts: one with GPT-4o and one with Sonnet 3.5.
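The issue does not pin down the exact API calls; a sketch using the official `openai` and `anthropic` Python SDKs might look like this, with the model identifiers as assumptions:

```python
# Sketch: reverse an "average" user prompt from a contract's source
# with both GPT-4o and Claude 3.5 Sonnet. Model ids are assumptions.
import anthropic
from openai import OpenAI

REVERSE_INSTRUCTION = (
    "Create a prompt for an average user with some technical knowledge. "
    "Include a general description of the desired functionality and some "
    "high-level technical details on how the smart contract should be structured."
)

def reverse_with_gpt4o(source: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{REVERSE_INSTRUCTION}\n\n{source}"}],
    )
    return resp.choices[0].message.content

def reverse_with_sonnet(source: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{REVERSE_INSTRUCTION}\n\n{source}"}],
    )
    return resp.content[0].text
```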

With the obtained prompts, generate the code with the two best fine-tunes so far (Granite and Llama 3), as well as with GPT-4o and Sonnet 3.5.
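For the two fine-tunes, generation with Hugging Face `transformers` would look roughly as below; the checkpoint id is a hypothetical placeholder, not the actual EVMind fine-tune name:

```python
# Sketch: generate Solidity from a reversed prompt with a local 8B fine-tune.
# "evmind/granite-8b-solidity" is a hypothetical checkpoint id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "evmind/granite-8b-solidity"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_contract(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    # Strip the prompt tokens and decode only the generated continuation.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```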

Compare the generated results with the original code and evaluate whether the fine-tunes are effective.
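The comparison metric is not specified in the issue; one simple baseline is a character-level similarity ratio from the Python standard library, which measures textual closeness but not functional equivalence (the paths and outputs below are hypothetical):

```python
# Sketch: score each model's output against the original contract source.
# A higher ratio means textually closer; it does not prove the generated
# code compiles or behaves the same.
from difflib import SequenceMatcher
from pathlib import Path

def similarity(original: str, generated: str) -> float:
    return SequenceMatcher(None, original, generated).ratio()

original_source = Path("contracts/Example.sol").read_text()  # hypothetical path
generated = {
    "granite-ft": "...",   # outputs collected in the previous step
    "llama3-ft": "...",
    "gpt-4o": "...",
    "sonnet-3.5": "...",
}
for name, code in generated.items():
    print(f"{name}: {similarity(original_source, code):.2f}")
```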

danielbrdz commented 1 month ago

All the required steps were carried out, and the following results were obtained:

GPT-4o and Claude 3.5: Both models show very similar performance in generating Solidity code. In some cases Claude 3.5 outperforms GPT-4o, while in others the opposite holds. Overall, it is practically a tie.

Granite ft: Although it produces good, complete results that are coherent with what was requested, it does not surpass GPT-4o or Claude 3.5. Its responses are shorter and more limited. While its performance is remarkable for an 8B model, it is not comparable to the best models on the market.

Llama 3 ft: Its results are somewhat inferior to Granite ft's, though its overall performance is similar.

Conclusion: It is not possible to compete with the best models on the market using models as small as 8B. Although the fine-tuned models deliver good results for their size and might be the best LLMs in their category for Solidity, GPT-4o and Claude 3.5 both generate better code.

Results: https://github.com/EveripediaNetwork/iq-code-evmind/tree/master/Benchmark%205