brunneis opened 1 month ago
The evaluation was completed, with the following results:
GPT-4o and Claude 3.5: Both models show very similar performance when generating Solidity code. In some cases Claude 3.5 outperforms GPT-4o; in others, the opposite. Overall, it is practically a tie.
Granite ft: Although it produces good, complete results that are consistent with what was requested, it does not surpass GPT-4o or Claude 3.5. Its responses are shorter and more limited. While its performance is remarkable for an 8B model, it is not comparable to the best models on the market.
Llama 3 ft: Its results are slightly inferior to Granite ft's, though its overall performance is similar.
Conclusion: It is not possible to compete with the best models on the market using models as small as 8B. Although they deliver good results for their size and might be the best LLMs in their category for Solidity, both GPT-4o and Claude 3.5 generate better code.
Results: https://github.com/EveripediaNetwork/iq-code-evmind/tree/master/Benchmark%205
Pick 5 real smart contracts with fewer than 4096 tokens each and reverse an "average" prompt out of each of them, using a prompt similar to the one used to generate the dataset:
For each example, generate two prompts: one with GPT-4o and one with Sonnet 3.5.
With the obtained prompts, generate the code with the two best fine-tunes so far (Granite and Llama 3), and with GPT-4o and Sonnet 3.5.
Compare the generated results with the original code and evaluate whether the fine-tunes are effective.
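The steps above could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: `approx_token_count` is a crude character-based heuristic standing in for the real tokenizer, the similarity metric is a simple `difflib` ratio rather than a proper code evaluation, and the model callables (`prompt_models`, `code_models`) are hypothetical placeholders for the GPT-4o / Sonnet 3.5 / fine-tune API calls.

```python
from difflib import SequenceMatcher


def approx_token_count(text: str) -> int:
    # Rough heuristic (~4 characters per token); the real benchmark
    # would use the target model's own tokenizer.
    return max(1, len(text) // 4)


def select_contracts(contracts, limit=4096, n=5):
    """Step 1: pick up to n contracts under the token limit."""
    eligible = [c for c in contracts if approx_token_count(c) < limit]
    return eligible[:n]


def similarity(original: str, generated: str) -> float:
    """Step 4: compare generated code against the original source.
    A difflib ratio is only a stand-in for a real evaluation."""
    return SequenceMatcher(None, original, generated).ratio()


def run_benchmark(contracts, prompt_models, code_models):
    """prompt_models and code_models map a model name to a callable
    (hypothetical stand-ins for the actual API clients)."""
    scores = {}
    for contract in select_contracts(contracts):
        for p_name, p_model in prompt_models.items():
            prompt = p_model(contract)  # step 2: reverse a prompt
            for c_name, c_model in code_models.items():
                generated = c_model(prompt)  # step 3: regenerate code
                key = (p_name, c_name)
                scores.setdefault(key, []).append(
                    similarity(contract, generated)
                )
    # Average similarity per (prompt model, code model) pair.
    return {k: sum(v) / len(v) for k, v in scores.items()}
```

With stub callables in place of real models, `run_benchmark(sources, {"gpt4o": reverse_prompt}, {"granite_ft": generate_code})` would return one averaged score per model pairing, making the four-way comparison (Granite ft, Llama 3 ft, GPT-4o, Sonnet 3.5) a matter of reading off the score table.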