brunneis opened 3 months ago
Which model produces the best code when tested with prompts of varying difficulty levels ("beginner," "average," and "expert") that are not present in the dataset? The intermediate model gave the best results; it was the only one that responded correctly to the first test, and its subsequent responses were on par with those of the other models.
Does the fine-tuned version of the model, limited to 3,500 words of code, perform better than the version without this limitation? They are very similar. The model trained on this new dataset is perhaps slightly more coherent, but the difference is not significant. However, slightly better results were obtained with less data, which is an advantage.
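For reference, this is a minimal sketch of the kind of filter that limitation implies: keeping only dataset entries whose code stays under a 3,500-word budget. The record layout (a "code" field) and the whitespace-based word count are assumptions for illustration, not the project's actual preprocessing.

```python
# Hypothetical filter: drop examples whose code exceeds a 3,500-word budget.
MAX_CODE_WORDS = 3500

def within_word_limit(record: dict) -> bool:
    # Approximate "words of code" as whitespace-separated tokens.
    return len(record["code"].split()) <= MAX_CODE_WORDS

# Toy records standing in for the real dataset entries.
records = [
    {"prompt": "Short task", "code": "def add(a, b):\n    return a + b"},
    {"prompt": "Very long generated file", "code": "token " * 5000},
]

filtered = [r for r in records if within_word_limit(r)]
print(f"Kept {len(filtered)} of {len(records)} examples")
```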
Does the best-performing LLM show significant improvement over the base model? Yes, of course. It consistently generates much more code and tries to produce the most complete solution, which has both advantages and disadvantages.
Does the base model perform significantly better than a general-purpose LLM fine-tuned in the same way (e.g., Gemma 2, Llama 3)? With Llama 3, the results were similar, though some aspects were worse, so I consider it somewhat inferior to the Granite models we developed. With Gemma 2, the fine-tuning was flawed; Gemma models have never responded well to fine-tuning, and this case was no exception: it was the worst model of all.