amosproj / amos2024ss08-cloud-native-llm


Select Suitable Benchmark Metrics #80

Closed grayJiaaoLi closed 1 week ago

grayJiaaoLi commented 3 weeks ago

User story

  1. As a data engineer
  2. I want / need to evaluate the performance of the trained model
  3. So that we can further improve the model accordingly

Acceptance criteria

Definition of done (DoD)

DoD general criteria

dnsch commented 2 weeks ago

Report on Suitable Benchmark Metrics: Research Findings

To evaluate the performance of our fine-tuned LLM, it is critical that we find suitable benchmark metrics that reflect how well our model answers CNCF-related questions.

Findings

  1. The BiLingual Evaluation Understudy (BLEU) score evaluates machine-translated text. While not ideal for our case of fine-tuning a question-answering LLM, we might still obtain a useful metric by comparing the reference answer from our previously generated question-answer pairs with the answer our fine-tuned LLM generates for the same question (see the BLEU sketch after this list).
  2. This article (https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation) proposes G-Eval, which uses LLMs to evaluate LLM outputs. The G-Eval paper (https://arxiv.org/pdf/2303.16634) suggests that G-Eval is superior to other evaluation metrics; however, it might introduce further ambiguities, since LLMs always involve some randomness, which could harm the accuracy of the evaluation. If it works, though, it would be interesting to have a pipeline that relies entirely on LLMs, from question-answer generation (QAG) to LLM evaluation (a rough LLM-as-judge sketch follows below).
  3. Human evaluation: we might be able to ask our business partner to evaluate the model on a set of prepared questions, since as domain experts they can gauge the correctness of the answers. This would arguably yield the best evaluation of the model, but it would only work on a small subset of questions and hence not at a scale sufficient for a full model evaluation.
  4. Generating a multiple-choice benchmark: we could ask an LLM to generate multiple-choice questions based on part of our data and then evaluate models on this custom benchmark with a few-shot approach (sketched below). However, this blurs the line between training and test data, and it would take a lot of compute time and resources.
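
A minimal sketch of how the BLEU comparison from point 1 could look, assuming our generated question-answer pairs are available as (question, reference answer) tuples; `generate_answer` is a hypothetical placeholder for our fine-tuned model's inference code:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def generate_answer(question: str) -> str:
    # Hypothetical placeholder: replace with a call to our fine-tuned model.
    return "Kubernetes is an open source container orchestration system."

def bleu_for_pair(reference_answer: str, model_answer: str) -> float:
    """Score one model answer against the reference answer from our QA pairs."""
    references = [reference_answer.split()]   # BLEU expects a list of tokenized references
    hypothesis = model_answer.split()
    smoothing = SmoothingFunction().method1   # avoid zero scores on short answers
    return sentence_bleu(references, hypothesis, smoothing_function=smoothing)

# Average BLEU over the generated QA pairs (single made-up pair for illustration)
qa_pairs = [("What is Kubernetes?",
             "Kubernetes is an open source system for orchestrating containers.")]
scores = [bleu_for_pair(ref, generate_answer(q)) for q, ref in qa_pairs]
print(sum(scores) / len(scores))
```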
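For the LLM-as-judge direction from point 2, a rough sketch (not the exact G-Eval algorithm from the paper) could look like the following; `query_llm` is a hypothetical wrapper around whichever evaluator LLM we end up choosing:

```python
JUDGE_PROMPT = """You are grading an answer about a CNCF project.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (fully correct and complete).
Reply with only the number."""

def judge_answer(query_llm, question: str, reference: str, candidate: str) -> int:
    """Ask the evaluator LLM for a 1-5 score; return 0 if the reply is unparsable."""
    reply = query_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    try:
        return int(reply.strip())
    except ValueError:
        return 0
```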
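And for the multiple-choice idea from point 4, a sketch of a few-shot accuracy loop; the example question and the `query_llm` wrapper around the model under test are again hypothetical:

```python
# One made-up few-shot example to show the expected answer format.
FEW_SHOT = (
    "Question: Which CNCF project provides service mesh functionality?\n"
    "A) Prometheus B) Linkerd C) Helm D) etcd\n"
    "Answer: B\n\n"
)

def accuracy(query_llm, items):
    """items: dicts with 'question', 'choices' (letter -> text), 'answer' (letter)."""
    correct = 0
    for item in items:
        choices = " ".join(f"{letter}) {text}" for letter, text in item["choices"].items())
        prompt = f"{FEW_SHOT}Question: {item['question']}\n{choices}\nAnswer:"
        reply = query_llm(prompt).strip()
        if reply[:1].upper() == item["answer"]:
            correct += 1
    return correct / len(items) if items else 0.0
```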

Conclusion

Many options exist to evaluate the quality of our fine-tuned model, but since the underlying base model might already be quite good at answering specific questions about CNCF projects, it is hard to find an evaluation method that provides a realistic metric. We will need to try different methods and see how well they work. In the end, the domain experts will be the best judges when it comes to very specific and hard questions about CNCF-related projects.