amosproj / amos2024ss08-cloud-native-llm


Select Suitable Benchmark Metrics #80

Closed grayJiaaoLi closed 1 week ago

grayJiaaoLi commented 3 weeks ago

User story

  1. As a data engineer
  2. I want / need to evaluate the performance of the trained model
  3. So that we can further improve the model accordingly

Acceptance criteria

Definition of done (DoD)

DoD general criteria

dnsch commented 2 weeks ago

Report on Suitable Benchmark Metrics: Research Findings

To evaluate the performance of our fine-tuned LLM, it is critical that we find suitable benchmark metrics that reflect how well our model answers CNCF-related questions.

Findings

  1. The BiLingual Evaluation Understudy (BLEU) score evaluates machine-translated text. While not ideal for our case of fine-tuning a question-answering LLM, we might still obtain a useful metric by comparing the reference answer from our previously generated question-answer pairs with the answer our fine-tuned LLM generates for the same question (see the BLEU sketch after this list).
  2. This article (https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation) proposes G-Eval, which uses LLMs to evaluate LLM outputs. The G-Eval paper (https://arxiv.org/pdf/2303.16634) suggests that G-Eval is superior to other evaluation metrics; however, it might introduce further ambiguities, since LLMs always involve some randomness, which could harm the accuracy of the evaluation. If it works, though, it would be interesting to have a pipeline that relies entirely on LLMs, from question-answer generation (QAG) to LLM evaluation (a rough LLM-as-judge sketch follows below).
  3. Human evaluation: we might be able to ask our business partner to evaluate the model on a set of prepared questions, since as domain experts they can gauge the correctness of the answers. This would arguably yield the best evaluation of the model, but it would only work on a small subset of questions and hence not at a scale sufficient for a full model evaluation.
  4. Generating a multiple-choice benchmark: we could ask an LLM to generate multiple-choice questions based on part of our data and then evaluate models on this custom benchmark with a few-shot approach (sketched below). However, this blurs the line between training and test data, and it would take a lot of compute time and resources.
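
A minimal sketch of how the BLEU comparison from point 1 could look, assuming our generated question-answer pairs are available as (question, reference answer) tuples; `generate_answer` is a hypothetical placeholder for our fine-tuned model's inference code:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def generate_answer(question: str) -> str:
    # Hypothetical placeholder: replace with a call to our fine-tuned model.
    return "Kubernetes is an open source container orchestration system."

def bleu_for_pair(reference_answer: str, model_answer: str) -> float:
    """Score one model answer against the reference answer from our QA pairs."""
    references = [reference_answer.split()]   # BLEU expects a list of tokenized references
    hypothesis = model_answer.split()
    smoothing = SmoothingFunction().method1   # avoid zero scores on short answers
    return sentence_bleu(references, hypothesis, smoothing_function=smoothing)

# Average BLEU over the generated QA pairs (single made-up pair for illustration)
qa_pairs = [("What is Kubernetes?",
             "Kubernetes is an open source system for orchestrating containers.")]
scores = [bleu_for_pair(ref, generate_answer(q)) for q, ref in qa_pairs]
print(sum(scores) / len(scores))
```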
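For the LLM-as-judge direction from point 2, a rough sketch (not the exact G-Eval algorithm from the paper) could look like the following; `query_llm` is a hypothetical wrapper around whichever evaluator LLM we end up choosing:

```python
JUDGE_PROMPT = """You are grading an answer about a CNCF project.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (fully correct and complete).
Reply with only the number."""

def judge_answer(query_llm, question: str, reference: str, candidate: str) -> int:
    """Ask the evaluator LLM for a 1-5 score; return 0 if the reply is unparsable."""
    reply = query_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    try:
        return int(reply.strip())
    except ValueError:
        return 0
```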
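And for the multiple-choice idea from point 4, a sketch of a few-shot accuracy loop; the example question and the `query_llm` wrapper around the model under test are again hypothetical:

```python
# One made-up few-shot example to show the expected answer format.
FEW_SHOT = (
    "Question: Which CNCF project provides service mesh functionality?\n"
    "A) Prometheus B) Linkerd C) Helm D) etcd\n"
    "Answer: B\n\n"
)

def accuracy(query_llm, items):
    """items: dicts with 'question', 'choices' (letter -> text), 'answer' (letter)."""
    correct = 0
    for item in items:
        choices = " ".join(f"{letter}) {text}" for letter, text in item["choices"].items())
        prompt = f"{FEW_SHOT}Question: {item['question']}\n{choices}\nAnswer:"
        reply = query_llm(prompt).strip()
        if reply[:1].upper() == item["answer"]:
            correct += 1
    return correct / len(items) if items else 0.0
```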

Conclusion

Many options exist to evaluate the quality of our fine-tuned model, but since the underlying base model might already be quite good at answering specific questions about CNCF projects, it is hard to find an evaluation method that provides a realistic metric. We will need to try different methods and see how well they work. In the end, the domain experts will be the best judges when it comes to very specific and hard questions about CNCF-related projects.