Closed by grayJiaaoLi 1 week ago
To evaluate the performance of our fine-tuned LLM, it is critical to find suitable benchmark metrics that reflect how well the model answers CNCF-related questions.
Many options exist for evaluating the quality of our fine-tuned model. However, because the underlying base model may already be quite good at answering specific questions about CNCF projects, it is hard to find an evaluation method that yields a realistic metric. We will need to try different methods and see how they perform. Ultimately, domain experts will be the best judges of very specific, hard questions about CNCF projects.
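As one concrete starting point, a simple SQuAD-style token-level F1 against a reference answer is an example of the kind of "existing and proven" scoring we could try. The sketch below is illustrative only; the function name and example data are assumptions, not part of the project.

```python
# Hypothetical sketch: token-level F1 between a model answer and a
# reference answer (SQuAD-style). Names and data are illustrative.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Example: scoring a model answer about a CNCF project.
score = token_f1(
    "Kubernetes is a container orchestration platform",
    "Kubernetes is an open source container orchestration platform",
)
# score ≈ 0.714 (5 of 6 predicted tokens overlap; 5 of 8 reference tokens)
```

A lexical metric like this will undervalue correct paraphrases, which is one reason expert review remains the final judge for hard, domain-specific questions.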
### User story

### Acceptance criteria
- Any existing and proven scoring system can be used.
- The evaluation focuses on question-answering ability.
- Factual accuracy is the primary goal.
- Response time should be measured.
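For the response-time criterion, a minimal approach is to wrap each model call in a wall-clock timer. In this sketch, `answer_question` is a hypothetical stand-in for the fine-tuned model's inference call, not an actual project API.

```python
# Hypothetical sketch: measure per-question response time.
# `answer_question` is a placeholder for the real model call.
import time


def answer_question(question: str) -> str:
    # Stand-in for the fine-tuned LLM's inference.
    return "stub answer to: " + question


def timed_answer(question: str) -> tuple[str, float]:
    """Return the model's answer and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    answer = answer_question(question)
    elapsed = time.perf_counter() - start
    return answer, elapsed


answer, seconds = timed_answer("What does etcd store in a Kubernetes cluster?")
```

Aggregating these timings (e.g. mean and p95 over the benchmark set) would give a comparable latency figure alongside the accuracy scores.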
### Definition of done (DoD)

#### DoD general criteria