Conduct Comparison of Evaluation Between Models

User story

As a Machine Learning Engineer,
I want to compare the evaluation results of different models
so that we can see whether our fine-tuned model improves performance on CNCF-related questions.

Run the evaluation for the following models:
- Fine-tuned LLM: our initially fine-tuned model
- Base Model: Gemma
- Equivalent Competitor: Llama 2
Use the same test data set for all models
- Organise the answers from the different models under the same question, for example:
- Q: "How does the ScaleWorkload function facilitate the scaling of a workload to specified replicas?"
  - Fine-tuned: ...
  - Gemma: ...
  - Llama: ...
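
The question/answer layout above could be generated with something like the following. This is a minimal sketch: the record fields (`question`, `model`, `answer`) and the helper names `group_answers` and `format_comparison` are assumptions made for illustration, not part of any existing code in the project.

```python
from collections import defaultdict


def group_answers(records):
    """Group answer records by question.

    Each record is assumed to be a dict with the (hypothetical) keys
    "question", "model" and "answer".
    """
    grouped = defaultdict(dict)
    for rec in records:
        grouped[rec["question"]][rec["model"]] = rec["answer"]
    return dict(grouped)


def format_comparison(grouped, model_order):
    """Render the grouped answers as a nested dash list, one block per question."""
    lines = []
    for question, answers in grouped.items():
        lines.append(f'- Q: "{question}"')
        for model in model_order:
            # Models without an answer for this question are marked explicitly.
            lines.append(f"  - {model}: {answers.get(model, '(no answer)')}")
    return "\n".join(lines)
```

Passing the same `model_order` for every question keeps the answers in a fixed order, which makes it easier to read the models side by side.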
The selected quantitative metrics should also be compared across the models
Document our impressions of the fine-tuned LLM's answers
Document the results and upload the comparison to GitHub
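
One possible way to put the metric scores side by side is a small table with the difference to the base model shown next to each score. The sketch below assumes the scores have already been computed elsewhere and are passed in as `{model: {metric: value}}`. The function name `metrics_table` and the metric name in the usage comment are placeholders, since the issue does not fix which metrics are selected.

```python
def metrics_table(scores, metric_names, baseline):
    """Build a Markdown table of metric scores per model.

    scores:       {model_name: {metric_name: float}} (assumed input shape)
    metric_names: metrics to show, in column order
    baseline:     model whose scores the others are compared against
    """
    base_scores = scores[baseline]
    lines = [
        "| Model | " + " | ".join(metric_names) + " |",
        "|" + " --- |" * (len(metric_names) + 1),
    ]
    for model, values in scores.items():
        cells = []
        for name in metric_names:
            value = values.get(name)
            if value is None:
                cells.append("n/a")
            elif model == baseline or base_scores.get(name) is None:
                cells.append(f"{value:.3f}")
            else:
                # Show the signed difference to the baseline next to the score.
                cells.append(f"{value:.3f} ({value - base_scores[name]:+.3f})")
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Usage with made-up placeholder numbers (not real results):
# print(metrics_table({"Gemma": {"metric_a": 0.5},
#                      "Fine-tuned": {"metric_a": 0.75}},
#                     ["metric_a"], baseline="Gemma"))
```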
- Label our initially fine-tuned model clearly
- Leave space for updating evaluation on our further improved model
- The evaluation results of the base model and the equivalent competitor should be reusable
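
To make the base model and competitor results reusable, one option is to cache each model's evaluation output on disk and only re-run a model when no cached file exists. The following is a sketch under assumptions: `load_or_evaluate`, the `eval_cache` directory and the `evaluate_fn` callback are hypothetical names, and the results are assumed to be JSON-serialisable.

```python
import json
from pathlib import Path


def load_or_evaluate(model_name, evaluate_fn, cache_dir="eval_cache"):
    """Return cached evaluation results for a model, or compute and store them.

    evaluate_fn is a caller-supplied function (hypothetical) that takes the
    model name and returns JSON-serialisable results.
    """
    # One file per model; spaces are replaced to keep file names simple.
    path = Path(cache_dir) / f"{model_name.replace(' ', '_')}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    results = evaluate_fn(model_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

With this layout, Gemma and Llama 2 would only need to be evaluated once, and a later, further improved fine-tuned model can be added under a new, clearly labelled name without touching the cached baselines.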