Closed: jalling97 closed this issue 1 week ago
- LLM-as-a-judge RAG eval metrics in DeepEval: https://docs.confident-ai.com/docs/guides-rag-evaluation#evaluating-retrieval (first sketch below)
- LLM-as-a-judge custom eval metrics in DeepEval: https://docs.confident-ai.com/docs/guides-rag-evaluation#beyond-generic-evaluation (second sketch below)
- More general custom metrics in DeepEval (for heuristics): https://docs.confident-ai.com/docs/metrics-custom (third sketch below)
- Stick to categorical rather than numeric scores for LLM-as-a-judge evals: https://arize.com/blog-course/numeric-evals-for-llm-as-a-judge/ (fourth sketch below)
- Needle in a haystack will be an important eval to implement: https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/ (fifth sketch below)
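For reference, a minimal sketch of running DeepEval's built-in RAG metrics from the first link. It assumes deepeval is installed and a judge model is configured (e.g. via an OpenAI key); the test-case strings are placeholders, not real pipeline output:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

# Placeholder RAG interaction; in practice these come from the pipeline under test.
test_case = LLMTestCase(
    input="What does LeapfrogAI provide?",
    actual_output="LeapfrogAI provides self-hosted AI tooling.",
    expected_output="A self-hosted AI platform.",
    retrieval_context=["LeapfrogAI is a self-hosted AI platform..."],
)

# Retrieval-side metrics (Contextual*) plus generation-side metrics
# (answer relevancy, faithfulness), per the linked guide.
metrics = [
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

evaluate([test_case], metrics)
```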
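A sketch of a custom LLM-as-a-judge metric via DeepEval's GEval (second link); the criteria string and threshold here are illustrative, not settled choices:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# The judge model grades the actual output against the expected output
# using the natural-language criteria below.
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent "
        "with the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)

# Usage against an LLMTestCase (see the first sketch):
#   correctness.measure(test_case)
#   correctness.score, correctness.reason
```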
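For the heuristic side (third link), custom non-LLM metrics subclass DeepEval's BaseMetric. An exact-match check is about the simplest example; the class name here is illustrative:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ExactMatchMetric(BaseMetric):
    """Deterministic heuristic: pass only if the output matches ground truth."""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Case-insensitive exact match against the expected output.
        self.score = float(
            test_case.actual_output.strip().lower()
            == test_case.expected_output.strip().lower()
        )
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Exact Match"
```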
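On the Arize point about categorical vs numeric judging (fourth link): the idea is to ask the judge for a discrete label and map labels to scores, rather than asking for a calibrated 1-10 number. A framework-agnostic sketch, where `judge` is a stand-in for whatever model call the framework ends up using:

```python
from typing import Callable

# Discrete labels mapped to scores; judges are more reliable at picking
# a category than at emitting a well-calibrated number.
LABEL_SCORES = {"correct": 1.0, "partially_correct": 0.5, "incorrect": 0.0}

JUDGE_PROMPT = """\
Given the question, reference answer, and model answer, respond with
exactly one label: correct, partially_correct, or incorrect.

Question: {question}
Reference answer: {reference}
Model answer: {answer}
Label:"""

def categorical_judge(
    question: str,
    reference: str,
    answer: str,
    judge: Callable[[str], str],  # any text-in/text-out judge model call
) -> float:
    raw = judge(
        JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    )
    # Unknown or malformed labels fail closed.
    return LABEL_SCORES.get(raw.strip().lower(), 0.0)
```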
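And for needle in a haystack (fifth link), the core mechanic is inserting a known fact at varying depths of a long context, then checking whether retrieval and generation surface it. A hypothetical case generator, with all names and values illustrative:

```python
# Illustrative needle/question pair; not from any real dataset.
NEEDLE = "The secret ingredient in the chowder is saffron."
QUESTION = "What is the secret ingredient in the chowder?"

def build_haystack(filler: str, needle: str, depth: float, target_chars: int) -> str:
    """Embed the needle at a fractional depth (0.0 = start, 1.0 = end)
    of filler text truncated to roughly target_chars characters."""
    haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

filler_text = "This sentence is padding that carries no useful information. " * 20

# Sweep insertion depths; each resulting context becomes a RAG test case,
# scored on whether the answer to QUESTION contains the needle fact.
contexts = [
    build_haystack(filler_text, NEEDLE, depth, target_chars=8000)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
]
```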
Current LLM-as-a-judge metrics to use:
Non-LLM-enabled evaluations:
Non-RAG LLM benchmarks:
Update to the metrics to be used:
The LeapfrogAI RAG evaluation framework will use the following evaluations:
LLM-as-a-judge metrics:
Non-LLM-enabled evaluations:
Non-RAG LLM benchmarks:
Closing this to move forward on implementation
Description
Metrics make or break an evaluation framework, so it's important to choose metrics that align with the framework's overall scope and goals.
There are two major types of metrics that will be used:
- Heuristic metrics, for deterministic testing against high-trust ground truth
- LLM-as-a-judge metrics, for subjective qualities that are otherwise hard to score
To build an MVP evaluation framework, both will need to be leveraged: heuristics for high-trust ground-truth testing, and LLM-as-judge metrics for subjective tasks, which can be added easily (see the combined sketch below).
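A minimal sketch of mixing the two metric types in a single DeepEval run, reusing the illustrative ExactMatchMetric from the custom-metric sketch above and assuming `test_cases` holds LLMTestCase objects:

```python
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric

metrics = [
    FaithfulnessMetric(threshold=0.8),  # LLM-as-judge: grounding in retrieved context
    ExactMatchMetric(),                 # heuristic: deterministic ground-truth check
]
evaluate(test_cases, metrics)
```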
Completion Criteria
Relevant Links
Decision
See comment