defenseunicorns / leapfrogai

Production-ready Generative AI for local, cloud native, airgap, and edge deployments.
https://leapfrog.ai
Apache License 2.0

(spike) Establish Metrics for RAG Evaluations #718

Closed: jalling97 closed this issue 1 week ago

jalling97 commented 2 months ago

Description

Metrics make or break an evaluation framework, so it's important to choose metrics that align with overall scope and goals.

There are two major types of metrics that will be used:

- Heuristic (deterministic) metrics, for high-trust testing against ground truth
- LLM-as-a-judge metrics, for subjective qualities that are hard to score deterministically but easy to add

To reach an MVP evaluation framework, both types will need to be leveraged (a minimal sketch contrasting the two follows this list).
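To make the split concrete, here is a minimal sketch, assuming DeepEval as the harness; the example strings and the 0.7 threshold are placeholders, not decided values:

```python
# Illustrative sketch only: contrasting the two metric types on one RAG test case.
# DeepEval's API is assumed; all example strings are placeholders.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What port does the API listen on?",          # placeholder question
    actual_output="The API listens on port 8080.",      # RAG pipeline answer
    expected_output="8080",                              # ground-truth answer
    retrieval_context=["...retrieved chunk mentioning port 8080..."],
)

# Heuristic metric: deterministic, high-trust check against ground truth.
heuristic_pass = test_case.expected_output in test_case.actual_output

# LLM-as-a-judge metric: a judge model scores a subjective quality (relevancy).
judge_metric = AnswerRelevancyMetric(threshold=0.7)
judge_metric.measure(test_case)

print(f"heuristic substring check:   {heuristic_pass}")
print(f"LLM-judged answer relevancy: {judge_metric.score}")
```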

Completion Criteria

Relevant Links

Decision

See the decision comment below.

jalling97 commented 1 month ago

LLM-as-a-judge RAG Eval metrics in DeepEval: https://docs.confident-ai.com/docs/guides-rag-evaluation#evaluating-retrieval
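For reference, a rough sketch of what the built-in retrieval and generation metrics from that guide look like in code; the metric set, thresholds, and test case contents here are illustrative placeholders, not the decided configuration:

```python
# Sketch of DeepEval's built-in LLM-as-a-judge RAG metrics from the linked guide.
# Thresholds and test case contents are placeholders.
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,   # retrieval: are the relevant chunks ranked highly?
    ContextualRecallMetric,      # retrieval: was everything needed actually retrieved?
    ContextualRelevancyMetric,   # retrieval: how much of the context is relevant?
    AnswerRelevancyMetric,       # generation: does the answer address the question?
    FaithfulnessMetric,          # generation: is the answer grounded in the context?
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="...user question...",
    actual_output="...RAG answer...",
    expected_output="...ground-truth answer...",
    retrieval_context=["...chunk 1...", "...chunk 2..."],
)

metrics = [
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

evaluate([test_case], metrics)
```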

jalling97 commented 1 month ago

LLM-as-a-judge custom eval metrics in DeepEval: https://docs.confident-ai.com/docs/guides-rag-evaluation#beyond-generic-evaluation
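A sketch of the custom LLM-as-a-judge pattern (GEval) from that section; the "Correctness" criteria text and threshold are placeholders:

```python
# Sketch of a custom LLM-as-a-judge metric with DeepEval's GEval, as described
# in the linked "beyond generic evaluation" section. Criteria text is a placeholder.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent with "
        "the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="...question...",
    actual_output="...RAG answer...",
    expected_output="...reference answer...",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```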

jalling97 commented 1 month ago

More general custom metrics in DeepEval (for heuristics): https://docs.confident-ai.com/docs/metrics-custom
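A sketch of the heuristic side using the custom-metric pattern from those docs; the exact-match rule is just one example of a deterministic, non-LLM check:

```python
# Sketch of a non-LLM (heuristic) custom metric using DeepEval's BaseMetric,
# following the linked custom-metrics docs. The exact-match rule is an example.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ExactMatchMetric(BaseMetric):
    """Deterministic check: does the answer exactly match the ground truth?"""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        match = test_case.actual_output.strip() == test_case.expected_output.strip()
        self.score = 1.0 if match else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Exact Match"
```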

jalling97 commented 1 month ago

Stick to categorical evaluations: https://arize.com/blog-course/numeric-evals-for-llm-as-a-judge/
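A small illustration of the categorical approach from that post: the judge is asked for a discrete label rather than a numeric score. The prompt wording, label set, and `judge` callable are assumptions for the sketch, not part of any library:

```python
# Illustrative sketch of categorical (label-based) LLM-as-a-judge scoring.
# The prompt, labels, and judge callable are placeholders.
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Respond with exactly one word: "correct", "incorrect", or "unsure".
"""

LABELS = {"correct", "incorrect", "unsure"}

def categorical_judgment(
    question: str,
    context: str,
    answer: str,
    judge: Callable[[str], str],  # any function that sends a prompt to the judge model
) -> str:
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    label = judge(prompt).strip().lower()
    return label if label in LABELS else "unsure"  # treat unparseable output as "unsure"
```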

jalling97 commented 1 month ago

Needle in a haystack will be an important eval to implement: https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/
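A rough sketch of how the needle-in-a-haystack setup from that post could be generated; the needle text, filler, depths, and pass/fail rule are all placeholders:

```python
# Sketch of building needle-in-a-haystack test documents: plant one known fact at a
# controlled depth inside filler text, then check whether the RAG pipeline surfaces it.
NEEDLE = "The secret maintenance code for the pump is 7-4-1-9."
QUESTION = "What is the secret maintenance code for the pump?"
FILLER_SENTENCE = "This paragraph is routine filler text with no useful facts. "

def build_haystack(depth_percent: int, total_sentences: int = 400) -> str:
    """Return a long document with the needle inserted ~depth_percent of the way in."""
    sentences = [FILLER_SENTENCE] * total_sentences
    insert_at = int(total_sentences * depth_percent / 100)
    sentences.insert(insert_at, NEEDLE + " ")
    return "".join(sentences)

def score_answer(answer: str) -> bool:
    """Categorical pass/fail: did the answer recover the needle?"""
    return "7-4-1-9" in answer

# One haystack per depth; each would be ingested into RAG and queried with QUESTION.
haystacks = {depth: build_haystack(depth) for depth in (0, 25, 50, 75, 100)}
```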

jalling97 commented 2 weeks ago

Current LLM-as-a-judge metrics to use:

Non-LLM-enabled evaluations:

Non-RAG LLM benchmarks:

jalling97 commented 1 week ago

Update to the metrics that will be used:

The LeapfrogAI RAG evaluation framework will utilize the following evaluations:

LLM-as-a-judge metrics to use:

Non-LLM-enabled evaluations:

Non-RAG LLM benchmarks:

jalling97 commented 1 week ago

Closing this to move forward on implementation