defenseunicorns / leapfrogai

Production-ready Generative AI for local, cloud native, airgap, and edge deployments.
https://leapfrog.ai
Apache License 2.0

(spike) Establish Metrics for RAG Evaluations #718

Closed: jalling97 closed this issue 1 week ago

jalling97 commented 2 months ago

Description

Metrics make or break an evaluation framework, so it's important to choose metrics that align with overall scope and goals.

There are two major types of metrics that will be used:

- Heuristic (deterministic) metrics, for high-trust testing against ground truth
- LLM-as-a-judge metrics, for subjective qualities that are hard to score deterministically but easy to add

To reach an MVP evaluation framework, both types will need to be leveraged (a minimal sketch contrasting the two follows this list).
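To make the split concrete, here is a minimal sketch, assuming DeepEval as the harness; the example strings and the 0.7 threshold are placeholders, not decided values:

```python
# Illustrative sketch only: contrasting the two metric types on one RAG test case.
# DeepEval's API is assumed; all example strings are placeholders.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What port does the API listen on?",          # placeholder question
    actual_output="The API listens on port 8080.",      # RAG pipeline answer
    expected_output="8080",                              # ground-truth answer
    retrieval_context=["...retrieved chunk mentioning port 8080..."],
)

# Heuristic metric: deterministic, high-trust check against ground truth.
heuristic_pass = test_case.expected_output in test_case.actual_output

# LLM-as-a-judge metric: a judge model scores a subjective quality (relevancy).
judge_metric = AnswerRelevancyMetric(threshold=0.7)
judge_metric.measure(test_case)

print(f"heuristic substring check:   {heuristic_pass}")
print(f"LLM-judged answer relevancy: {judge_metric.score}")
```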

Completion Criteria

Relevant Links

Decision

See the decision comment below.

jalling97 commented 1 month ago

LLM-as-a-judge RAG Eval metrics in DeepEval: https://docs.confident-ai.com/docs/guides-rag-evaluation#evaluating-retrieval
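For reference, a rough sketch of what the built-in retrieval and generation metrics from that guide look like in code; the metric set, thresholds, and test case contents here are illustrative placeholders, not the decided configuration:

```python
# Sketch of DeepEval's built-in LLM-as-a-judge RAG metrics from the linked guide.
# Thresholds and test case contents are placeholders.
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,   # retrieval: are the relevant chunks ranked highly?
    ContextualRecallMetric,      # retrieval: was everything needed actually retrieved?
    ContextualRelevancyMetric,   # retrieval: how much of the context is relevant?
    AnswerRelevancyMetric,       # generation: does the answer address the question?
    FaithfulnessMetric,          # generation: is the answer grounded in the context?
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="...user question...",
    actual_output="...RAG answer...",
    expected_output="...ground-truth answer...",
    retrieval_context=["...chunk 1...", "...chunk 2..."],
)

metrics = [
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

evaluate([test_case], metrics)
```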

jalling97 commented 1 month ago

LLM-as-a-judge custom eval metrics in DeepEval: https://docs.confident-ai.com/docs/guides-rag-evaluation#beyond-generic-evaluation
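A sketch of the custom LLM-as-a-judge pattern (GEval) from that section; the "Correctness" criteria text and threshold are placeholders:

```python
# Sketch of a custom LLM-as-a-judge metric with DeepEval's GEval, as described
# in the linked "beyond generic evaluation" section. Criteria text is a placeholder.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent with "
        "the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="...question...",
    actual_output="...RAG answer...",
    expected_output="...reference answer...",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```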

jalling97 commented 1 month ago

More general custom metrics in DeepEval (for heuristics): https://docs.confident-ai.com/docs/metrics-custom
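A sketch of the heuristic side using the custom-metric pattern from those docs; the exact-match rule is just one example of a deterministic, non-LLM check:

```python
# Sketch of a non-LLM (heuristic) custom metric using DeepEval's BaseMetric,
# following the linked custom-metrics docs. The exact-match rule is an example.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ExactMatchMetric(BaseMetric):
    """Deterministic check: does the answer exactly match the ground truth?"""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        match = test_case.actual_output.strip() == test_case.expected_output.strip()
        self.score = 1.0 if match else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Exact Match"
```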

jalling97 commented 1 month ago

Stick to categorical evaluations: https://arize.com/blog-course/numeric-evals-for-llm-as-a-judge/
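A small illustration of the categorical approach from that post: the judge is asked for a discrete label rather than a numeric score. The prompt wording, label set, and `judge` callable are assumptions for the sketch, not part of any library:

```python
# Illustrative sketch of categorical (label-based) LLM-as-a-judge scoring.
# The prompt, labels, and judge callable are placeholders.
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Respond with exactly one word: "correct", "incorrect", or "unsure".
"""

LABELS = {"correct", "incorrect", "unsure"}

def categorical_judgment(
    question: str,
    context: str,
    answer: str,
    judge: Callable[[str], str],  # any function that sends a prompt to the judge model
) -> str:
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    label = judge(prompt).strip().lower()
    return label if label in LABELS else "unsure"  # treat unparseable output as "unsure"
```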

jalling97 commented 1 month ago

Needle in a haystack will be an important eval to implement: https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/
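A rough sketch of how the needle-in-a-haystack setup from that post could be generated; the needle text, filler, depths, and pass/fail rule are all placeholders:

```python
# Sketch of building needle-in-a-haystack test documents: plant one known fact at a
# controlled depth inside filler text, then check whether the RAG pipeline surfaces it.
NEEDLE = "The secret maintenance code for the pump is 7-4-1-9."
QUESTION = "What is the secret maintenance code for the pump?"
FILLER_SENTENCE = "This paragraph is routine filler text with no useful facts. "

def build_haystack(depth_percent: int, total_sentences: int = 400) -> str:
    """Return a long document with the needle inserted ~depth_percent of the way in."""
    sentences = [FILLER_SENTENCE] * total_sentences
    insert_at = int(total_sentences * depth_percent / 100)
    sentences.insert(insert_at, NEEDLE + " ")
    return "".join(sentences)

def score_answer(answer: str) -> bool:
    """Categorical pass/fail: did the answer recover the needle?"""
    return "7-4-1-9" in answer

# One haystack per depth; each would be ingested into RAG and queried with QUESTION.
haystacks = {depth: build_haystack(depth) for depth in (0, 25, 50, 75, 100)}
```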

jalling97 commented 2 weeks ago

Current LLM-as-a-judge metrics to use:

Non-LLM-enabled evaluations:

Non-RAG LLM benchmarks:

jalling97 commented 1 week ago

Update to the metrics that will be used:

The LeapfrogAI RAG evaluation framework will utilize the following evaluations:

LLM-as-a-judge metrics to use:

Non-LLM-enabled evaluations:

Non-RAG LLM benchmarks:

jalling97 commented 1 week ago

Closing this to move forward on implementation