Context:
Evaluating the output of RAG models is increasingly important as these models continue to be developed and deployed. Although there is no established standard for evaluating RAG model output, the work associated with this issue attempts to provide a workflow for evaluating that output programmatically.
Objective:
Demonstrate how to evaluate RAG model output using annotated questions and answers. Additionally, highlight how well various RAG models respond to a set of known questions and answers. The set can be small, but should contain at least 10 question-answer pairs.
Path to Completion:
[ ] Establish a payment method on OpenAI
[ ] Curate an annotated dataset (see the dataset sketch after this list)
[ ] Follow a simple evaluation scheme using langchain (see the grading sketch below)
[ ] Monitor and compare output using langsmith (see the comparison sketch below)
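A minimal sketch of the dataset curation step, assuming the annotated question-answer pairs live in a local JSON file and get uploaded via the langsmith Python client; the file path and dataset name are placeholders.

```python
import json

from langsmith import Client  # pip install langsmith; needs LANGCHAIN_API_KEY set

# Hypothetical local file holding at least 10 annotated question-answer pairs.
with open("annotated_qa.json") as f:
    qa_pairs = json.load(f)  # [{"question": "...", "answer": "..."}, ...]

client = Client()
dataset = client.create_dataset(
    dataset_name="rag-eval-qa",  # placeholder dataset name
    description="Annotated QA pairs for RAG evaluation",
)
for pair in qa_pairs:
    client.create_example(
        inputs={"question": pair["question"]},
        outputs={"answer": pair["answer"]},
        dataset_id=dataset.id,
    )
```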
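One possible shape for the simple evaluation scheme, using langchain's built-in LLM-graded "qa" evaluator (which, like ragas, calls OpenAI under the hood, hence the payment method step). The question, answers, and grader model shown are illustrative; the ragas package in the references offers richer metrics such as faithfulness and answer relevancy.

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI  # needs OPENAI_API_KEY set

# The prediction would come from the RAG chain under test.
question = "What year was the library founded?"        # annotated question
model_answer = "It was founded in 1895."               # placeholder RAG output
reference_answer = "The library was founded in 1895."  # annotated answer

# The grader LLM compares the prediction against the annotated reference.
evaluator = load_evaluator("qa", llm=ChatOpenAI(model="gpt-4o-mini", temperature=0))
result = evaluator.evaluate_strings(
    input=question,
    prediction=model_answer,
    reference=reference_answer,
)
print(result)  # e.g. {"reasoning": "...", "value": "CORRECT", "score": 1}
```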
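A sketch of how runs could be monitored and compared in langsmith, assuming the dataset uploaded above; the target function, evaluator, and experiment prefix are placeholders. Running one experiment per RAG model makes them comparable side by side, per the "Test Run Comparisons" reference below.

```python
from langsmith.evaluation import evaluate

def rag_model_a(inputs: dict) -> dict:
    """Placeholder target: swap in the real RAG chain under test."""
    return {"answer": "It was founded in 1895."}

def exact_match(run, example) -> dict:
    """Toy evaluator comparing the model answer to the annotated answer."""
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "exact_match", "score": int(predicted == expected)}

# One experiment per RAG model; repeat with a different target and prefix,
# then compare the experiments in the langsmith UI.
evaluate(
    rag_model_a,
    data="rag-eval-qa",  # dataset created in the earlier sketch
    evaluators=[exact_match],
    experiment_prefix="rag-model-a",
)
```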
References:
ragas || Package that uses OpenAI to evaluate RAG output (Exploding Gradients) || LINK
"Test Run Comparisons" || Description on how to compare RAG output across different test runs (Langchain) || LINK