As a user, I want to be able to evaluate my LLM application's execution using a set of evaluations. These metrics should help identify inference or trace cohorts with degraded performance. Since LLMs have strong deductive power, we should leverage their ability to judge qualities like relevancy to create a set of building blocks from which more traditional evaluation metrics like "accuracy" and "NDCG" can be derived.
Milestone 1
Build the foundational building blocks for evaluating a model and template combination. This requires building a BaseEvals class or execution function. All evals at this stage should sit under /experimental
The primitive function would look something like:
def evaluate(prompt_template, variables, model, invocation_parameters, output_structure):
    """
    A function that takes a prompt template, a set of variables to apply, a model
    (LLM) to invoke it with, the invocation parameters, and guard rails to force
    the LLM output to conform to a specific structure. The function always
    returns a number.

    Parameters
    ----------
    prompt_template : str
        A string containing the prompt template to use.
    variables : dict
        A dictionary containing the variables to apply to the prompt template.
    model : LLM
        The LLM to invoke.
    invocation_parameters : dict
        A dictionary containing the invocation parameters for the LLM.
    output_structure : dict
        A representation of the structure in which you want the output data,
        e.g. { 0: "relevant", 1: "irrelevant" }

    Examples
    --------
    >>> evaluate(
    ...     "{query} is related to {reference}.",
    ...     {"query": "cat", "reference": "tiger"},
    ...     model,
    ...     {"max_tokens": 10, "temperature": 0.5, "top_p": 1.0,
    ...      "frequency_penalty": 0.0, "presence_penalty": 0.0, "stop": ["."]},
    ...     {0: "relevant", 1: "irrelevant"},
    ... )
    """
    # Generate the prompt by applying the variables to the template
    prompt = prompt_template.format(**variables)
    # Append instructions that constrain the LLM output to the desired structure
    prompt = apply_structure(prompt, output_structure)
    # Invoke the model
    response = model.query(prompt, invocation_parameters)
    # Validate the raw response and map it onto the output structure
    return validate_and_parse_response(response, output_structure)
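The apply_structure and validate_and_parse_response helpers referenced above are left undefined at this stage. As a minimal sketch of what they might do (the exact prompt wording and parsing behavior here are assumptions, not a finalized design):

def apply_structure(prompt, output_structure):
    # Turn the structure mapping into an instruction the LLM can follow, e.g.
    # 'Answer with a single number: 0 for "relevant", 1 for "irrelevant".'
    choices = ", ".join(f'{key} for "{label}"' for key, label in output_structure.items())
    return f"{prompt}\nAnswer with a single number: {choices}."

def validate_and_parse_response(response, output_structure):
    # Guard rail: reject anything that is not one of the allowed keys, so the
    # primitive always returns a number from the structure.
    value = int(response.strip())
    if value not in output_structure:
        raise ValueError(f"Unexpected LLM output: {response!r}")
    return value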
The above function should be usable as a building block for computing template-based metrics over a dataframe, as sketched below. In the long run it will be extended to compute things like "hallucinations" and "toxicity".
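As an illustration of how the primitive could run over a dataframe (the column names, relevance template, and model instantiation below are assumptions made for the sketch):

import pandas as pd

# Hypothetical retrieval results to score for relevance.
df = pd.DataFrame({
    "query": ["cat", "cat"],
    "reference": ["tiger", "toaster"],
})

model = OpenAIModel()  # assumed to be an instantiated LLM wrapper
RELEVANCE_TEMPLATE = "{query} is related to {reference}."
OUTPUT_STRUCTURE = {0: "relevant", 1: "irrelevant"}

# Apply the evaluate() primitive row by row to produce a numeric eval column.
df["relevance"] = df.apply(
    lambda row: evaluate(
        RELEVANCE_TEMPLATE,
        {"query": row["query"], "reference": row["reference"]},
        model,
        {"max_tokens": 10, "temperature": 0.0},
        OUTPUT_STRUCTURE,
    ),
    axis=1,
)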
[x] #1204
[x] #1205
[x] #1206
[x] #1201
[x] #1207
[x] #1208
[x] #1202
[x] #1203
[x] #1276
[x] [evals] hallucination dataset and experimentation notebook
Developer Experience
Models
Data Science Rigor
Documentation
Finalize
References
PRD - https://docs.google.com/document/d/1I4rgr6UBB6UhuOjtcO2sTu-P5hlgDFC1YA1UZJezqog/edit#heading=h.w2uwim8q3nse
RAGAS - RAG assessment: https://explodinggradients.com/all-about-evaluating-large-language-models
Notebook - https://colab.research.google.com/drive/136DtFTWSCN22FvLEJsJ6iJdFvPCya2Xl#scrollTo=ZScuXSFoc_8E