As a user, I want to be able to evaluate my LLM application's execution using a set of evaluations. These metrics should help identify inference or trace cohorts with degraded performance. Since LLMs have strong deductive power, we should leverage their ability to judge qualities like relevancy to create a set of building blocks from which more traditional evaluation metrics like "accuracy" and "NDCG" can be derived.
Milestone 1
Build the foundational building blocks for evaluating a model and template combination. This requires building a BaseEvals class or execution function. All evals at this stage should sit under /experimental
The primitive function would look something like:
def evaluate(prompt_template, variables, model, invocation_parameters, output_structure):
    """
    A function that takes a prompt template, a set of variables to apply, a model
    (LLM) to invoke it with, the invocation parameters, and guard rails to force
    the LLM output to conform to a specific structure. The function always
    returns a number.

    Parameters
    ----------
    prompt_template : str
        A string containing the prompt template to use.
    variables : dict
        A dictionary containing the variables to apply to the prompt template.
    model : LLM
        The LLM to invoke.
    invocation_parameters : dict
        A dictionary containing the invocation parameters for the LLM.
    output_structure : dict
        A representation of the structure in which you want the output data,
        e.g. { 0: "relevant", 1: "irrelevant" }

    Examples
    --------
    >>> evaluate(
    ...     "{query} is related to {reference}.",
    ...     {"query": "cat", "reference": "tiger"},
    ...     model,
    ...     {"max_tokens": 10, "temperature": 0.5, "top_p": 1.0,
    ...      "frequency_penalty": 0.0, "presence_penalty": 0.0, "stop": ["."]},
    ...     {0: "relevant", 1: "irrelevant"},
    ... )
    """
    # Generate the prompt by applying the variables to the template
    prompt = prompt_template.format(**variables)
    # Append instructions that constrain the LLM output to the desired structure
    prompt = apply_structure(prompt, output_structure)
    # Invoke the model
    response = model.query(prompt, invocation_parameters)
    # Validate the raw response and map it onto the output structure
    return validate_and_parse_response(response, output_structure)
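The apply_structure and validate_and_parse_response helpers referenced above are left undefined at this stage. As a minimal sketch of what they might do (the exact prompt wording and parsing behavior here are assumptions, not a finalized design):

def apply_structure(prompt, output_structure):
    # Turn the structure mapping into an instruction the LLM can follow, e.g.
    # 'Answer with a single number: 0 for "relevant", 1 for "irrelevant".'
    choices = ", ".join(f'{key} for "{label}"' for key, label in output_structure.items())
    return f"{prompt}\nAnswer with a single number: {choices}."

def validate_and_parse_response(response, output_structure):
    # Guard rail: reject anything that is not one of the allowed keys, so the
    # primitive always returns a number from the structure.
    value = int(response.strip())
    if value not in output_structure:
        raise ValueError(f"Unexpected LLM output: {response!r}")
    return value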
The above function should be usable as a building block for computing template-based metrics over a dataframe, as sketched below. In the long run it will be extended to compute things like "hallucinations" and "toxicity".
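As an illustration of how the primitive could run over a dataframe (the column names, relevance template, and model instantiation below are assumptions made for the sketch):

import pandas as pd

# Hypothetical retrieval results to score for relevance.
df = pd.DataFrame({
    "query": ["cat", "cat"],
    "reference": ["tiger", "toaster"],
})

model = OpenAIModel()  # assumed to be an instantiated LLM wrapper
RELEVANCE_TEMPLATE = "{query} is related to {reference}."
OUTPUT_STRUCTURE = {0: "relevant", 1: "irrelevant"}

# Apply the evaluate() primitive row by row to produce a numeric eval column.
df["relevance"] = df.apply(
    lambda row: evaluate(
        RELEVANCE_TEMPLATE,
        {"query": row["query"], "reference": row["reference"]},
        model,
        {"max_tokens": 10, "temperature": 0.0},
        OUTPUT_STRUCTURE,
    ),
    axis=1,
)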
[x] #1204
[x] #1205
[x] #1206
[x] #1201
[x] #1207
[x] #1208
[x] #1202
[x] #1203
[x] #1276
[x] [evals] hallucination dataset and experimentation notebook
Developer Experience
Models
Data Science Rigor
Documentation
Finalize
References
PRD - https://docs.google.com/document/d/1I4rgr6UBB6UhuOjtcO2sTu-P5hlgDFC1YA1UZJezqog/edit#heading=h.w2uwim8q3nse
RAGAS - RAG assessment: https://explodinggradients.com/all-about-evaluating-large-language-models
Notebook - https://colab.research.google.com/drive/136DtFTWSCN22FvLEJsJ6iJdFvPCya2Xl#scrollTo=ZScuXSFoc_8E