Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

🗺 LLM Evaluations #918

Closed by mikeldking 8 months ago

mikeldking commented 1 year ago

As a user, I want to be able to evaluate my LLM application's execution using a set of evaluations. These metrics should help identify inference or trace cohorts with degraded performance. Since LLMs have strong reasoning ability, we should leverage their capacity to judge things like relevancy to create a set of building blocks that can be used to derive more traditional evaluation metrics like "accuracy" and "NDCG".
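
As an illustration of how an LLM-judged building block can feed a traditional metric, the sketch below rolls per-document relevance labels up into NDCG@k using the standard DCG formula. The labels and ranking here are hypothetical placeholders standing in for the output of a relevance eval.

import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k ranked items.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # NDCG: DCG normalized by the DCG of an ideal (sorted) ranking.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical binary relevance labels produced by an LLM relevance eval,
# in the order the documents were retrieved.
relevance_labels = [1, 0, 1, 1, 0]
print(ndcg_at_k(relevance_labels, k=5))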

Milestone 1

Build the foundational building blocks for evaluating a model and prompt template combination. This requires a BaseEvals class or an execution function. All evals at this stage should sit under /experimental.

The primitive function would look something like:


def evaluate(prompt_template, variables, model, invocation_parameters, output_structure):
    """
    A function that takes a prompt template, a set of variables to apply, a
    model (LLM) to invoke it with, the invocation parameters, and an output
    structure (guard rails) to force the LLM output to conform to a specific
    shape. The function always returns a number.

    Parameters
    ----------
    prompt_template : str
        A string containing the prompt template to use.
    variables : dict
        A dictionary containing the variables to apply to the prompt template.
    model : LLM
        The LLM to invoke.
    invocation_parameters : dict
        A dictionary containing the invocation parameters for the LLM.
    output_structure : dict
        A representation of the structure in which you want the output data,
        e.g. { 0: "relevant", 1: "irrelevant" }

    Examples
    --------
    >>> evaluate("{query} is related to {reference}.", { "query": "cat", "reference": "tiger" }, model, { "max_tokens": 10, "temperature": 0.5, "top_p": 1.0, "frequency_penalty": 0.0, "presence_penalty": 0.0, "stop": ["."] }, { 0: "relevant", 1: "irrelevant" })
    """
    # Generate the prompt from the template and variables
    prompt = prompt_template.format(**variables)
    # Append instructions that constrain the output to the desired structure
    prompt = apply_structure(prompt, output_structure)
    # Invoke the model
    response = model.query(prompt, invocation_parameters)
    # Enforce the guard rails and map the raw response to a number
    return validate_and_parse_response(response, output_structure)
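
The helpers apply_structure and validate_and_parse_response are left undefined above. A minimal sketch of what they might do, assuming the output structure is a dict mapping numeric scores to labels as in the docstring, could be:

def apply_structure(prompt, output_structure):
    # Append an instruction asking the model to answer with exactly one of
    # the allowed labels, e.g. "relevant" or "irrelevant".
    labels = ", ".join(f'"{label}"' for label in output_structure.values())
    return f"{prompt}\nAnswer with exactly one of: {labels}."

def validate_and_parse_response(response, output_structure):
    # Map the model's text back to its numeric score; raise if the response
    # does not conform to the expected structure.
    normalized = response.strip().strip('."').lower()
    for score, label in output_structure.items():
        if normalized == label:
            return score
    raise ValueError(f"Unexpected LLM output: {response!r}")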

The above function should serve as the basis for building more template-driven metrics over a dataframe (see the sketch below). In the long run it will be extended to compute things like "hallucinations" and "toxicity".
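
A minimal sketch of running the primitive over a dataframe of query/reference pairs. The column names, the relevance template, and the invocation parameters here are assumptions rather than a finalized API, and model is assumed to be an already-instantiated LLM wrapper.

import pandas as pd

relevance_template = "{query} is related to {reference}."
output_structure = {0: "relevant", 1: "irrelevant"}

df = pd.DataFrame(
    {
        "query": ["cat", "how do I reset my password?"],
        "reference": ["tiger", "a recipe for banana bread"],
    }
)

# Apply the evaluate primitive row by row to produce a relevance column.
df["relevance"] = df.apply(
    lambda row: evaluate(
        relevance_template,
        {"query": row["query"], "reference": row["reference"]},
        model,
        {"max_tokens": 10, "temperature": 0.0},
        output_structure,
    ),
    axis=1,
)

From such a column, aggregate metrics (e.g. percent relevant, or NDCG as sketched earlier) can be computed per cohort.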

Developer Experience

Models

Data Science Rigor

Documentation

Finalize

References

PRD: https://docs.google.com/document/d/1I4rgr6UBB6UhuOjtcO2sTu-P5hlgDFC1YA1UZJezqog/edit#heading=h.w2uwim8q3nse
RAGAS (RAG assessment): https://explodinggradients.com/all-about-evaluating-large-language-models
Notebook: https://colab.research.google.com/drive/136DtFTWSCN22FvLEJsJ6iJdFvPCya2Xl#scrollTo=ZScuXSFoc_8E

mikeldking commented 8 months ago

Completed in 2023