After multiple days of knocking the idea around in my head, I finally feel like I have the right sketches ready for LLM experimentation.
To start, we take advantage of message_log.db as our storage system. Since we already have prompt version hashing and logging, we don't need to do much more on that side. Instead, we can keep track of experiments and experiment_runs.
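Concretely, that could be as little as two new tables next to the existing message log. A rough sketch, with table and column names as placeholders rather than a settled schema:
# Sketch only: experiments and experiment_runs as two small tables in message_log.db.
# Column names are placeholders; what actually goes into a run is spelled out below.
import sqlite3

with sqlite3.connect("message_log.db") as conn:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS experiments (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS experiment_runs (
            id INTEGER PRIMARY KEY,
            experiment_id INTEGER REFERENCES experiments(id),
            run_data TEXT,  -- JSON blob; see below for what it needs to hold
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
        """
    )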
experiments are collections of experiment_runs. In one experiment_run, we are doing one round of text generation and one round of evaluation. Things that need to be logged here are:
all @prompt-decorated prompts that are used in the program and their versions,
all *Bot objects that are used in the program, with their system prompts, temperatures, and model names,
the "conversation log", i.e. taking advantage of the patterns already set out in log_interaction.
In an evaluation of the experiment run, one can have arbitrary metrics stored. I would like to enforce metrics to be scalar-valued: booleans are represented as 0 and 1, ordinals are integers from 0 to N, and scalar-valued floats are allowable. But no strings; that's not a metric. And each metric function defines one metric name and yields one scalar-valued score per run. Here, the log entry for metrics will look like this:
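# Hypothetical metrics record: one scalar per metric function per run (field names are placeholders).
metric_entry = {"run_id": "...", "metric_name": "name_length", "value": 17}
The @metric decorator used below can then be a thin wrapper whose only jobs are to type-check the return value and hand it to the logger. A minimal sketch, assuming bool/int/float are the only allowed types and with the logging call left as a placeholder:
import functools


def metric(fn):
    """Sketch of @metric: validate that fn returns a scalar, then log it."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        value = fn(*args, **kwargs)
        if isinstance(value, bool):
            value = int(value)  # booleans stored as 0/1
        if not isinstance(value, (int, float)):
            raise TypeError(
                f"{fn.__name__} must return a scalar metric value, got {type(value).__name__}."
            )
        # Placeholder: record (metric_name=fn.__name__, value=value) against the current run.
        return value

    return wrapper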
The final piece that we need is a context manager. Its API looks something like this:
# Experiment setup
from llamabot import StructuredBot, prompt
from pydantic import BaseModel, Field

@prompt("system")
def jdbot_sysprompt(type_of_manager):
    """You are an {{ type_of_manager }}."""

@prompt("user")
def jdbot_user_message(job_description):
    """Give me a name for a job that follows this description: {{ job_description }}."""

class JobDescription(BaseModel):
    name: str = Field(..., description="A job name.")
    description: str = Field(..., description="A job description.")
@experiment
def name_generator():
    bot = StructuredBot(jdbot_sysprompt("data science manager"), model_name="gpt-4o", pydantic_model=JobDescription)
    response = bot(jdbot_user_message("someone who builds full stack AI apps"))
    return response

@metric  # <-- this decorator validates that the eval function returns a scalar-type thing
def name_length(response):
    return len(response.name)
@prompt("system")
def judgebot_sysprompt():
"""You are a judge of how cool a name is."""
@prompt("user")
def judgebot_userprompt(namebot_response):
"""Return for me your coolness score: 1-10."""
class JobNameCoolness(BaseModel):
score: int = field(..., description="How cool the job name is. 1 = not cool, 10 = amazeballer.")
@metric
def llm_judge(namebot_response):
judgebot = StructuredBot(model_name="gpt-4o", pydantic_model=JobNameCoolness)
coolness = judgebot(judgebot_userprompt(namebot_response)
return coolness
# Experiment execution. Each execution of this experiment gets us one new run.
with Experiment("experiment_name", num_executions=10) as experiment:
    # We auto-parallelize the code inside the context manager block 10x using a concurrent thread pool.
    # Logging automatically happens within the Experiment context.
    response = name_generator()
    name_length(response)
    llm_judge(response)
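For the record, here's roughly how I imagine the context manager working under the hood. This is a sketch, not a settled design: every name apart from Experiment is a placeholder, and the biggest open question is that Python can't literally re-execute the body of a with block, so num_executions probably means handing the experiment function to a thread pool N times (e.g. something like experiment.run(name_generator)) rather than re-running the block itself.
# Sketch of Experiment: a context manager that collects run records and fans the
# experiment function out over a thread pool. All names besides Experiment are placeholders.
import concurrent.futures
import uuid


class Experiment:
    def __init__(self, name: str, num_executions: int = 1):
        self.name = name
        self.num_executions = num_executions
        self.runs = []  # one record per experiment_run

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # This is where the collected runs would get flushed into message_log.db.
        return False

    def run(self, fn, *args, **kwargs):
        """Execute fn num_executions times in parallel; each call is one experiment_run."""

        def one_run():
            record = {"run_id": uuid.uuid4().hex, "result": fn(*args, **kwargs)}
            self.runs.append(record)  # list.append is thread-safe in CPython
            return record

        with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_executions) as pool:
            futures = [pool.submit(one_run) for _ in range(self.num_executions)]
            return [f.result() for f in futures]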
Ok, I'm tired, will let this sit for a while.