After multiple days of knocking the idea around in my head, I finally feel like I have the right sketches ready for LLM experimentation.
To start, we take advantage of message_log.db as our storage system. Since we already have prompt version hashing and logging, we don't need to do much more on that side. Instead, we can keep track of experiments and experiment_runs.
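Concretely, that could be as little as two new tables next to the existing message log. A rough sketch, with table and column names as placeholders rather than a settled schema:
# Sketch only: experiments and experiment_runs as two small tables in message_log.db.
# Column names are placeholders; what actually goes into a run is spelled out below.
import sqlite3

with sqlite3.connect("message_log.db") as conn:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS experiments (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS experiment_runs (
            id INTEGER PRIMARY KEY,
            experiment_id INTEGER REFERENCES experiments(id),
            run_data TEXT,  -- JSON blob; see below for what it needs to hold
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
        """
    )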
experiments are collections of experiment_runs. In one experiment_run, we are doing one round of text generation and one round of evaluation. Things that need to be logged here are:
all @prompt-decorated prompts that are used in the program and their versions,
all *Bot objects that are used in the program, with their system prompts, temperatures, and model names,
the "conversation log", i.e. taking advantage of the patterns already set out in log_interaction.
In an evaluation of the experiment run, one can have arbitrary metrics stored. I would like to enforce metrics to be scalar-valued: booleans are represented as 0 and 1, ordinals are integers from 0 to N, and scalar-valued floats are allowable. But no strings; that's not a metric. And each metric function defines one metric name and yields one scalar-valued score per run. Here, the log entry for metrics will look like this:
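# Hypothetical metrics record: one scalar per metric function per run (field names are placeholders).
metric_entry = {"run_id": "...", "metric_name": "name_length", "value": 17}
The @metric decorator used below can then be a thin wrapper whose only jobs are to type-check the return value and hand it to the logger. A minimal sketch, assuming bool/int/float are the only allowed types and with the logging call left as a placeholder:
import functools


def metric(fn):
    """Sketch of @metric: validate that fn returns a scalar, then log it."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        value = fn(*args, **kwargs)
        if isinstance(value, bool):
            value = int(value)  # booleans stored as 0/1
        if not isinstance(value, (int, float)):
            raise TypeError(
                f"{fn.__name__} must return a scalar metric value, got {type(value).__name__}."
            )
        # Placeholder: record (metric_name=fn.__name__, value=value) against the current run.
        return value

    return wrapper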
The final piece that we need is a context manager. Its API looks something like this:
# Experiment setup
from llamabot import StructuredBot, prompt
from pydantic import BaseModel, Field

@prompt("system")
def jdbot_sysprompt(type_of_manager):
    """You are an {{ type_of_manager }}."""

@prompt("user")
def jdbot_user_message(job_description):
    """Give me a name for a job that follows this description: {{ job_description }}."""

class JobDescription(BaseModel):
    name: str = Field(..., description="A job name.")
    description: str = Field(..., description="A job description.")
@experiment
def name_generator():
    bot = StructuredBot(jdbot_sysprompt("data science manager"), model_name="gpt-4o", pydantic_model=JobDescription)
    response = bot(jdbot_user_message("someone who builds full stack AI apps"))
    return response

@metric  # <-- this decorator validates that the eval function returns a scalar-type thing
def name_length(response):
    return len(response.name)
@prompt("system")
def judgebot_sysprompt():
"""You are a judge of how cool a name is."""
@prompt("user")
def judgebot_userprompt(namebot_response):
"""Return for me your coolness score: 1-10."""
class JobNameCoolness(BaseModel):
score: int = field(..., description="How cool the job name is. 1 = not cool, 10 = amazeballer.")
@metric
def llm_judge(namebot_response):
judgebot = StructuredBot(model_name="gpt-4o", pydantic_model=JobNameCoolness)
coolness = judgebot(judgebot_userprompt(namebot_response)
return coolness
# Experiment execution. Each execution of this experiment gets us one new run.
with Experiment("experiment_name", num_executions=10) as experiment:
    # We auto-parallelize the code inside the context manager block 10x using a concurrent thread pool.
    # Logging automatically happens within the Experiment context.
    response = name_generator()
    name_length(response)
    llm_judge(response)
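For the record, here's roughly how I imagine the context manager working under the hood. This is a sketch, not a settled design: every name apart from Experiment is a placeholder, and the biggest open question is that Python can't literally re-execute the body of a with block, so num_executions probably means handing the experiment function to a thread pool N times (e.g. something like experiment.run(name_generator)) rather than re-running the block itself.
# Sketch of Experiment: a context manager that collects run records and fans the
# experiment function out over a thread pool. All names besides Experiment are placeholders.
import concurrent.futures
import uuid


class Experiment:
    def __init__(self, name: str, num_executions: int = 1):
        self.name = name
        self.num_executions = num_executions
        self.runs = []  # one record per experiment_run

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # This is where the collected runs would get flushed into message_log.db.
        return False

    def run(self, fn, *args, **kwargs):
        """Execute fn num_executions times in parallel; each call is one experiment_run."""

        def one_run():
            record = {"run_id": uuid.uuid4().hex, "result": fn(*args, **kwargs)}
            self.runs.append(record)  # list.append is thread-safe in CPython
            return record

        with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_executions) as pool:
            futures = [pool.submit(one_run) for _ in range(self.num_executions)]
            return [f.result() for f in futures]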
Ok, I'm tired, will let this sit for a while.