eidolon-ai / eidolon

The first AI Agent Server, Eidolon is a pluggable Agent SDK and enterprise ready, deployment server for Agentic applications
https://www.eidolonai.com/
Apache License 2.0

Prompt Engineering Debugger #846

Open LukeLalor opened 1 month ago

LukeLalor commented 1 month ago

Problem

Prompt engineering is slow, and in a multi-agent system it is very hard: you need to experiment quickly with a message arbitrarily deep in the stack. The current replay feature is helpful, but it does not work well with Docker / Kubernetes deployments, makes it hard to correlate LLM requests (i.e., find where you are), and only supports "point in time" experimentation rather than modifying the rest of the request.

Out of Scope

This is intended to be a prompt engineering tool; developers should use code breakpoints when building custom agent templates or tools.

Proposal

Introduce a debugger into the Eidolon dev tools component. Users can set breakpoints on agents / tools and step through them with the debugger, just as they are used to doing in their IDE. They can also review a conversation after the fact, inspect LLM requests, and experiment with executing requests with alternative parameters.

Debug Pane

Debugger controls, stack, evaluation option(s), variables, and save (propagate variable changes to the system).

#################################################################################
#                                                                               #
#   Stop * Resume * Step Over * Step Into * Step Out Of                         #
#   -------------------------------------------------------------------------   #
#   Stack                             |   evaluate: > execute_llm               #
#   worker_agent.execute_llm          |   -----------------------               #
#   manager_agent.execute_tool_call   |   messages = [...]                      #
#   ...                               |   tools = [...]                         #
#                                     |   output_format = {...}                 #
#                                     |   [Reset][Save]                         #
#                                                                               #
#################################################################################
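The pane above could be backed by a simple stack-of-frames model. A minimal sketch, assuming hypothetical names (`Frame`, `DebugSession`, and the `execute_llm` / `execute_tool_call` hook names are illustrative, not Eidolon's actual API):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Frame:
    """One entry in the debugger stack, e.g. worker_agent.execute_llm."""
    agent: str
    hook: str  # execute_llm, execute_tool_call, ...
    variables: dict[str, Any] = field(default_factory=dict)

    @property
    def label(self) -> str:
        return f"{self.agent}.{self.hook}"


@dataclass
class DebugSession:
    """The stack shown in the pane; the last frame is the one being inspected."""
    stack: list[Frame] = field(default_factory=list)

    def top(self) -> Frame:
        return self.stack[-1]

    def set_variable(self, name: str, value: Any) -> None:
        """Edit a variable in the top frame; 'Save' would propagate the change."""
        self.top().variables[name] = value


# Mirror the mockup: a manager frame below a worker's LLM call.
session = DebugSession(stack=[
    Frame("manager_agent", "execute_tool_call"),
    Frame("worker_agent", "execute_llm",
          variables={"messages": [], "tools": [], "output_format": {}}),
])
session.set_variable("messages", [{"role": "user", "content": "hi"}])
```

The key design point is that `messages`, `tools`, and `output_format` are just frame variables, so the evaluate box can edit any of them before re-running.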

Breakpoint Pane

#### Breakpoint Controls ####
[ ] Disable All
Agents:
    All
        [ ] execute_action
        [ ] execute_llm
    chatbot_agent
        [ ] execute_action
        [ ] execute_llm
    qa_agent
        ...
Logic Units
    All
        [ ] execute_tool_call
    ...
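The pane's state reduces to per-agent, per-hook flags with an "All" default and a global disable switch. A minimal sketch, assuming a hypothetical `BreakpointConfig` (not Eidolon's actual types), where an agent-specific setting overrides the "All" row:

```python
from dataclasses import dataclass, field


@dataclass
class BreakpointConfig:
    disable_all: bool = False                                       # [ ] Disable All
    defaults: dict[str, bool] = field(default_factory=dict)         # the "All" rows
    per_agent: dict[str, dict[str, bool]] = field(default_factory=dict)

    def should_break(self, agent: str, hook: str) -> bool:
        if self.disable_all:
            return False
        # An agent-specific checkbox wins over the "All" default.
        return self.per_agent.get(agent, {}).get(hook, self.defaults.get(hook, False))


# Break on every agent's execute_llm except qa_agent's.
config = BreakpointConfig(
    defaults={"execute_llm": True},
    per_agent={"qa_agent": {"execute_llm": False}},
)
```

With this shape, "Disable All" is a single flag the UI toggles without losing the individual checkbox state underneath it.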

Technical Notes

Dev Plan

To reduce scope we can start with part 2: re-executing LLM / tool requests. This lets us introduce most of the concepts (frames, stack, variables, execution) without needing to worry about enabling/disabling breakpoints, debugger controls (step into / step over), their REST API, or resuming a paused execution.
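The re-execution piece can be sketched independently of any stepping machinery: take a recorded request, merge in the user's edits, and run it again. The `recorded` shape and the `run_llm` callable below are assumptions for illustration, not Eidolon's actual interfaces:

```python
from typing import Any, Callable


def re_execute(recorded_request: dict[str, Any],
               overrides: dict[str, Any],
               run_llm: Callable[[dict[str, Any]], str]) -> str:
    """Merge edited variables (e.g. new messages or temperature) into the
    recorded request and rerun it, leaving the original record untouched."""
    request = {**recorded_request, **overrides}
    return run_llm(request)


# Stub LLM call for illustration; a real implementation would hit the model.
def fake_llm(request: dict[str, Any]) -> str:
    return f"answered {len(request['messages'])} messages at temp {request['temperature']}"


recorded = {"messages": [{"role": "user", "content": "hi"}], "temperature": 0.0}
result = re_execute(recorded, {"temperature": 0.7}, fake_llm)
```

Because the merge is non-destructive, the same recorded request can be replayed repeatedly with different overrides, which is exactly the "experiment with alternative parameters" workflow from the proposal.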

LukeLalor commented 1 month ago

Questions:

When brainstorming this feature, we talked about several breakpoint locations

I think we might want to narrow it down all the way to just pre-LLM execution. At the end of the day, this is a prompt engineering tool, not a general debugging tool, and we need to focus accordingly.

flynntsang commented 1 month ago

Reference: Watch the video at https://www.braintrust.dev/blog/announcing-series-a for an example of how they score and evaluate prompts. @parmi02 @LukeLalor