Closed — SteveSandersonMS closed this 4 months ago
cc @stephentoub - you may find this interesting since it ties directly to our prior discussions about whether developers will inevitably need to parse streaming JSON. I did more research on this topic and found that there seem to be 3 main patterns people use for getting structured output in RAG or agent-like systems:
While I think all three of these could have a legitimate place, approach 3 is the easiest to scale up in complexity, so if we need to pick one for a sample, I think it will best serve readers (though if industry patterns settle on something else later, we can change).
Previously, the assistant was implemented by combining all planning and execution rules into a single LLM prompt, which returned streaming JSON in multiple formats depending on which action it chose to take.
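To make the tradeoff concrete, here's a minimal sketch (not the actual implementation; the chunk contents and action shapes are made up) of what the caller ends up doing in that single-prompt design - buffering streamed fragments until they form complete JSON, then branching on whichever shape the model chose:

```python
import json

# Simulated stream of JSON fragments from the model. In the real app these
# would arrive incrementally from the LLM; the payload here is hypothetical.
streamed_chunks = ['{"action": "searc', 'h", "query": "ret', 'urn policy"}']

buffer = ""
result = None
for chunk in streamed_chunks:
    buffer += chunk
    try:
        # Only succeeds once the accumulated text is complete, valid JSON.
        result = json.loads(buffer)
        break
    except json.JSONDecodeError:
        continue  # keep accumulating

# The caller must handle every JSON shape the model might emit.
if result["action"] == "search":
    print("Would run a search for:", result["query"])
elif result["action"] == "answer":
    print("Would send answer:", result["text"])
```

The parsing itself is manageable; the awkward part is that one response can take any of several formats, so the consuming code has to anticipate all of them.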
This had one major advantage: it made the smallest possible number of LLM calls, and hence was cheap to run. However, it was also limited in two main ways:
The updated approach in this PR follows a pattern that many others seem to use: split the rules into separate LLM prompts, submit them one at a time, and use regular programming for branching and other control flow. The benefit is:
Although the agent will now perform up to 3x as many LLM calls after this PR (1. search; 2. generate an answer; 3. determine whether the answer is suitable to send directly to the customer), it represents a more realistic pattern that scales better in complexity.
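The three-call flow above can be sketched as ordinary sequential code. This is only an illustrative outline, not the PR's actual implementation: `call_llm` is a hypothetical stub standing in for a real chat-client call, and the prompts and canned replies are invented.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns canned replies keyed by the
    # prompt's leading tag so the control flow below can be demonstrated.
    canned = {
        "search": "manual-section-4",
        "answer": "Returns are accepted within 30 days.",
        "review": "APPROVE",
    }
    return canned[prompt.split(":", 1)[0]]

def answer_customer(question: str) -> str:
    # 1. Search for relevant context.
    context = call_llm(f"search: {question}")
    # 2. Generate a candidate answer from the retrieved context.
    draft = call_llm(f"answer: {question} using {context}")
    # 3. Decide whether the draft is suitable to send directly.
    verdict = call_llm(f"review: {draft}")
    if verdict == "APPROVE":
        return draft
    return "Escalating to a human agent."

print(answer_customer("What is your return policy?"))
```

Because each step is a separate, single-purpose prompt, none of them needs to return multi-format streaming JSON, and the branching lives in plain code where it's easy to extend.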
Evaluation
The quality score is a little lower, but that may be because I did many iterations of prompt engineering on the old version and very little on the new one. The newer prompts could likely be refined further.
The increase in duration is expected, since we're now waiting on multiple sequential LLM calls.
One could argue that this is a bad change since it worsens time, cost, and quality. However I think it's a more realistic app pattern that would scale up to a wider range of scenarios, so I'm going ahead with it.
If we later get signal that developers prefer the model of merging everything for a RAG system into a single, JSON-returning prompt, we can reconsider.