dotnet / eShopSupport

A reference .NET application using AI for a customer support ticketing system

Change assistant to use multiple LLM calls with manual control flow #14

Closed: SteveSandersonMS closed this 4 months ago

SteveSandersonMS commented 4 months ago

Previously, the assistant was implemented by combining all planning and execution rules into a single LLM prompt, which would return streaming JSON in multiple formats depending on what action it chose to take.

This had one major advantage: it made the smallest possible number of LLM calls, and hence was cheap to run. However, it was also limited in two main ways:

  1. Streaming JSON: since the final "answer" text was itself embedded in a JSON response, the JSON had to be parsed incrementally as it arrived in a stream of chunks. This is very difficult to do efficiently with System.Text.Json, so the JSON was streamed into JS, where a 3rd-party library parsed the incomplete JSON and non-Blazor logic updated the UI as different bits of info arrived (e.g., the "search phrase" or the "can use as reply to customer" flag). Altogether this had a bit of a Rube Goldberg feel to it (see the sketch after this list).
  2. Limited complexity of behavioral rules: since all possible assistant actions and rules had to be expressed in a single LLM prompt, this limited how many rules could be followed reliably. LLMs aren't infinitely smart, especially the cheaper ones like Mistral 7B or GPT 3.5 Turbo that this app is designed around, and if you give them too many rules they'll simply fail to follow them accurately. Even the few rules expressed in the prompt for this app seemed to be at the outer limit of what these LLMs could understand, and adding or changing anything could degrade other behaviors.
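
To make point 1 concrete, here's a minimal sketch (not the actual code from this repo, which pushed this work into JS with a 3rd-party library) of the naive "buffer and re-parse on every chunk" workaround that System.Text.Json nudges you toward:

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical sketch: buffer the streamed text and attempt a full parse each
// time a chunk arrives. Incomplete JSON throws, so nothing useful can be
// extracted until enough of the document has arrived - which is exactly why
// truly incremental display requires a streaming or fault-tolerant parser.
var buffer = new StringBuilder();

void OnChunkReceived(string chunk)
{
    buffer.Append(chunk);
    try
    {
        using var doc = JsonDocument.Parse(buffer.ToString());
        if (doc.RootElement.TryGetProperty("searchPhrase", out var phrase))
        {
            // e.g., update the UI as soon as the search phrase is available
            Console.WriteLine($"Search phrase: {phrase.GetString()}");
        }
    }
    catch (JsonException)
    {
        // Still mid-stream; wait for more chunks.
    }
}
```

Besides re-parsing the whole buffer on every chunk, this still can't surface a partially-received string value (like the answer text itself), which is what makes the single-prompt JSON approach awkward for streaming UIs.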

The updated approach in this PR follows a pattern that many others seem to use: split the rules into separate LLM prompts, submit them one at a time, and use regular programming for branching and other control flow.

Although the agent will now perform up to 3x as many LLM calls after this PR (1. search, 2. generate an answer, 3. determine whether the answer is suitable to send directly to the customer), the benefit is that it's a more realistic pattern that scales better in complexity, as sketched below.
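
To illustrate the shape of the new flow (the type names and prompt wording below are illustrative only, not the actual ones in this PR), the control flow becomes ordinary C# with one narrowly-scoped prompt per step:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical minimal client: sends one prompt, returns the completion text.
public interface ILlmClient
{
    Task<string> CompleteAsync(string prompt);
}

public class AssistantPipeline(ILlmClient llm, Func<string, Task<string>> searchManuals)
{
    public async Task<(string Answer, bool SuitableForCustomer)> HandleAsync(string ticketText)
    {
        // Step 1: a small prompt whose only job is to produce a search phrase.
        var searchPhrase = await llm.CompleteAsync(
            $"Suggest a short search phrase for finding relevant product manual content for this ticket:\n{ticketText}");

        var manualExcerpts = await searchManuals(searchPhrase);

        // Step 2: a second prompt drafts the answer from the retrieved excerpts.
        var answer = await llm.CompleteAsync(
            $"Using only the following manual excerpts:\n{manualExcerpts}\n\nDraft a reply to this ticket:\n{ticketText}");

        // Step 3: a third prompt decides whether the draft can go straight to the customer.
        var verdict = await llm.CompleteAsync(
            $"Answer yes or no: is the following reply suitable to send directly to a customer?\n{answer}");

        return (answer, verdict.Trim().StartsWith("yes", StringComparison.OrdinalIgnoreCase));
    }
}
```

Because each prompt has only one job, each can be tuned or swapped independently, and branching (e.g., skipping the search step) is just an `if` statement rather than yet another rule crammed into a mega-prompt.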

Evaluation

// Before PR: After 100 questions: average score = 0.770, average duration = 4374.144ms
// After PR:  After 100 questions: average score = 0.720, average duration = 5270.460ms

The quality score is a little lower, but that may be explained by the fact that I did a lot of prompt-engineering iterations on the old version and very little on the new one. The newer prompts could probably be refined further.

The increase in duration is obviously expected since we're now waiting for multiple LLM calls.

One could argue this is a bad change, since it worsens time, cost, and quality. However, I think it's a more realistic app pattern that would scale up to a wider range of scenarios, so I'm going ahead with it.

If we later get signal that developers prefer the model of merging everything for a RAG system into a single, JSON-returning prompt, we can reconsider.

SteveSandersonMS commented 4 months ago

cc @stephentoub - you may find this interesting since it ties directly to our prior discussions about whether or not developers will inevitably need to parse streaming JSON. I did more research around this topic and found that there seem to be 3 main patterns people use for getting structured output in RAG or agent-like systems:

  1. JSON for everything. For streaming, this necessitates either a true streaming parser or at least a fault-tolerant parser you can call multiple times. External example: https://www.boundaryml.com/blog/nextjs-rag-streaming-example
  2. Plain text for everything, where the text contains mini-syntaxes like bits of XML/JSON/etc to represent structured data for search requests, citations or other flags (a minimal sketch follows this list). External example: https://medium.com/@yotamabraham/in-text-citing-with-langchain-question-answering-e19a24d81e39
  3. Multiple LLM prompts, some returning JSON, some returning text, tied together with regular control flow logic. External example: SK's "Chat Copilot" sample
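
For comparison, here's roughly what pattern 2 looks like on the consuming side (the `<cite>` tag format here is made up for illustration, not taken from the linked article): the model streams mostly plain text, and the app scans it for embedded markers.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical mini-syntax: the prompt asks the model to wrap citations as
// <cite id="123">quoted text</cite> inside otherwise plain streaming text.
// The app extracts the markers and renders the rest as ordinary prose.
var modelOutput = "You can reset the device by holding the power button. " +
                  "<cite id=\"42\">Hold the power button for 10 seconds to reset.</cite>";

var citations = Regex.Matches(modelOutput, "<cite id=\"(?<id>\\d+)\">(?<quote>.*?)</cite>")
    .Select(m => (Id: m.Groups["id"].Value, Quote: m.Groups["quote"].Value))
    .ToList();

var plainText = Regex.Replace(modelOutput, "<cite id=\"\\d+\">.*?</cite>", "").Trim();

Console.WriteLine(plainText);
foreach (var (id, quote) in citations)
{
    Console.WriteLine($"[{id}] {quote}");
}
```

This avoids the streaming-JSON problem entirely, but the mini-syntax is ad hoc and every consumer has to agree on it.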

While I think all three of these could have a legitimate place, approach 3 is the easiest to scale up in complexity, so if we need to pick one for a sample I think this will best serve readers (though if industry patterns settle on something else later, we can change).