Token Management
Request Latency
Move warning disable to csproj
Purpose
Top level:
Removed all compiler warnings from the build.
Moved all pragma warning disable directives into the csproj.
ChatPane.razor:
The app now displays the latency of each request, measured with a stopwatch in Message.cs. This will be helpful for showing cache performance.
Also made changes to how tokens are displayed. More on that below.
ChatService.cs:
Removed redundant null-exception checks from a number of functions.
GetChatCompletionAsync():
Moved the call that generates vectors for the user prompt into this function. The vectors are needed in ChatService to perform the cache search.
Also moved the vector search back into this function so that GetRagCompletionAsync() is focused solely on preparing the call to the OpenAI model (see below).
GetChatSessionContextWindow() now manages conversation history using a new environment variable, MaxContextWindow, which limits conversation depth (number of prompt + completion pairs) rather than using tokens to limit the history used for vector search and cache search. Token-based limiting was not effective because the vast majority of tokens are in the RAG data, not in the prompt and completion text.
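For illustration, a minimal sketch of the depth-based trimming. The ChatTurn record and GetContextWindow method below are stand-ins, not the actual types in ChatService.cs, and maxContextWindow corresponds to the Chat:MaxContextWindow value listed under How to Test.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in for the real Message/turn types in the app.
public record ChatTurn(DateTime TimeStamp, string Prompt, string Completion);

public static class ContextWindowExample
{
    // Depth-based trimming: keep only the most recent maxContextWindow turns
    // (each turn = one prompt + one completion) instead of counting tokens.
    public static List<ChatTurn> GetContextWindow(List<ChatTurn> history, int maxContextWindow)
    {
        return history
            .OrderByDescending(t => t.TimeStamp)
            .Take(maxContextWindow)          // e.g. 3 per AZURE_CHAT_MAX_CONTEXT_WINDOW
            .OrderBy(t => t.TimeStamp)       // restore chronological order for the model
            .ToList();
    }
}
```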
Semantic Kernel.cs:
GetRagCompletionAsync():
A new environment variable, MaxRagTokens, manages the amount of text sent to the GPT model. Calls to the ML Tokenizer limit the size of the RAG data.
The function can also limit the context window size based on token usage for the prompt and completion text. That limit is currently a local variable but could probably be promoted to an environment variable as well.
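As a rough illustration of the token-based limit on the context window, the sketch below counts prompt and completion tokens with Microsoft.ML.Tokenizers and stops adding turns once the budget is exceeded. The class, method, and tuple shape are invented for the example, and the TiktokenTokenizer API shown is the current package surface, which may differ from the version used in this PR.

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

public static class ContextTokenBudgetExample
{
    // Tokenizer for the deployed model; the model name is an assumption.
    private static readonly Tokenizer s_tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

    // Walks the context window from newest to oldest and stops once the running
    // prompt + completion token count would exceed the budget (a local variable
    // in this PR, e.g. 1000; could later become OpenAi:MaxContextTokens).
    public static List<(string Prompt, string Completion)> LimitByTokens(
        List<(string Prompt, string Completion)> newestFirst, int maxContextTokens)
    {
        var kept = new List<(string Prompt, string Completion)>();
        int used = 0;

        foreach (var turn in newestFirst)
        {
            int tokens = s_tokenizer.CountTokens(turn.Prompt) + s_tokenizer.CountTokens(turn.Completion);
            if (used + tokens > maxContextTokens)
                break;

            used += tokens;
            kept.Insert(0, turn); // restore chronological order
        }

        return kept;
    }
}
```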
TrimToTokenLimit():
Calls the ML Tokenizer to trim text to the limit set by the max tokens environment variable.
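A minimal sketch of what TrimToTokenLimit() might look like with Microsoft.ML.Tokenizers. Assumptions: the TiktokenTokenizer.CreateForModel API from the current package, the "gpt-4o" model name, and a simple truncate-and-decode strategy; the actual implementation may differ.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.Tokenizers;

public static class TokenTrimmerExample
{
    // Tokenizer for the deployed model; the model name is an assumption.
    private static readonly Tokenizer s_tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

    // Trim text so it encodes to at most maxTokens tokens (e.g. OpenAi:MaxRagTokens = 3000).
    public static string TrimToTokenLimit(string text, int maxTokens)
    {
        IReadOnlyList<int> ids = s_tokenizer.EncodeToIds(text);

        if (ids.Count <= maxTokens)
            return text;

        // Keep only the first maxTokens tokens and decode back to text.
        return s_tokenizer.Decode(ids.Take(maxTokens));
    }
}
```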
Message.cs:
ElaspsedTime:
New property that tracks how long it takes to generate a completion.
The public function CalculateElapsedTime() stops the internal stopwatch and updates the ElaspsedTime property value. It is called from UpdateSessionAndMessageAsync() before the message is inserted in the Cosmos DB transaction.
Refactored token tracking: there are now three token properties: prompt, completion, and generation. This allows more precise tracking of token consumption for just the context window when it is sent to GetRagCompletionAsync(). (An illustrative sketch of the resulting Message shape follows this list.)
Prompt is just tokens for prompt text.
Completion is now just the tokens for the completion text sent back by OpenAI.
Generation is the number of tokens consumed processing the request, including all the RAG data.
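For reference, an illustrative shape for Message.cs under these changes. The property names (PromptTokens, CompletionTokens, GenerationTokens), their types, and the stopwatch placement are assumptions based on the description above; ElaspsedTime mirrors the spelling used in this PR, and the real class has additional Cosmos DB properties.

```csharp
using System.Diagnostics;

// Illustrative shape only; not the actual Message.cs.
public class Message
{
    // Started when the prompt/message is created (assumption).
    private readonly Stopwatch _stopwatch = Stopwatch.StartNew();

    public double ElaspsedTime { get; private set; }   // elapsed milliseconds; spelling mirrors the PR text

    public int PromptTokens { get; set; }       // tokens in the prompt text only
    public int CompletionTokens { get; set; }   // tokens in the completion returned by OpenAI
    public int GenerationTokens { get; set; }   // tokens consumed processing the request, RAG data included

    // Called from UpdateSessionAndMessageAsync() before the Cosmos DB transaction is executed.
    public void CalculateElapsedTime()
    {
        _stopwatch.Stop();
        ElaspsedTime = _stopwatch.Elapsed.TotalMilliseconds;
    }
}
```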
Does this introduce a breaking change?
[X] Yes
[ ] No
Pull Request Type
[X] Bugfix
[X] Feature
[X] Code style update (formatting, local variables)
[X] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:
How to Test
Ideally the app should be fully redeployed, but adding these values to secrets.json should also work:
"AZURE_OPENAI_MAX_RAG_TOKENS": "3000", "AZURE_OPENAI_MAX_CONTEXT_TOKENS": "1000", "AZURE_CHAT_MAX_CONTEXT_WINDOW": "3", "Chat:MaxContextWindow": "3", "OpenAi:MaxRagTokens": "3000", "OpenAi:MaxContextTokens": "1000",
Test the code
What to Check
Verify that the following are valid
Other Information