A Solution Accelerator for the RAG pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences. It covers the most common requirements and best practices.
Motivation

At present, there is an env var AZURE_OPENAI_STREAM which controls streaming when using the byod endpoint, but it is not used at all when using the custom endpoint.

Investigate the feasibility of implementing streamed responses on the custom endpoint. This is probably non-trivial because:

- We would not want to stream the initial LLM response if it is a function call, but would instead stream the response of the function call
- A number of post-response steps occur after the response is generated, such as calling content safety and storing the result in the index
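One way to handle both constraints is to wrap the model's delta stream in a generator that inspects the first delta before forwarding anything to the client, and runs the post-response work once the stream is exhausted. A minimal sketch, where the delta shape, the `run_function_call` hook, and the post-response steps are all hypothetical stand-ins rather than the accelerator's actual API:

```python
from itertools import chain
from typing import Callable, Iterable, Iterator

def stream_with_post_steps(
    deltas: Iterable[dict],
    run_function_call: Callable[[dict], Iterable[str]],
    post_response_steps: Iterable[Callable[[str], None]],
) -> Iterator[str]:
    """Yield response chunks to the client, deferring post-response work.

    If the first delta is a function call, the raw call is not streamed;
    the (possibly streamed) function result is forwarded instead.
    """
    it = iter(deltas)
    first = next(it, None)

    if first is None:
        source: Iterable[str] = []
    elif first.get("function_call"):
        # Constraint 1: don't stream the function-call delta itself;
        # stream the output of executing the call instead.
        source = run_function_call(first["function_call"])
    else:
        source = (d.get("content", "") for d in chain([first], it))

    chunks = []
    for chunk in source:
        chunks.append(chunk)
        yield chunk

    # Constraint 2: post-response steps (content safety, storing the
    # result in the index, ...) run only once the full response exists.
    full_response = "".join(chunks)
    for step in post_response_steps:
        step(full_response)
```

One caveat with this shape: the post-response steps only run if the client consumes the generator to completion, so a mid-stream disconnect would skip them unless the web framework guarantees generator cleanup (e.g. via try/finally).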
Requirements
A list of requirements to consider this feature delivered:

- Streamed responses are supported on the /custom endpoint

Tasks
To be filled in by the engineer picking up the issue