Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Documentation Request: Clarify Backend API Calls for Chat Implementation #2059


sam-h-long commented 1 month ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Similar to #836, I am curious about the backend API calls needed to create a chat experience. I started by running the azd steps in the deploy section of the README.md to get the code locally and set credentials.

Next, within the azure-search-openai-demo/app/backend/ directory, I created a Python virtual environment:

backend % /Users/sl/.pyenv/versions/3.11.10/bin/python -m venv .venv_azure_search_demo_v1
backend % source .venv_azure_search_demo_v1/bin/activate
backend % pip install -r requirements.txt
backend % pip install gunicorn
backend % pip install ddtrace

Lastly, I ran quart to start the backend locally:

backend % quart --app main:app run --port "50505" --host "localhost" --reload
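
As a quick sanity check that the server came up before writing any client code, a plain GET against the root URL works (this assumes the default setup, where the backend also serves the built frontend at "/"):

import httpx

# Sanity check: in the default setup the backend serves the built frontend at "/".
response = httpx.get("http://localhost:50505/")
print(response.status_code)  # expect 200 once the app has started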

Any log messages given by the failure

NA

Expected/desired behavior

Calling the /chat endpoint the following way (unsure if my overrides are redundant):

import httpx
import asyncio

async def call_chat_endpoint():
    url = 'http://127.0.0.1:50505/chat'
    data = {
        "context": {
            "overrides": {"retrieval_mode": "text"}
            # Include any necessary context here
        },
        "messages": [
            {"role": "user", "content": "Where is Paris?"}
        ],
        "session_state": {}  # Include any session state if needed
    }

    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer YOUR_AUTH_TOKEN',  # If you need to include a token
    }

    async with httpx.AsyncClient() as client:
        response = await client.post(url, json=data, headers=headers)
        print(response.json())  # Process the response

# Run the async function
asyncio.run(call_chat_endpoint())

Tracing the application with Datadog, I am fairly sure the response format comes from run_without_streaming():

        chat_app_response = {
            "message": {"content": "Paris is in France and known for its rich history.", "role": "assistant"},
            "context": extra_info,
            "session_state": session_state,
        }

If so, my guess is that in the frontend, the previous "message" is simply appended to the messages list before the next /chat call? For example:

    data = {
        "context": {"overrides": {"retrieval_mode": "text"}},
        "messages": [
            {"role": "user", "content": "Where is Paris?"},
            {"role": "assistant", "content": "Paris is in France and known for its rich history."},
            {"role": "user", "content": "What history are you referring to?"},
        ],
        "session_state": {}  # Include any session state if needed
    }

In other words, I am trying to understand whether the "context" or "session_state" output from run_without_streaming() should also be passed to the next /chat call. Any further documentation of the APIs needed to simulate a chat experience would be great!
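
Concretely, here is the client loop I am imagining; a sketch that assumes the session_state returned by each response should simply be echoed back on the next call (which is exactly the part I am unsure about):

import asyncio
import httpx

async def chat_session(questions):
    url = "http://127.0.0.1:50505/chat"
    messages = []          # full history, rebuilt client-side each turn
    session_state = None   # whatever the previous response returned, if anything

    async with httpx.AsyncClient() as client:
        for question in questions:
            messages.append({"role": "user", "content": question})
            data = {
                "context": {"overrides": {"retrieval_mode": "text"}},
                "messages": messages,
                "session_state": session_state,
            }
            response = await client.post(url, json=data)
            body = response.json()
            # Append the assistant reply so the next turn sees the full history.
            messages.append(body["message"])
            # Carry forward whatever state the backend returned.
            session_state = body.get("session_state")

asyncio.run(chat_session(["Where is Paris?", "What history are you referring to?"]))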

OS and Version?

macOS Sonoma

azd version?


azd version 1.10.3 (commit 0595f33fe948ee6df3da492567e3e7943cb9a733)

Versions

NA

Mention any other details that might be useful

While some additional documentation on the prompt sequences would be useful, overall this is a really great repository. In contrast to sample-app-aoai-chatGPT, I have found the calls made to the Search Client & Azure OpenAI APIs here much easier to follow. Great work and thank you 🙌 🙌

pamelafox commented 1 month ago

The backend follows the protocol described here: https://github.com/microsoft/ai-chat-protocol/tree/main/spec#microsoft-ai-chat-protocol-api-specification-version-2024-05-29

Let me know if you have additional questions after reading that.
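
For quick reference, the earlier examples in this thread map onto that spec roughly as follows; a sketch of one non-streaming exchange using the snake_case field names this backend accepts (the streaming variant is served at /chat/stream and emits one JSON object per line):

# Sketch of one non-streaming /chat exchange per the AI Chat Protocol.
request_body = {
    "messages": [{"role": "user", "content": "Where is Paris?"}],
    "context": {"overrides": {"retrieval_mode": "text"}},
    "session_state": None,  # opaque value; echo back what the last response returned
}
response_body = {
    "message": {"role": "assistant", "content": "Paris is in France..."},
    "context": {},           # citations, thought process, and other extras
    "session_state": None,   # pass this in the next request to continue the session
}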

gaborvar commented 1 month ago

Hi @sam-h-long, I went down a similar path and am happy to share. Very useful codebase on both the frontend and the backend. Some aspects worth noting:

session_state is useful for sending state from the backend to the next backend call, as you say, and the frontend preserves it faithfully. However, it does not get passed all the way to where you may want to reuse it: in chatapproach.py, run_with_streaming() does not pass it on to self.run_until_final_call(), where the model calls actually take place. You either have to move your model calls into run_with_streaming(), or extend the argument list and return value of self.run_until_final_call() to include session_state (see the sketch below); both impact maintainability. This is unfortunate, as the codebase follows a great architecture that would neatly allow many scenarios to coexist, organised into a hierarchy of gradually specialised layers using class extensions. That would let you support new scenarios without forking the code base.

The code base and its interfaces can also support function calls initiated by the model. Currently this is not enabled because the openai-messages-token-helper library does not expect participants in the conversation other than the usual user and assistant, but that is just parameter checking being too strict and can be fixed easily; see https://github.com/pamelafox/openai-messages-token-helper/issues/20
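
A hypothetical sketch of that second option, threading session_state through chatapproach.py; the real methods take more arguments than shown here, so treat the signatures as illustrative only:

# Hypothetical sketch only: the actual methods in chatapproach.py take
# additional arguments (overrides, auth_claims, ...) that are elided here.

async def run_until_final_call(self, messages, overrides, session_state=None):
    # session_state is now available where the model calls are made,
    # e.g. to seed the prompt with state carried over from the previous turn.
    ...

async def run_with_streaming(self, messages, overrides, session_state=None):
    # Pass the state on instead of dropping it.
    extra_info, chat_coroutine = await self.run_until_final_call(
        messages, overrides, session_state=session_state
    )
    ...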

context (from extra_info) is useful to send information from the backend to the frontend. Heavily leveraged for extras like citation links, thought process, anything additional that the model API itself does not inherently handle. The frontend re-generates the message history from the chat UI every time before it makes a call to the backend. This is fair but removes any extra you may want to inform the backend about from the previous backend call. Hence the importance of session_state. It is fine, however separating UI-bound and non-UI-bound data into two constructs requires some extra data manipulation in code. There are other useful calls to the backend. I reuse the file access call which was created to download documents referenced in citations. API is accessible under the /content/[filename] URL. Let me know if you are interested in more sharing.