Azure-Samples / aoai-realtime-audio-sdk

Azure OpenAI code resources for using gpt-4o-realtime capabilities.
MIT License

Azure OpenAI GPT-4o Audio and /realtime: Public Preview Documentation

Welcome to the Public Preview for Azure OpenAI /realtime using gpt-4o-realtime-preview! This repository provides documentation, standalone libraries, and sample code for using /realtime -- applicable to both Azure OpenAI and standard OpenAI v1 endpoint use.

Overview: what's /realtime?

This preview introduces a new /realtime API endpoint for the gpt-4o-realtime-preview model family.

/realtime is built on the WebSockets API to facilitate fully asynchronous streaming communication between the end user and model. It's designed to be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections; it is not designed to be used directly from untrusted end user devices, and device details like capturing and rendering audio data are outside the scope of the /realtime API.

At a summary level, the architecture of an experience built atop /realtime looks something like the following (noting that the user interactions, as previously mentioned, are not part of the API itself):

sequenceDiagram
  actor User as End User
  participant MiddleTier as /realtime host
  participant AOAI as Azure OpenAI
  User->>MiddleTier: Begin interaction
  MiddleTier->>MiddleTier: Authenticate/Validate User
  MiddleTier--)User: audio information
  User--)MiddleTier: 
  MiddleTier--)User: text information
  User--)MiddleTier: 
  MiddleTier--)User: control information
  User--)MiddleTier: 
  MiddleTier->>AOAI: connect to /realtime
  MiddleTier->>AOAI: configure session
  AOAI->>MiddleTier: session start
  MiddleTier--)AOAI: send/receive WS commands
  AOAI--)MiddleTier: 
  AOAI--)MiddleTier: create/start conversation responses
  AOAI--)MiddleTier: (within responses) create/start/add/finish items
  AOAI--)MiddleTier: (within items) create/stream/finish content parts

Note that /realtime is in public preview. API changes, code updates, and occasional service disruptions are expected.

How to get started

Connecting to and authenticating with /realtime

The /realtime API requires an existing Azure OpenAI resource endpoint in a supported region. A full request URI can be constructed by concatenating:

  1. The secure WebSocket (wss://) protocol
  2. Your Azure OpenAI resource endpoint hostname, e.g. my-aoai-resource.openai.azure.com
  3. The openai/realtime API path
  4. An api-version query string parameter for a supported API version -- initially, 2024-10-01-preview
  5. A deployment query string parameter with the name of your gpt-4o-realtime-preview model deployment

Combining into a full example, the following could be a well-constructed /realtime request URI:

wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview-1001
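The five parts listed above can also be assembled in code. The following is a minimal Python sketch; the helper name `build_realtime_uri` is hypothetical and not part of this repository's libraries, and the hostname and deployment name are simply the example values used above.

```python
from urllib.parse import urlencode


def build_realtime_uri(
    hostname: str,
    deployment: str,
    api_version: str = "2024-10-01-preview",
) -> str:
    """Combine protocol, hostname, API path, and query parameters into one URI."""
    # urlencode escapes any characters that are not safe in a query string.
    query = urlencode({"api-version": api_version, "deployment": deployment})
    return f"wss://{hostname}/openai/realtime?{query}"


uri = build_realtime_uri(
    "my-eastus2-openai-resource.openai.azure.com",
    "gpt-4o-realtime-preview-1001",
)
```

With these inputs, `uri` matches the example request URI shown above.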

To authenticate:

API concepts

API details

Once the WebSocket connection session to /realtime is established and authenticated, the functional interaction takes place via sending and receiving WebSocket messages, herein referred to as "commands" to avoid ambiguity with the content-bearing "message" concept already present for inference. These commands each take the form of a JSON object. Commands can be sent and received in parallel and applications should generally handle them both concurrently and asynchronously.

For a full, structured description of request and response commands, see realtime-openapi3.yml. As with other aspects of the public preview, note that the protocol specifics may be subject to change.
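Because commands are JSON objects that flow in both directions at once, one way to structure a caller is as two concurrent loops. The sketch below uses only the Python standard library and deliberately stays transport-agnostic: `send` and `recv` are hypothetical callables standing in for whatever WebSocket client is in use, and the convention that `None` ends each loop is an assumption of this example, not part of the protocol.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, Optional

Command = Dict[str, Any]


async def send_commands(
    send: Callable[[str], Awaitable[None]],
    outgoing: "asyncio.Queue[Optional[Command]]",
) -> None:
    """Serialize queued commands to JSON and send them until None is queued."""
    while True:
        command = await outgoing.get()
        if command is None:
            break
        await send(json.dumps(command))


async def receive_commands(
    recv: Callable[[], Awaitable[Optional[str]]],
    handle: Callable[[Command], None],
) -> None:
    """Parse each received JSON command and pass it to a handler until None arrives."""
    while True:
        raw = await recv()
        if raw is None:
            break
        handle(json.loads(raw))


async def run_session(send, recv, outgoing, handle) -> None:
    """Run the send and receive loops concurrently on one connection."""
    await asyncio.gather(
        send_commands(send, outgoing),
        receive_commands(recv, handle),
    )
```

Keeping the two loops independent means a slow handler for received commands never blocks outgoing audio, and vice versa.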

Session configuration and turn handling mode

Often, the first command sent by the caller on a newly established /realtime session is a session.update payload. This command controls a wide set of input and output behavior; the output and response generation portions can later be overridden via response.create properties, if desired.

One of the key session-wide settings is turn_detection, which controls how data flow is handled between the caller and model:

Transcription of user input audio is opted into via the input_audio_transcription property; specifying a transcription model (whisper-1) in this configuration enables the delivery of conversation.item.input_audio_transcription.completed commands.

An example session.update that configures several aspects of the session, including tools, follows. Note that all session parameters are optional; not everything needs to be configured!

{
  "type": "session.update",
  "session": {
    "voice": "alloy",
    "instructions": "Call provided tools if appropriate for the user's input.",
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "threshold": 0.4,
      "silence_duration_ms": 600,
      "type": "server_vad"
    },
    "tools": [
      {
        "type": "function",
        "name": "get_weather_for_location",
        "description": "gets the weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": [
                "c",
                "f"
              ]
            }
          },
          "required": [
            "location",
            "unit"
          ]
        }
      }
    ]
  }
}
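Since every session parameter is optional, a caller can build the payload from only the settings it cares about. The following Python sketch shows one way to do that; `make_session_update` is a hypothetical helper, and the property names are taken from the JSON example above.

```python
import json
from typing import Any, Dict, List, Optional


def make_session_update(
    *,
    voice: Optional[str] = None,
    instructions: Optional[str] = None,
    input_audio_format: Optional[str] = None,
    transcription_model: Optional[str] = None,
    turn_detection: Optional[Dict[str, Any]] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> str:
    """Return a session.update command as JSON, omitting anything left unset."""
    candidates: Dict[str, Any] = {
        "voice": voice,
        "instructions": instructions,
        "input_audio_format": input_audio_format,
        "input_audio_transcription": (
            {"model": transcription_model} if transcription_model else None
        ),
        "turn_detection": turn_detection,
        "tools": tools,
    }
    # In this sketch None means "not configured", so the key is left out entirely.
    session = {key: value for key, value in candidates.items() if value is not None}
    return json.dumps({"type": "session.update", "session": session})


payload = make_session_update(
    voice="alloy",
    transcription_model="whisper-1",
    turn_detection={"type": "server_vad", "threshold": 0.4, "silence_duration_ms": 600},
)
```

Calling `make_session_update()` with no arguments produces a command with an empty `session` object.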

Summary of commands

See realtime-openapi3.yml for full parameter details.

Requests: commands sent from the caller to the /realtime endpoint

| type | Description |
|---|---|
| Session Configuration | |
| session.update | Configures the connection-wide behavior of the conversation session such as shared audio input handling and common response generation characteristics. This is typically sent immediately after connecting but can also be sent at any point during a session to reconfigure behavior after the current response (if in progress) is complete. |
| Input Audio | |
| input_audio_buffer.append | Appends audio data to the shared user input buffer. This audio will not be processed until an end of speech is detected in the server_vad turn_detection mode or until a manual response.create is sent (in either turn_detection configuration). |
| input_audio_buffer.clear | Clears the current audio input buffer. Note that this will not impact responses already in progress. |
| input_audio_buffer.commit | Commits the current state of the user input buffer to subscribed conversations, including it as information for the next response. |
| Item Management | For establishing history or including non-audio item information |
| conversation.item.create | Inserts a new item into the conversation, optionally positioned according to previous_item_id. This can provide new, non-audio input from the user (like a text message), tool responses, or historical information from another interaction to form a conversation history prior to generation. |
| conversation.item.delete | Removes an item from an existing conversation. |
| conversation.item.truncate | Manually shortens text and/or audio content in a message, which may be useful in situations where faster-than-realtime model generation produced significant additional data that was later skipped by an interruption. |
| Response Management | |
| response.create | Initiates model processing of unprocessed conversation input, signifying the end of the caller's logical turn. server_vad turn_detection mode will automatically trigger generation at end of speech, but response.create must be called in other circumstances (text input, tool responses, none mode, etc.) to signal that the conversation should continue. Note: when responding to tool calls, response.create should be invoked after the response.done command from the model that confirms all tool calls and other messages have been provided. |
| response.cancel | Cancels an in-progress response. |
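The note on response.create and tool calls implies a small amount of state on the caller's side: collect tool calls while a response is in progress, then answer them all and request the next response only after response.done arrives. The Python sketch below illustrates that ordering. `ToolCallTracker` is a hypothetical class, and the item field names used here (`function_call`, `function_call_output`, `name`, `call_id`, `arguments`, `output`) are assumptions that should be checked against realtime-openapi3.yml.

```python
import json
from typing import Any, Callable, Dict, List

Command = Dict[str, Any]


class ToolCallTracker:
    """Defers tool responses until the model signals response.done."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self._tools = tools
        self._pending: List[Dict[str, Any]] = []

    def on_command(self, command: Command) -> List[Command]:
        """Feed one received command; return the commands to send in reply."""
        kind = command.get("type")
        if kind == "response.output_item.done":
            item = command.get("item", {})
            # Assumed item shape: a completed function call carries its
            # name, call_id, and JSON-encoded arguments.
            if item.get("type") == "function_call":
                self._pending.append(item)
            return []
        if kind != "response.done" or not self._pending:
            return []
        replies: List[Command] = []
        for call in self._pending:
            arguments = json.loads(call.get("arguments") or "{}")
            result = self._tools[call["name"]](**arguments)
            replies.append({
                "type": "conversation.item.create",
                "item": {
                    "type": "function_call_output",
                    "call_id": call["call_id"],
                    "output": json.dumps(result),
                },
            })
        self._pending.clear()
        # A single response.create follows all tool outputs.
        replies.append({"type": "response.create"})
        return replies
```

Returning the replies rather than sending them keeps the tracker independent of any particular WebSocket client.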

Responses: commands sent by the /realtime endpoint to the caller

| type | Description |
|---|---|
| session.created | Sent as soon as the connection is successfully established. Provides a connection-specific ID that may be useful for debugging or logging. |
| session.updated | Sent in response to a session.update command, reflecting the changes made to the session configuration. |
| Caller Item Acknowledgement | |
| conversation.item.created | Provides acknowledgement that a new conversation item has been inserted into a conversation. |
| conversation.item.deleted | Provides acknowledgement that an existing conversation item has been removed from a conversation. |
| conversation.item.truncated | Provides acknowledgement that an existing item in a conversation has been truncated. |
| Response Flow | |
| response.created | Notifies that a new response has started for a conversation. This snapshots input state and begins generation of new items. Until response.done signifies the end of the response, a response may create items via response.output_item.added that are then populated via *delta* commands. |
| response.done | Notifies that a response generation is complete for a conversation. |
| rate_limits.updated | Sent immediately after response.done, this provides the current rate limit information reflecting updated status after the consumption of the just-finished response. |
| Item Flow in a Response | |
| response.output_item.added | Notifies that a new, server-generated conversation item is being created; content will then be populated via incremental *delta* commands, with a final response.output_item.done command signifying that item creation has completed. |
| response.output_item.done | Notifies that a new conversation item has completed its addition into a conversation. For model-generated messages, this is preceded by response.output_item.added and *delta* commands which begin and populate the new item, respectively. |
| Content Flow within Response Items | |
| response.content_part.added | Notifies that a new content part is being created within a conversation item in an ongoing response. Until response.content_part.done arrives, content will then be incrementally provided via appropriate *delta* commands. |
| response.content_part.done | Signals that a newly created content part is complete and will receive no further incremental updates. |
| response.audio.delta | Provides an incremental update to a binary audio data content part generated by the model. |
| response.audio.done | Signals that an audio content part's incremental updates are complete. |
| response.audio_transcript.delta | Provides an incremental update to the audio transcription associated with the output audio content generated by the model. |
| response.audio_transcript.done | Signals that the incremental updates to audio transcription of output audio are complete. |
| response.text.delta | Provides an incremental update to a text content part within a conversation message item. |
| response.text.done | Signals that the incremental updates to a text content part are complete. |
| response.function_call_arguments.delta | Provides an incremental update to the arguments of a function call, as represented within an item in a conversation. |
| response.function_call_arguments.done | Signals that incremental function call arguments are complete and that accumulated arguments can now be used in their entirety. |
| User Input Audio | |
| input_audio_buffer.speech_started | When using configured voice activity detection, this command notifies that a start of user speech has been detected within the input audio buffer at a specific audio sample index. |
| input_audio_buffer.speech_stopped | When using configured voice activity detection, this command notifies that an end of user speech has been detected within the input audio buffer at a specific audio sample index. This will automatically trigger response generation when configured. |
| conversation.item.input_audio_transcription.completed | Notifies that a supplementary transcription of the user's input audio buffer is available. This behavior must be opted into via the input_audio_transcription property in session.update. |
| conversation.item.input_audio_transcription.failed | Notifies that input audio transcription failed. |
| input_audio_buffer.committed | Provides acknowledgement that the current state of the user audio input buffer has been submitted to subscribed conversations. |
| input_audio_buffer.cleared | Provides acknowledgement that the pending user audio input buffer has been cleared. |
| Other | |
| error | Indicates that something went wrong while processing data on the session. Includes an error message that provides additional detail. |
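The *delta* and *done* commands described above lend themselves to a simple accumulation pattern on the caller's side. The following Python sketch is illustrative only: `ContentAccumulator` is a hypothetical class, and the payload field names it reads (`item_id`, `content_index`, and a `delta` that is base64-encoded for audio) are assumptions to verify against realtime-openapi3.yml.

```python
import base64
from collections import defaultdict
from typing import Any, Dict, Set, Tuple

PartKey = Tuple[str, int]  # (item_id, content_index)


class ContentAccumulator:
    """Collects incremental text and audio updates per content part."""

    def __init__(self) -> None:
        self.text: Dict[PartKey, str] = defaultdict(str)
        self.audio: Dict[PartKey, bytearray] = defaultdict(bytearray)
        self.finished: Set[PartKey] = set()

    def on_command(self, command: Dict[str, Any]) -> None:
        """Apply one received command; unrelated command types are ignored."""
        kind = command.get("type", "")
        key = (command.get("item_id", ""), command.get("content_index", 0))
        if kind == "response.text.delta":
            self.text[key] += command["delta"]
        elif kind == "response.audio.delta":
            # Assumption: audio deltas arrive as base64 text inside the JSON.
            self.audio[key] += base64.b64decode(command["delta"])
        elif kind == "response.content_part.done":
            # No further incremental updates are expected for this part.
            self.finished.add(key)
```

A caller would pass every received command to `on_command` and read a part's accumulated content once its key appears in `finished`.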

Troubleshooting and FAQ

Best practices and expected patterns are evolving rapidly, and topics represented in this section may quickly become out of date.

I send audio, but see no commands back from the service

Tool calling isn't working or isn't responding

As a single response can feature multiple tool calls, a bit of statefulness is introduced with the tool call/response contract:

Using an audio file as input, I see many responses or my responses get stuck

When sending lengthy audio input significantly faster than real time -- such as from an audio file with natural pauses -- server voice activity detection can trigger many responses in rapid succession, which can make responses unreliable. For such scenarios, it's highly recommended to disable voice activity detection in session.update ("turn_detection": { "type": "none" }, or "turn_detection": null in newer protocol versions) and instead manually invoke response.create once all audio has been transmitted.
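As a rough illustration of that recommendation, the Python sketch below generates the sequence of commands for file-based input: one session.update that disables voice activity detection, a series of input_audio_buffer.append commands, and a single response.create at the end. The function name `file_input_commands` and the chunk size are hypothetical, and the base64-encoded `audio` field on input_audio_buffer.append is an assumption to confirm in realtime-openapi3.yml.

```python
import base64
from typing import Any, Dict, Iterator


def file_input_commands(
    pcm_audio: bytes,
    chunk_bytes: int = 4800,
) -> Iterator[Dict[str, Any]]:
    """Yield the commands for sending pre-recorded audio as one manual turn."""
    # Disable server VAD so pauses in the file do not trigger extra responses.
    # Newer protocol versions may expect "turn_detection": None (null) instead.
    yield {"type": "session.update", "session": {"turn_detection": {"type": "none"}}}
    for start in range(0, len(pcm_audio), chunk_bytes):
        chunk = pcm_audio[start:start + chunk_bytes]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
    # Request a response only after all audio has been transmitted.
    yield {"type": "response.create"}
```

Each yielded dictionary would then be serialized to JSON and sent over the established WebSocket connection.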

What's the long-term plan for library support?

The shortest answer: many details are still TBD.