BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: Support of `gpt-4o-audio-preview` #6289

Open Clad3815 opened 3 days ago

Clad3815 commented 3 days ago

The Feature

We can now send and receive audio data via the OpenAI chat completions endpoint. Here is the test script:

const { OpenAI } = require('openai');
const fs = require('fs');
require('dotenv').config();

const openai = new OpenAI({
    // apiKey: process.env.OPENAI_API_KEY
    apiKey: "sk-1234",
    baseURL: "http://localhost:4000/v1"
});

async function testAudio() {
    // Generate an audio response to the given prompt
    const response = await openai.chat.completions.create({
        model: "gpt-4o-audio-preview",
        modalities: ["text", "audio"],
        audio: { voice: "alloy", format: "wav" },
        messages: [
            {
                role: "user",
                content: "Is a golden retriever a good family dog?"
            }
        ]
    });

    // Inspect returned data
    console.log(response.choices[0]);

    // Write audio data to a file
    fs.writeFileSync(
        "dog.wav",
        Buffer.from(response.choices[0].message.audio.data, 'base64')
    );

}

testAudio();

It doesn't work: the chat completion response doesn't include the "audio.data" field. We also need to check sending audio as input:

// Load a local WAV file (e.g. the one written above) and base64-encode it
// for the input_audio content part
const base64str = fs.readFileSync('dog.wav').toString('base64');

const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this recording?" },
        { type: "input_audio", input_audio: { data: base64str, format: "wav" }}
      ]
    }
  ]
});
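For completeness, here is a rough sketch of what the audio-out request could look like through the litellm Python SDK once this is supported. It is hypothetical: it assumes litellm.completion() forwards the modalities and audio params to OpenAI and exposes the new audio field on the response message, which is exactly what this issue requests.

# Hypothetical sketch, not current litellm behavior: assumes litellm.completion()
# forwards the `modalities` and `audio` params to OpenAI, and that the response
# message exposes the new `audio` field once this issue is implemented.
import base64
import litellm

response = litellm.completion(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Is a golden retriever a good family dog?"}],
)

# The audio payload is base64-encoded WAV, per the OpenAI response format
wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)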

Motivation, pitch

https://platform.openai.com/docs/guides/audio/quickstart?lang=javascript&audio-generation-quickstart-example=audio-out


krrishdholakia commented 2 days ago

Woah, that's great! Yes, definitely

krrishdholakia commented 2 days ago

I believe audio input should work. For audio output it looks like we need to support parsing out a new field -

"message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "audio": {
          "id": "audio_6711b7a624708190866a471c247be0b7",
          "data": "UklGRmZLDQBXQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAATElTVBoAAABJTkZPSVNGVA4AAABMYXZmNTguMjkuMTAwAGRhdGEgSw0ADQAJAAUADgAEAAsACAALAAw
ishaan-jaff commented 2 days ago

Picking this up - updated the ticket with sub-tasks.

krrishdholakia commented 1 day ago

We should also add (in order of priority):

@ishaan-jaff