Azure-Samples / aoai-realtime-audio-sdk

Azure OpenAI code resources for using gpt-4o-realtime capabilities.
MIT License
652 stars 114 forks

Strange Behaviors #77

Open Boetty opened 2 weeks ago

Boetty commented 2 weeks ago

Hi everyone, first of all: I really appreciate this code. It has helped me a lot. Thank you for that.

Issues:

  1. The model sometimes responds that it has been trained until 2023, sometimes only until 2021. Why?

  2. The model sometimes replies that it can't retrieve information from the internet because it's not connected, but other times it works... Why?

  3. The model "stumbles" at the beginning of the audio initialization. When you click Record and make a request, the model starts by giving strange words and responses initially. Only after a certain time does it respond correctly... Why?

Thank you very much,

Stefan

Boetty commented 2 weeks ago

Additional:

The model sometimes reads phone numbers and sentences incorrectly in the audio output, even though they're correct in the text messages. Example text: Phone number: 1234567890, audio output: "1 2 45 789 54." Why?

Boetty commented 2 weeks ago

Update:

To address the initialization issue where random noise or unintended audio data was sent immediately after starting, we made the following changes in main.ts:

Delay in Starting Real-Time Messages: We added a 1-second delay in the start_realtime() function before invoking handleRealtimeMessages(). This delay allows the audio system to stabilize before sending any initial audio data to the model, reducing the chance of random noise being processed as valid input.

Delay in Starting the Audio Recorder: Within the resetAudio() function, a 500-millisecond delay was introduced before starting the actual audio recording. This provides time for the audio recorder to initialize fully and ensures that no random noise or unintentional sounds are captured at the moment of starting the recording.

Noise Filtering in the Audio Buffer: In the processAudioRecordingBuffer() function, we implemented a noise filter by checking if the audio buffer contains meaningful audio data before sending it. By setting a threshold (e.g., >10), the function only processes buffers with valid audio content, preventing low-level noise or silence from being mistakenly interpreted as input.
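One caveat about a per-byte threshold: if the recorder delivers 16-bit little-endian PCM (an assumption, since the thread does not state the sample format), a byte-wise check such as `> 10` lets almost any chunk through, because small negative samples have a high byte of 0xFF. Below is a minimal sketch of a level check on decoded samples instead. The names `pcm16Rms` and `isLikelySpeech` and the default threshold of 500 are hypothetical and not part of the sample code.

```typescript
// Hypothetical helper: root-mean-square level of a chunk of 16-bit PCM.
// Assumes little-endian signed samples; a trailing odd byte is ignored.
function pcm16Rms(bytes: Uint8Array): number {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2);
  if (sampleCount === 0) {
    return 0;
  }
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = view.getInt16(i * 2, true); // true = little-endian
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / sampleCount);
}

// Hypothetical gate: treat a chunk as input only if its RMS level reaches
// the threshold. The default of 500 is an illustrative guess to be tuned.
function isLikelySpeech(bytes: Uint8Array, rmsThreshold: number = 500): boolean {
  return pcm16Rms(bytes) >= rmsThreshold;
}
```

In `processAudioRecordingBuffer()`, a check like `isLikelySpeech(uint8Array)` could take the place of the per-byte comparison. Dropping quiet chunks on the client may also interfere with server-side turn detection, so the threshold is worth testing with real microphone input.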

Example:

```typescript
// main.ts

async function start_realtime() {
  const { endpoint, apiKey, deploymentOrModel } = await fetchConfigFromProxy();

  realtimeStreaming = new LowLevelRTClient(new URL(endpoint), { key: apiKey }, { deployment: deploymentOrModel });

  try {
    await realtimeStreaming.send(createConfigMessage());
  } catch (error) {
    makeNewTextBlock("[Connection error]: Please check the proxy endpoint.", "system-response");
    setFormInputState(InputState.ReadyToStart);
    return;
  }

  // Reset audio recorder and start it with a slight delay to avoid noise
  await resetAudio(true);

  // Delay to ensure initial random signals are not sent immediately
  setTimeout(() => {
    handleRealtimeMessages();
  }, 1000); // 1-second delay
}

async function resetAudio(startRecording: boolean) {
  recordingActive = false;
  if (audioRecorder) {
    audioRecorder.stop();
  }
  if (audioPlayer) {
    audioPlayer.clear();
  }
  audioRecorder = new Recorder(processAudioRecordingBuffer);
  audioPlayer = new Player();
  audioPlayer.init(24000);

  if (startRecording) {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    // Delay before starting recording to avoid initial random noise
    setTimeout(() => {
      audioRecorder.start(stream);
      recordingActive = true;
    }, 500); // 500ms delay
  }
}

function processAudioRecordingBuffer(data: Buffer) {
  const uint8Array = new Uint8Array(data);

  // Check if buffer contains actual audio content (threshold set to filter out noise)
  if (uint8Array.some((sample) => sample > 10)) { // Adjust threshold as needed
    combineArray(uint8Array);
    if (buffer.length >= 4800) {
      const toSend = new Uint8Array(buffer.slice(0, 4800));
      buffer = new Uint8Array(buffer.slice(4800));
      const regularArray = String.fromCharCode(...toSend);
      const base64 = btoa(regularArray);
      if (recordingActive) {
        realtimeStreaming.send({
          type: "input_audio_buffer.append",
          audio: base64,
        });
      }
    }
  }
}
```
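A side note on the two `setTimeout` calls: because neither is awaited, `resetAudio(true)` resolves before the recorder has actually started, and the order of events depends on the two delays lining up. Below is a minimal sketch of an awaitable alternative. `delay` and `startAfterSettling` are hypothetical helper names, not part of the sample code.

```typescript
// Hypothetical helper: a promise that resolves after `ms` milliseconds.
function delay(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: wait for `settleMs`, then run `start` and resolve
// with its result, so callers can await the whole sequence.
async function startAfterSettling<T>(
  start: () => T | Promise<T>,
  settleMs: number,
): Promise<T> {
  await delay(settleMs);
  return await start();
}
```

Inside `resetAudio()`, the recorder start could then be written as `await startAfterSettling(() => { audioRecorder.start(stream); recordingActive = true; }, 500);`, so that `await resetAudio(true)` only returns once recording is running.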