[Call Automation] Played text-to-speech audios are not present in the final recording audio of a call

estebanz01 commented 4 weeks ago

Package Name: azure.communication.callautomation
Package Version: 1.2.0
Operating System: Ubuntu 20.24
Python Version: 3.11

Describe the bug: Recordings created as WAV (mixed or unmixed channels) or as MP3 (mixed channels) do not include the audio played via the Call Automation API, so the final audio contains only the phone number participant's audio.

To Reproduce: Steps to reproduce the behavior:

1. Make sure you have a server that listens to the different call events and a blob storage container to store the audio output.
2. When Microsoft.Communication.CallConnected is received, start a recording.
3. After the recording has started, ask for input by passing some text to be played. Do it a couple of times to get a meaningful audio length.
4. Stop recording.

from azure.communication.callautomation import (
    AzureBlobContainerRecordingStorage,
    PhoneNumberIdentifier,
    RecognizeInputType,
    RecordingChannel,
    RecordingContent,
    RecordingFormat,
    ServerCallLocator,
    TextSource,
)

# call_automation_client is an existing CallAutomationClient instance.
# Start the recording once Microsoft.Communication.CallConnected is received.
recording_response = call_automation_client.start_recording(
    call_locator=ServerCallLocator(server_call_id),
    recording_content_type=RecordingContent.Audio,
    recording_channel_type=RecordingChannel.Unmixed,
    recording_format_type=RecordingFormat.Wav,
    recording_storage=AzureBlobContainerRecordingStorage(container_url="<url-to-container>")
)

call_connection = call_automation_client.get_call_connection(call_connection_id)

# Play the prompt and recognise speech input
call_connection.start_recognizing_media(
    input_type=RecognizeInputType.SPEECH,
    target_participant=PhoneNumberIdentifier(self.caller_id),  # caller_id is the DID we receive from the event at CallConnected.
    interrupt_call_media_operation=True,
    play_prompt=TextSource(text="Hola! this is a prompt. Please say anything or interrupt me!", voice_name="en-US-AriaNeural"),  # Same happens with SSML sources
    interrupt_prompt=True)

# Play something to every participant
call_connection.play_media_to_all(TextSource(text="Adiós! nos vemos pronto.", voice_name="en-US-AriaNeural"))

# Goodbye: end the call for everyone
call_connection.hang_up(is_for_everyone=True)

Expected behavior: A recorded audio file that contains both what was played as input sources and what the user said.
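One way to check a downloaded recording against this expectation is to look at the peak level of each channel. The following is only a minimal sketch using the standard library; it assumes the recording is a 16-bit PCM WAV file, and `channel_peaks` is a hypothetical helper name, not part of the SDK. A channel whose peak is 0 (or close to it) carries no audible content.

```python
import array
import sys
import wave


def channel_peaks(path: str) -> list[int]:
    """Return the peak absolute sample value of each channel in a WAV file.

    Assumption: the file holds 16-bit PCM samples, interleaved by channel.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved, so channel ch is every n_channels-th sample.
    return [
        max((abs(s) for s in samples[ch::n_channels]), default=0)
        for ch in range(n_channels)
    ]
```

For an unmixed recording, comparing the per-channel peaks shows quickly whether only the phone participant's channel contains audio.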

Additional context: I tried to specify the participants manually in the channel ordering property, but I always get the following errors, and the call is dropped because the recording cannot start.

INFO - error in event handling [Unable to build a model: Unable to deserialize response data. Data: [<azure.communication.identity._shared.models.CommunicationUserIdentifier object at 0x7fbe180deab0>, <azure.communication.callautomation._shared.models.PhoneNumberIdentifier object at 0x7fbe1727d3d0>], [CommunicationIdentifierModel]]

ERROR - Error in line #231 Msg: Unable to build a model: Unable to deserialize response data. Data: [<azure.communication.identity._shared.models.CommunicationUserIdentifier object at 0x7fbe180deab0>, <azure.communication.callautomation._shared.models.PhoneNumberIdentifier object at 0x7fbe1727d3d0>], [CommunicationIdentifierModel]

github-actions[bot] commented 4 weeks ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @acsdevx-msft.

estebanz01 commented 3 weeks ago

OK, I noticed that if I use call_connection.play_media(play_to="all", play_source=play_source), those prompts get recorded properly. It just doesn't record the prompts that are played when calling call_connection.start_recognizing_media 😭
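Based on that observation, a possible workaround is to play the prompt with play_media first and start recognition without play_prompt once the prompt has finished. The following is only a sketch under assumptions: `handle_event` and the two context strings are hypothetical names, events are assumed to arrive as dicts with `type` and `data` keys, and the prompt's completion is detected through the `operationContext` field of the Microsoft.Communication.PlayCompleted event. Only `play_media` and `start_recognizing_media` are SDK calls here.

```python
# Hypothetical operation contexts used to tell our own operations apart.
PROMPT_CONTEXT = "prompt-before-recognize"
RECOGNIZE_CONTEXT = "recognize-after-prompt"


def handle_event(event, call_connection, prompt_source, target_participant, input_type):
    """Play the prompt via play_media, then recognise after PlayCompleted.

    call_connection is the object returned by get_call_connection();
    prompt_source is e.g. a TextSource, input_type e.g. RecognizeInputType.SPEECH.
    Returns a short status string, purely for illustration.
    """
    event_type = event.get("type")
    data = event.get("data", {})

    if event_type == "Microsoft.Communication.CallConnected":
        # Play the prompt as a normal media operation so it ends up in the recording.
        call_connection.play_media(
            play_source=prompt_source,
            play_to="all",
            operation_context=PROMPT_CONTEXT,
        )
        return "prompt_started"

    if (
        event_type == "Microsoft.Communication.PlayCompleted"
        and data.get("operationContext") == PROMPT_CONTEXT
    ):
        # The prompt is done: recognise without play_prompt.
        call_connection.start_recognizing_media(
            input_type=input_type,
            target_participant=target_participant,
            operation_context=RECOGNIZE_CONTEXT,
        )
        return "recognize_started"

    return "ignored"
```

The trade-off is that the caller can no longer interrupt the prompt by speaking, since interrupt_prompt only applies to a prompt played by start_recognizing_media itself.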

estebanz01 commented 1 week ago

Based on my current communication with Microsoft via a support ticket, they say this is "an expected behaviour" rather than a bug 🤣. I asked for more clarification in that regard. It really puzzles me how this behaviour is a feature, not a bug 🤷

Azure / azure-sdk-for-python

[Call Automation] Played text-to-speech audios are not present in the final recording audio of a call #36880