Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Compatibility issue with generated speech audio in WAV format #1450

Closed mudssrali closed 2 years ago

mudssrali commented 2 years ago

We are going to rely heavily on the Azure Cognitive Services text-to-speech API to convert text to speech. We then pass the generated speech audio (WAV) to an IVR service, which requires audio files to have the following codec configuration:

Codec: PCMS16LE (araw)
Channel: Mono
Sample Rate: 8000
Bits per Sample: 16

To match this configuration, we used the following output formats:

Riff8Khz16BitMonoPcm | Raw8Khz16BitMonoPcm (file format error) 

We have double-checked that the generated speech audio file has the same configuration the IVR service requires. However, when we upload the generated speech audio to the IVR, it responds with an error, whereas any other WAV file generated by audio tools works fine. See the attached audio files.

We used an online tool (metadata2go) to inspect the metadata of both files, but we could not spot any difference. Here is the metadata link for both files.

To dig further into this issue, we passed the Azure-generated speech audio file through a tool called 3CX to convert it to WAV format again. This time the IVR service accepts the converted file, and it works fine.

Here is the converted file (success.wav ---> converted_success.wav): Converted - Azure Speech File

Using the metadata tool, we can compare the Raw Header of each audio file to see what changed after converting the Azure speech audio file.

Raw Header (success.wav)

52 49 46 46 00 00 00 00 57 41 56 45 66 6D 74 20 10 00 00 00 01 00 01 00 40 1F 00 00 80
3E 00 00 02 00 10 00 64 61 74 61 BE 1B 03 00 00 00 FF FF 00 00 01 00 FF FF 00 00 00 00
00 00 00 00 FF FF 00 00 FE FF FF FF 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00

Raw Header (converted_success.wav)

52 49 46 46 E2 1B 03 00 57 41 56 45 66 6D 74 20 10 00 00 00 01 00 01 00 40 1F 00 00 80 3E
00 00 02 00 10 00 64 61 74 61 BE 1B 03 00 00 00 FF FF 00 00 01 00 FF FF 00 00 00 00 00 00
00 00 FF FF 00 00 FE FF FF FF 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00

Raw Header (Audio Tool - 3thKisaan.wav)

52 49 46 46 88 73 07 00 57 41 56 45 66 6D 74 20 12 00 00 00 01 00 01 00 40 1F 00 00 80 3E 00
00 02 00 10 00 00 00 64 61 74 61 62 73 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00
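The three dumps above are easier to compare once the header fields are decoded. The following is a minimal sketch, not part of the original report: `parse_wav_header` is a hypothetical helper name, and it assumes the standard RIFF/WAVE layout (little-endian sizes, a `fmt ` chunk followed eventually by a `data` chunk). The hex strings are copied from the first bytes of each dump.

```python
import struct

def parse_wav_header(data: bytes) -> dict:
    """Decode the RIFF size, the fmt fields and the data size from WAV header bytes."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE header")
    info = {"riff_size": struct.unpack_from("<I", data, 4)[0]}
    offset = 12  # first sub-chunk starts right after "RIFF", size, "WAVE"
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, offset)
        body = offset + 8
        if chunk_id == b"fmt ":
            (info["format_tag"], info["channels"], info["sample_rate"],
             info["byte_rate"], info["block_align"],
             info["bits_per_sample"]) = struct.unpack_from("<HHIIHH", data, body)
        elif chunk_id == b"data":
            info["data_size"] = chunk_size
            info["data_offset"] = body
            break
        # Chunks are padded to an even number of bytes.
        offset = body + chunk_size + (chunk_size & 1)
    return info

# Leading bytes of each Raw Header dump, up to and including the data chunk size.
AZURE_HEADER = bytes.fromhex(
    "52 49 46 46 00 00 00 00 57 41 56 45 66 6D 74 20 10 00 00 00 01 00 01 00"
    " 40 1F 00 00 80 3E 00 00 02 00 10 00 64 61 74 61 BE 1B 03 00")
CONVERTED_HEADER = bytes.fromhex(
    "52 49 46 46 E2 1B 03 00 57 41 56 45 66 6D 74 20 10 00 00 00 01 00 01 00"
    " 40 1F 00 00 80 3E 00 00 02 00 10 00 64 61 74 61 BE 1B 03 00")
AUDIO_TOOL_HEADER = bytes.fromhex(
    "52 49 46 46 88 73 07 00 57 41 56 45 66 6D 74 20 12 00 00 00 01 00 01 00"
    " 40 1F 00 00 80 3E 00 00 02 00 10 00 00 00 64 61 74 61 62 73 07 00")

for name, raw in [("success.wav", AZURE_HEADER),
                  ("converted_success.wav", CONVERTED_HEADER),
                  ("3thKisaan.wav", AUDIO_TOOL_HEADER)]:
    h = parse_wav_header(raw)
    print(name, h["riff_size"], h["data_size"], h["data_offset"])
# success.wav 0 203710 44
# converted_success.wav 203746 203710 44
# 3thKisaan.wav 488328 488290 46
```

All three files decode to PCM (format tag 1), mono, 8000 Hz, 16 bits per sample. The only field that separates the Azure file from its converted copy is the RIFF chunk size, which is 0 in the Azure file and 203746 (data size plus 36) after conversion.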

We can clearly see that after conversion, bytes 4 to 7 of the Raw Header (the RIFF chunk size field) changed from 00 00 00 00 to E2 1B 03 00. How can we resolve this with Azure Cognitive Services?
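One possible workaround, sketched here under assumptions, is to rewrite that size field after synthesis instead of running the file through a converter. The thread does not confirm that the zero RIFF size is what the IVR rejects; it is only the one header difference between the rejected and the accepted file. `fix_riff_size` is a hypothetical helper name, and the sketch assumes the whole file fits in memory and that the `data` chunk size is already correct.

```python
import struct

def fix_riff_size(data: bytes) -> bytes:
    """Return a copy of the WAV bytes with the RIFF chunk size set to len(data) - 8."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    patched = bytearray(data)
    # The RIFF size counts everything after the 8-byte "RIFF" + size prefix.
    struct.pack_into("<I", patched, 4, len(data) - 8)
    return bytes(patched)

# Header bytes copied from the success.wav dump, followed by placeholder
# silence standing in for the 203710 bytes of real samples.
header = bytes.fromhex(
    "52 49 46 46 00 00 00 00 57 41 56 45 66 6D 74 20 10 00 00 00 01 00 01 00"
    " 40 1F 00 00 80 3E 00 00 02 00 10 00 64 61 74 61 BE 1B 03 00")
audio = header + bytes(203710)

patched = fix_riff_size(audio)
print(patched[4:8].hex(" "))  # e2 1b 03 00
```

The patched bytes match the size field in converted_success.wav. For a real file, the same function could be applied to the bytes read from disk and the result written back out before uploading to the IVR.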

yulin-li commented 2 years ago

Duplicate of https://github.com/microsoft/cognitive-services-speech-sdk-js/issues/513; closing here.