Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Error using .ogg/.mp3 audio files for Python SDK #2473

Open g-tyagi opened 3 days ago

g-tyagi commented 3 days ago

Issue

The Azure Speech to text API supports the audio/ogg Content-Type, but I am unable to transcribe .ogg or .mp3 files with the Python SDK.

Are these formats supported by the Python Speech SDK? If so, what parameters are required?

Error log

[CALL STACK BEGIN]

3   libMicrosoft.CognitiveServices.Spee 0x0000000106a553e8 _ZN9Microsoft17CognitiveServices6Speech4Impl17CSpxWavFileReader23FindFormatAndDataChunksEv + 1052
4   libMicrosoft.CognitiveServices.Spee 0x0000000106a54830 _ZN9Microsoft17CognitiveServices6Speech4Impl17CSpxWavFileReader9GetFormatEPNS2_15SPXWAVEFORMATEXEt + 84
5   libMicrosoft.CognitiveServices.Spee 0x0000000106a60d60 _ZNK9Microsoft17CognitiveServices6Speech4Impl41ISpxAudioSourceControlAdaptsAudioPumpImplINS2_26CSpxFileAudioSourceAdapterEE9GetFormatEv + 136
6   libMicrosoft.CognitiveServices.Spee 0x0000000106a5b300 _ZThn56_NK9Microsoft17CognitiveServices6Speech4Impl22CSpxAudioSourceWrapper9GetFormatEv + 68
7   libMicrosoft.CognitiveServices.Spee 0x0000000106ba8fa4 _ZThn8_N9Microsoft17CognitiveServices6Speech4Impl20CSpxAudioSessionShim9GetFormatEv + 112
8   libMicrosoft.CognitiveServices.Spee 0x0000000106af73e4 _ZN9Microsoft17CognitiveServices6Speech4Impl22CSpxAudioStreamSession33SetAudioConfigurationInPropertiesEv + 56
9   libMicrosoft.CognitiveServices.Spee 0x0000000106af7070 _ZN9Microsoft17CognitiveServices6Speech4Impl22CSpxAudioStreamSession12InitFromFileEPKc + 364
10  libMicrosoft.CognitiveServices.Spee 0x0000000106b3c2c4 _ZN9Microsoft17CognitiveServices6Speech4Impl20CSpxSpeechApiFactory31InitSessionFromAudioInputConfigENSt3__110shared_ptrINS2_26ISpxAudioStreamSessionInitEEENS5_INS2_15ISpxAudioConfigEEE + 624
11  libMicrosoft.CognitiveServices.Spee 0x0000000106b39d4c _ZN9Microsoft17CognitiveServices6Speech4Impl20CSpxSpeechApiFactory45CreateTranslationRecognizerFromConfigInternalENSt3__110shared_ptrINS2_15ISpxAudioConfigEEE + 552
12  libMicrosoft.CognitiveServices.Spee 0x0000000106b3a26c _ZThn48_N9Microsoft17CognitiveServices6Speech4Impl20CSpxSpeechApiFactory37CreateTranslationRecognizerFromConfigENSt3__110shared_ptrINS2_15ISpxAudioConfigEEE + 68
13  libMicrosoft.CognitiveServices.Spee 0x00000001068e62cc recognizer_create_translation_recognizer_from_config + 1392
14  libffi.dylib                        0x000000019ea84050 ffi_call_SYSV + 80
15  libffi.dylib                        0x000000019ea8cae0 ffi_call_int + 1212
16  _ctypes.cpython-39-darwin.so        0x000000010520b3ec PyInit__ctypes + 25288
17  _ctypes.cpython-39-darwin.so        0x0000000105203f6c _ctypes.cpython-39-darwin.so + 16236
18  Python3                             0x000000010596cd58 _PyObject_Call + 172
19  Python3                             0x0000000105a3be50 _PyEval_EvalFrameDefault + 23428
[CALL STACK END]

Exception with an error code: 0xa (SPXERR_INVALID_HEADER)
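
The call stack above passes through CSpxWavFileReader::FindFormatAndDataChunks, which suggests that a file given by name is parsed as WAV, so SPXERR_INVALID_HEADER would be consistent with a file that does not start with a RIFF/WAVE header. As a rough way to check what a file actually contains, the sketch below inspects its first bytes. The function names `sniff_container` and `read_header` are hypothetical helpers for illustration, not part of the Speech SDK, and the checks only cover the common WAV, Ogg, and MP3 signatures.

```python
def sniff_container(header: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    # WAV: "RIFF" at offset 0 and "WAVE" at offset 8.
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # Ogg pages start with the capture pattern "OggS".
    if header[:4] == b"OggS":
        return "ogg"
    # MP3: either an ID3v2 tag or an MPEG frame sync (11 set bits).
    if header[:3] == b"ID3":
        return "mp3"
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"


def read_header(path: str, size: int = 12) -> bytes:
    """Read the first `size` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(size)
```

For example, `sniff_container(read_header("filename.ogg"))` returning anything other than "wav" would indicate that the file needs the compressed-audio input path rather than `AudioConfig(filename=...)`.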

Code snippet

    import azure.cognitiveservices.speech as speechsdk

    speech_translation_config = speechsdk.translation.SpeechTranslationConfig(subscription="", region="")
    speech_translation_config.speech_recognition_language = "en-US"
    speech_translation_config.add_target_language("es")
    # The compressed file is passed by name; creating the recognizer below raises SPXERR_INVALID_HEADER.
    audio_config = speechsdk.audio.AudioConfig(filename="filename.ogg")
    translation_recognizer = speechsdk.translation.TranslationRecognizer(translation_config=speech_translation_config, audio_config=audio_config)
    translation_recognition_result = translation_recognizer.recognize_once_async().get()

pankopon commented 3 days ago

Try the attached example: recognize_compressed_audio.zip. Replace YourSubscriptionKey and YourServiceRegion in the script with valid values. Note that GStreamer must be installed for this to work; see https://learn.microsoft.com/azure/ai-services/speech-service/how-to-use-codec-compressed-audio-input-streams?pivots=programming-language-python#gstreamer-configuration
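
The contents of the attached zip are not shown here, so the following is only a minimal sketch of the compressed-audio approach, not the attached script. It assumes GStreamer is installed, that the .ogg file contains Opus audio (hence OGG_OPUS), and that pushing the whole file into a `PushAudioInputStream` before calling `recognize_once_async()` is acceptable for a short clip. The helper names `container_format_name` and `translate_compressed_file` are hypothetical.

```python
import os

# File extension -> member name of speechsdk.AudioStreamContainerFormat.
# Assumption: .ogg files hold Opus audio. Unknown extensions fall back to
# "ANY", which leaves format detection to GStreamer.
CONTAINER_BY_EXTENSION = {".ogg": "OGG_OPUS", ".mp3": "MP3"}


def container_format_name(filename: str) -> str:
    """Return the container format member name for a file name."""
    ext = os.path.splitext(filename)[1].lower()
    return CONTAINER_BY_EXTENSION.get(ext, "ANY")


def translate_compressed_file(filename: str, subscription: str, region: str):
    """Translate the first utterance of a compressed audio file (en-US to es)."""
    # Imported here so the helper above works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    # Declare the stream as compressed instead of passing the file by name.
    container = getattr(speechsdk.AudioStreamContainerFormat,
                        container_format_name(filename))
    stream_format = speechsdk.audio.AudioStreamFormat(
        compressed_stream_format=container)
    push_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=subscription, region=region)
    config.speech_recognition_language = "en-US"
    config.add_target_language("es")
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config, audio_config=audio_config)

    # Feed the raw compressed bytes to the SDK, then signal end of stream.
    with open(filename, "rb") as f:
        while chunk := f.read(4096):
            push_stream.write(chunk)
    push_stream.close()

    return recognizer.recognize_once_async().get()
```

A call such as `translate_compressed_file("filename.ogg", "YourSubscriptionKey", "YourServiceRegion")` would return the SDK's result object. Since `recognize_once_async()` stops after the first utterance, longer files would need continuous recognition instead.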