Azure Speech-To-Text: Chinese accuracy gap between Rest API vs. Java SDK for short audio

wukun2015 commented 3 years ago

Hello,

I have a lot of short Chinese audio WAV files, about 5 seconds each. When I transcribe them with the Azure Speech-to-Text REST API and with the Java SDK, the REST API's recognition accuracy is consistently slightly worse than the Java SDK's, though the gap is less than 1% CER (Character Error Rate).

My comparison uses the same region for both, and the gap is consistent. The regions I tried are chinaeast2 and eastasia.
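For readers who want to reproduce this kind of comparison, here is a minimal sketch of how CER is commonly computed: character-level edit distance divided by the number of reference characters. The function names are illustrative and are not taken from the script used in this thread, and real evaluations usually normalize text (punctuation, digits, whitespace) before scoring, which this sketch does not do.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one hypothesis against one reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def corpus_cer(pairs) -> float:
    """CER over many (reference, hypothesis) pairs, weighted by reference length."""
    pairs = list(pairs)
    total_errors = sum(edit_distance(r, h) for r, h in pairs)
    total_chars = sum(len(r) for r, _ in pairs)
    return total_errors / total_chars
```

Running `corpus_cer` once on the REST transcripts and once on the SDK transcripts, against the same references, gives the two numbers being compared here.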

REST API:

https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text#speech-to-text-rest-api-for-short-audio
url='https://eastasia.stt.speech.azure.cn/speech/recognition/conversation/cognitiveservices/v1?language=zh-CN'
headers = { 'Accept': 'application/json;text/xml', 'Content-Type': 'audio/wav;codecs="audio/pcm";samplerate=16000', 'Ocp-Apim-Subscription-Key': , 'format': 'detailed' }
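
One detail worth checking in the request above: as far as the linked documentation describes it, `format` is a query parameter of the short-audio endpoint rather than an HTTP header. The sketch below only builds the URL and headers (it sends nothing) with `format` moved into the query string. The function name and the `host` and `key` parameters are hypothetical; the path and the other header values are copied from the post.

```python
from urllib.parse import urlencode


def build_short_audio_request(host, key, language="zh-CN", result_format="detailed"):
    """Return (url, headers) for a short-audio recognition request.

    `host` and `key` are placeholders supplied by the caller.
    """
    # `language` and `format` both go in the query string.
    query = urlencode({"language": language, "format": result_format})
    url = (f"https://{host}/speech/recognition/conversation/"
           f"cognitiveservices/v1?{query}")
    headers = {
        "Accept": "application/json;text/xml",
        "Content-Type": 'audio/wav;codecs="audio/pcm";samplerate=16000',
        "Ocp-Apim-Subscription-Key": key,
    }
    return url, headers
```

Whether this has any effect on accuracy is not established in this thread; it only affects which fields come back in the response.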

Java SDK (I'm using version 1.12.1):

https://docs.microsoft.com/en-us/java/api/com.microsoft.cognitiveservices.speech.speechrecognizer.startcontinuousrecognitionasync?view=azure-java-stable#com_microsoft_cognitiveservices_speech_SpeechRecognizer_startContinuousRecognitionAsync__
Method: SpeechConfig.fromSubscription("", "eastasia")

Is such an accuracy gap expected?

Thank you.

BrianMouncer commented 3 years ago

I would expect the two API surfaces to have almost identical accuracy. As long as you use the same endpoints and locales, they basically use the same backend service. Any differences would most likely come from the front end: how you stream audio to the REST service, and the retry logic for disconnections and error handling. The Java SDK has more robust error handling: it will actually back up in the audio buffers, reconnect, and replay audio to recover from some minor errors. Most people do not do that in the code they write to call the REST API, so you will either error out or drop some results.
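
To make the retry point concrete, here is a minimal sketch of whole-request retry with exponential backoff around a REST call. It is not the SDK's mechanism (the SDK replays buffered audio over its existing connection logic); for a short clip, resending the entire file is the closest simple equivalent. `call_with_retries` and its parameters are illustrative names, and `send` stands for whatever function performs one REST request and raises on failure.

```python
import time


def call_with_retries(send, max_attempts=10, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it succeeds or `max_attempts` is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    `sleep` is injectable so the helper can be exercised without real delays.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception as exc:  # narrow this to network/HTTP errors in real code
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

When comparing accuracy, it is worth counting the utterances that still fail after all attempts, since an empty result is scored as a full set of deletion errors.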

Do you happen to have any audio files and the expected transcriptions that you have seen this discrepancy with?

wukun2015 commented 3 years ago

Thanks for clarifying, @BrianMouncer.

In my REST API calls, I retry up to 10 times. Here is the script I'm running, for your reference: https://github.com/speechio/leaderboard/blob/master/models/microsoft_api_zh/asr_api.py

Yes, I'm glad to share the test data. Given the large data size, could you share your contact details so we can discuss how to transfer it?

Thank you.

BrianMouncer commented 3 years ago

@wukun2015

Thanks for sending me the audio data and your test results. I suspected that the audio bit rate or number of channels might be different, since the Speech SDK does a better job of automatically reading the RIFF format and sending it along to the service, so a mismatch is an easy mistake to overlook with the REST API. However, the files are all the same format: 16-bit, 16 kHz mono, which is the format that works best for us. So that did not explain it.
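
For anyone wanting to run the same sanity check on their own files, here is a minimal sketch using Python's standard `wave` module to read the channel count, sample width, and sample rate from the RIFF header. The function names are illustrative. Note that `wave` only handles PCM WAV files and raises `wave.Error` on other encodings.

```python
import wave


def wav_format(source):
    """Read basic format fields from a PCM WAV file (path or binary file object)."""
    with wave.open(source, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bits_per_sample": w.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate_hz": w.getframerate(),
        }


def is_16bit_16khz_mono(source):
    """True if the file matches the 16-bit, 16 kHz, mono format discussed above."""
    return wav_format(source) == {
        "channels": 1,
        "bits_per_sample": 16,
        "sample_rate_hz": 16000,
    }
```

If a file does not match, the `samplerate` declared in the REST `Content-Type` header would no longer describe the audio actually sent.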

One of our other devs pointed me to this answer about the REST APIs from last year, where a customer was getting slightly different results from REST and the SDK even when using the same custom speech models.

The summary of that answer is that, even with the same models loaded, the two services allocate slightly different cloud resources and apply slightly different post-processing, so their results are not expected to be exactly the same. There is work on the REST API to make those differences more configurable, but I do not know the status of those improvements. @wolfma61 might be able to give you more details on that work.

https://stackoverflow.microsoft.com/questions/185766/

If you have some specific examples of recognitions that are wrong in both cases, I would be happy to forward them to our modeling team to see if they can do anything to improve the global models in those areas.

You can also consider building custom models specifically for the subject domain you are trying to recognize, so that you can adapt the generic global models to perform better on the audio data you are specializing in.

wolfma61 commented 3 years ago

Unfortunately, the service-internal pipelines behind the REST API and the SDK (WebSocket connection) are not identical. There are slight differences between the two pipelines in processing and audio conversion. This can lead to slightly different results, even though the same model is being used.

The REST pipeline is the 'older' pipeline. There is an internal work item to unify the pipelines, to reduce code duplication and get identical results. It isn't scheduled yet, but should happen in the next several months.

pankopon commented 3 years ago

Closed as answered, since there have been no further updates in two months and there is no specific ETA for service changes.

Azure-Samples / cognitive-services-speech-sdk #1230