Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

[es-US] Bilingual model barely recognizes Spanish word "Sí" and recognizes "See" or even "C" instead #2213

Open RomanValov opened 8 months ago

RomanValov commented 8 months ago

Describe the bug

We use the Speech SDK to drive a bot that answers calls and guides callers through a series of questions before connecting them to a person. The bot handles calls in English and Spanish. Its questions usually call for simple answers such as Yes (Sí) or No (No). Starting around December 2023, we noticed that callers were having trouble getting through the bot's questions. According to our logs, when a caller said "Sí", the Speech SDK returned an empty string, "See", or even "C". Recognition also usually takes longer when the word to be recognized is "Sí".

Since we use the "es-US" model to recognize Spanish, we presume the root cause is the introduction of the bilingual feature for this locale. For now we have switched to the "es-MX" locale, which works well, but we would like to return to the "es-US" model.

To Reproduce

To reproduce the issue, I collected several samples that readily show the differences between the "es-US" and "es-MX" models. I ran the tests with samples/java/jre/console, sample 6 (recognitionWithAudioStreamAsync). The only modification was setting the locale when instantiating SpeechRecognizer.

Here are the samples and relevant recognition logs for es-MX and es-US locales:

sample0.tar.gz

es-mx

RECOGNIZING: Text=español
RECOGNIZED: Text=Español.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
RECOGNIZING: Text=2
RECOGNIZING: Text=2 cuatro
RECOGNIZING: Text=247
RECOGNIZING: Text=2471
RECOGNIZED: Text=2471.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
CANCELED: Reason=EndOfStream

es-us

RECOGNIZING: Text=espanol
RECOGNIZED: Text=Espanol.
RECOGNIZED: Text=See.
RECOGNIZING: Text=2
RECOGNIZING: Text=2 cuatro
RECOGNIZING: Text=247
RECOGNIZING: Text=2471
RECOGNIZED: Text=2471.
RECOGNIZED: Text=See.
CANCELED: Reason=EndOfStream

sample1.tar.gz

es-mx

RECOGNIZING: Text=español
RECOGNIZED: Text=Español.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
RECOGNIZING: Text=2 cuatro
RECOGNIZING: Text=247
RECOGNIZING: Text=2471
RECOGNIZED: Text=2471.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
CANCELED: Reason=EndOfStream

es-us

RECOGNIZING: Text=espanol
RECOGNIZED: Text=Espanol.
RECOGNIZED: Text=
RECOGNIZING: Text=2 cuatro
RECOGNIZING: Text=247
RECOGNIZING: Text=2471
RECOGNIZED: Text=2471.
RECOGNIZING: Text=see
RECOGNIZED: Text=See.
CANCELED: Reason=EndOfStream

sample2.tar.gz

es-mx

RECOGNIZING: Text=español
RECOGNIZED: Text=Español.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
RECOGNIZING: Text=2
RECOGNIZING: Text=25
RECOGNIZING: Text=254
RECOGNIZING: Text=2540
RECOGNIZED: Text=2540.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
CANCELED: Reason=EndOfStream

es-us

RECOGNIZING: Text=espanol
RECOGNIZED: Text=Espanol.
RECOGNIZING: Text=C
RECOGNIZED: Text=C.
RECOGNIZING: Text=2
RECOGNIZING: Text=25
RECOGNIZING: Text=254
RECOGNIZING: Text=2540
RECOGNIZED: Text=2540.
RECOGNIZING: Text=C
RECOGNIZED: Text=C.
RECOGNIZING: Text=play
CANCELED: Reason=EndOfStream

It's worth noting that the recognition issues reproduce consistently when the caller's audio has passed through a chain of telephony systems and format conversions (as in the examples above). However, when I recorded the voice locally and fed it directly to Speech SDK recognition, I could not reproduce the issue. The sample below is an example of this. Besides correctly recognizing the word "Sí", the es-US model also recognizes "Español" as a Spanish word (with "ñ"), as opposed to the other cases, where the word is recognized as "Espanol" (with "n").

sample3.tar.gz

es-mx

RECOGNIZING: Text=españ
RECOGNIZING: Text=español
RECOGNIZED: Text=Español.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
RECOGNIZING: Text=2
RECOGNIZING: Text=2 cuatro
RECOGNIZING: Text=2 cuatro si
RECOGNIZING: Text=247
RECOGNIZING: Text=2471
RECOGNIZED: Text=2471.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
CANCELED: Reason=EndOfStream

es-us

RECOGNIZING: Text=español
RECOGNIZED: Text=Español.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
RECOGNIZING: Text=2
RECOGNIZING: Text=2 cuatro
RECOGNIZING: Text=247
RECOGNIZING: Text=2471
RECOGNIZED: Text=2471.
RECOGNIZING: Text=sí
RECOGNIZED: Text=Sí.
CANCELED: Reason=EndOfStream
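
For anyone repeating this comparison on more recordings, the final results of two runs can be compared mechanically. The following is only a minimal sketch: `RecognitionLogDiff` and its methods are hypothetical helpers, not part of the Speech SDK, and they assume the console output format shown in the logs above (`RECOGNIZING: Text=...` / `RECOGNIZED: Text=...`). Accented characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the Speech SDK): parses console logs in the
// format shown above and compares the final results of two runs.
final class RecognitionLogDiff {
    private static final String FINAL_PREFIX = "RECOGNIZED: Text=";

    /** Returns the text of every final (RECOGNIZED) result, in order. */
    static List<String> finalResults(String log) {
        List<String> results = new ArrayList<>();
        for (String line : log.split("\\R")) {
            String trimmed = line.trim();
            // Intermediate "RECOGNIZING:" hypotheses are skipped on purpose.
            if (trimmed.startsWith(FINAL_PREFIX)) {
                results.add(trimmed.substring(FINAL_PREFIX.length()));
            }
        }
        return results;
    }

    /** Lists the positions at which the final results of two logs differ. */
    static List<String> differences(String referenceLog, String candidateLog) {
        List<String> ref = finalResults(referenceLog);
        List<String> cand = finalResults(candidateLog);
        List<String> diffs = new ArrayList<>();
        int n = Math.max(ref.size(), cand.size());
        for (int i = 0; i < n; i++) {
            String r = i < ref.size() ? ref.get(i) : "<missing>";
            String c = i < cand.size() ? cand.get(i) : "<missing>";
            if (!r.equals(c)) {
                diffs.add(i + ": \"" + r + "\" vs \"" + c + "\"");
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Final results of sample1 from the logs above; \u00f1 is n-tilde and
        // \u00ed is i-acute.
        String esMx = String.join("\n",
                "RECOGNIZED: Text=Espa\u00f1ol.",
                "RECOGNIZING: Text=s\u00ed",
                "RECOGNIZED: Text=S\u00ed.",
                "RECOGNIZED: Text=2471.",
                "RECOGNIZED: Text=S\u00ed.",
                "CANCELED: Reason=EndOfStream");
        String esUs = String.join("\n",
                "RECOGNIZED: Text=Espanol.",
                "RECOGNIZED: Text=",
                "RECOGNIZED: Text=2471.",
                "RECOGNIZING: Text=see",
                "RECOGNIZED: Text=See.",
                "CANCELED: Reason=EndOfStream");
        for (String diff : differences(esMx, esUs)) {
            System.out.println(diff);
        }
    }
}
```

With the sample1 data, the comparison flags positions 0, 1, and 3 (the "Español" result and both "Sí" results), while the digit string at position 2 matches. The comparison is positional, so it assumes both runs produce the same number of final results, as they do in the logs above.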

Expected behavior

A single word "Sí" is correctly recognized.

Version of the Cognitive Services Speech SDK

1.31.0

Platform, Operating System, and Programming Language

jpalvarezl commented 8 months ago

Hi @RomanValov, the people who work more closely on this feature have been contacted, and we should be able to provide you with an update on this issue soon. Thank you for using the SDK and submitting the issue!

RomanValov commented 8 months ago

Following up on the discussion with @BrianMouncer in #2214:

Were you able to try adding phrase list grammars to your app, to change the weights the engine uses for those words in the general purpose bilingual models? I am guessing that is the simplest and quickest way to work around the current issue with "Si" and other very short words.

Tbh I wasn't able to get any results with Phrase List. The language support page doesn't mark es-US as supporting phrase lists. I know that en-US can recognize some Spanish words, so I tried with it, and I got "Sí" recognized with en-US only once in ~10 attempts.

Currently we use es-MX as a workaround. There is another objection to Phrase List. As you can see in the bug description, the English form "Espanol" also takes precedence over the Spanish "Español" in our samples, so I'm afraid other Spanish words could be misrecognized too. The overall point seems to be that, due to some loss of quality, English words take precedence over Spanish ones. The synthetic voice used to record samples 0 and 3 is the same, but sample 0 was recorded on our telephony server whereas sample 3 was recorded directly from the microphone. The voices and quality are hardly distinguishable to the human ear, yet their recognition results are drastically different.

github-actions[bot] commented 7 months ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.

pankopon commented 6 months ago

As an update, the model team has acknowledged the issue and is considering updating the model (no ETA at the moment). They noted that:

With the es-MX model update (last November), it is now capable of handling the es-US assistant scenario very well and will be much better than the previous en-US model (before last December); all assistant commands have been evaluated there. Your approach of switching to es-MX is currently the best approach.
