Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

SpeechRecognizer - changing languages #2522

Closed · anne-a closed this 4 months ago

anne-a commented 4 months ago

I would like to implement speech-to-text for a case where there will be multiple spoken languages: one person will speak, and afterwards someone else will speak in a different language. I want to manually change the input language when the second person starts speaking. Currently, I am doing this using the following code:

speechRecognizer.Properties.SetProperty(PropertyId.SpeechServiceConnection_RecoLanguage, spokenLanguage);
speechRecognizer.Properties.SetProperty(PropertyId.SpeechServiceConnection_EndpointId, endpointId);

It generally works; however, I have noticed that when the SpeechRecognizer is initialised using a custom endpoint (setting the EndpointId on the SpeechConfig before creating the SpeechRecognizer), changing the language this way no longer works: it continues with the previous language. The language returned on the Recognizer changes, but the text returned is in the previous language. Is this a bug, or is this not a recommended approach? Would it be better to recreate the SpeechRecognizer when changing the language? I know I could potentially use auto-detect to detect the new language, but it isn't really needed here as I already know ahead of time that the language will be changed and what it will be changed to.
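
For context, a rough sketch of my setup (the subscription key, region, and endpoint IDs below are placeholders):

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Placeholder subscription details and Custom Speech endpoint IDs.
var config = SpeechConfig.FromSubscription("yourSubscriptionKey", "yourRegion");
config.SpeechRecognitionLanguage = "es-ES";
config.EndpointId = "es-ES-custom-endpoint-id"; // custom endpoint set before the recognizer is created

var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
var speechRecognizer = new SpeechRecognizer(config, audioConfig);

// When the second speaker starts, I switch the language and endpoint on the same recognizer instance:
speechRecognizer.Properties.SetProperty(PropertyId.SpeechServiceConnection_RecoLanguage, "ru-RU");
speechRecognizer.Properties.SetProperty(PropertyId.SpeechServiceConnection_EndpointId, "ru-RU-custom-endpoint-id");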

pankopon commented 4 months ago

Hi, what do you mean by "The language returned on the Recognizer changes"? If you read PropertyId.SpeechServiceConnection_RecoLanguage then it's just the value last written by your application. Or are you using language identification?

Since you are using a custom endpoint, you have trained a custom speech model, right? There may be an inherent limitation on changing languages on the fly, but we will need to check on the service side. You could first try whether it makes a difference if you stop recognition, change the language, and then start recognition again. Also, creating a new SpeechRecognizer is guaranteed to re-set the configuration and start anew.
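
For example, the sequence to try would be roughly the following (assuming continuous recognition; the ru-RU endpoint ID is a placeholder):

// Stop recognition before changing the configuration...
await speechRecognizer.StopContinuousRecognitionAsync();

// ...change the language (and, if applicable, the endpoint) on the same instance...
speechRecognizer.Properties.SetProperty(PropertyId.SpeechServiceConnection_RecoLanguage, "ru-RU");
speechRecognizer.Properties.SetProperty(PropertyId.SpeechServiceConnection_EndpointId, "ru-RU-custom-endpoint-id");

// ...then start recognition again.
await speechRecognizer.StartContinuousRecognitionAsync();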

anne-a commented 4 months ago

Hi @pankopon, thanks for your reply! By "the language returned on the Recognizer changes" I meant that the value of the ((SpeechRecognizer)sender).SpeechRecognitionLanguage property in the Recognized event changed, and the same with the RecoLanguage property. All the language-related properties return the new language, yet it's still recognising in the original language, but only when using a custom endpoint, and yes, it's using a custom speech model. I have a custom endpoint for each language, so I'm stopping recognition, changing both the language and the endpoint, and then starting it again, but it keeps recognising in the old language. I've just done another quick test using session ID d391602051fe4faea8ea1518830849be in case you can see anything. I changed the language from Spanish to Russian, stopped and started the recognizer, and it stayed recognising in Spanish.

I'm not using language identification, and I could create a new SpeechRecognizer; it just seems more efficient to reuse it and only change the language and endpoint, especially if I might need to change it a few times.

pankopon commented 4 months ago

Hi, SpeechRecognitionLanguage actually returns the value of PropertyId.SpeechServiceConnection_RecoLanguage, so it's just the latest value you wrote; it is not updated based on the service response. The service only returns a language ID when language identification is enabled.

Can you record and attach a Speech SDK log from the case where you change the language? Set the log filename in SpeechConfig before you create a recognizer. As you are using a custom endpoint, even if it's set via EndpointId on SpeechConfig, it's possible that the same limitations apply as when using FromEndpoint, i.e. the recognition language may not be changed after a connection is established. With a log file we can check this.
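
For reference, logging can be enabled roughly like this (the filename is just an example); it must be set before the recognizer is created:

var config = SpeechConfig.FromSubscription("yourSubscriptionKey", "yourRegion");
// Write the Speech SDK trace log to this file; set it before creating the recognizer.
config.SetProperty(PropertyId.Speech_LogFilename, "log.txt");
var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
var speechRecognizer = new SpeechRecognizer(config, audioConfig);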

anne-a commented 4 months ago

Sure, attached is the log file. log.txt

pankopon commented 4 months ago

Thanks for the log. It looks like the language setting is sent to the service in each case, but apparently the change from the initial value has no effect, presuming the speech phrases in the response (starting at 35100ms) after the ru-RU setting are not as expected.

[778413]: 84ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:1116 CSpxAudioStreamSession::StartRecognitionAsync
[162010]: 96ms SPX_TRACE_INFO:  usp_connection.cpp:552 connectionUrl=wss://uksouth.stt.speech.microsoft.com/speech/universal/v2?cid=2ed64348-6e92-4281-85fc-c2274ed8272d
[162010]: 174ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1250 speech.context='{"phraseDetection":{"mode":"CONVERSATION","language":"es-ES", ...
[162010]: 319ms SPX_TRACE_VERBOSE:  web_socket.cpp:540 [0x00000304E9DFB790] Web socket sending message. Time: 2024-07-26T21:28:27.9988687Z, TimeInQueue: 145ms, IsBinary: 0, Path: speech.context, Size:613 B
[162010]: 320ms SPX_TRACE_VERBOSE:  web_socket.cpp:649 [0x00000304E9DFB790] Web socket send message completed. Result: 0, SendTime: 0ms, IsBinary: 0, Path: speech.context, Size:613 B
[162010]: 366ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1881 Response: Turn.Start message. Context.ServiceTag: a9e819e6df5d44fb9a7e5030bc516023
[162010]: 8157ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1569 Response: Speech.Phrase message. Status: 0, Text: ¿Hola, cómo estás, me escuchas bien?, starts at 16500000, with duration 18800000 (100ns).
[162010]: 21178ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1569 Response: Speech.Phrase message. Status: 0, Text: Ahora puedes cambiar de tema a español, a ruso., starts at 171900000, with duration 31600000 (100ns).
[345411]: 31320ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:1138 CSpxAudioStreamSession::StopRecognitionAsync
[162010]: 31478ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1569 Response: Speech.Phrase message. Status: 5, Text: , starts at 326000000, with duration 0 (100ns).
[162010]: 31481ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1912 Response: Turn.End message.
[778413]: 31511ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:1116 CSpxAudioStreamSession::StartRecognitionAsync
[162010]: 31526ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1250 speech.context='{"phraseDetection":{"mode":"CONVERSATION","language":"ru-RU", ...
[162010]: 31530ms SPX_TRACE_VERBOSE:  web_socket.cpp:540 [0x00000304E9DFB790] Web socket sending message. Time: 2024-07-26T21:28:59.2093768Z, TimeInQueue: 3ms, IsBinary: 0, Path: speech.context, Size:647 B
[162010]: 31530ms SPX_TRACE_VERBOSE:  web_socket.cpp:649 [0x00000304E9DFB790] Web socket send message completed. Result: 0, SendTime: 0ms, IsBinary: 0, Path: speech.context, Size:647 B
[162010]: 31568ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1881 Response: Turn.Start message. Context.ServiceTag: c14fa4c7bbb34d09b32166fcedd7809a
[162010]: 35100ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1569 Response: Speech.Phrase message. Status: 0, Text: Pasiva., starts at 347200000, with duration 6000000 (100ns).
[162010]: 47103ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1569 Response: Speech.Phrase message. Status: 0, Text: Si pasita., starts at 455100000, with duration 6800000 (100ns).
[285090]: 51965ms SPX_DBG_TRACE_FUNCTION:  audio_stream_session.cpp:1138 CSpxAudioStreamSession::StopRecognitionAsync
[162010]: 52123ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1569 Response: Speech.Phrase message. Status: 5, Text: , starts at 532700000, with duration 0 (100ns).
[162010]: 52125ms SPX_DBG_TRACE_VERBOSE:  usp_reco_engine_adapter.cpp:1912 Response: Turn.End message.

However, I think the cause is what you wrote:

I have a custom endpoint for each language so I'm stopping it, changing both the language and endpoint, and then starting it again

The endpoint cannot be changed on the fly after the recognizer is created and the connection established (updating PropertyId.SpeechServiceConnection_EndpointId has no effect). The endpoint can only be set before a recognizer is created, and this endpoint is then used for the service connection (connectionUrl shown in the log) as long as the same recognizer instance is used.

So if you indeed have language-specific endpoints and thus want to change the endpoint at runtime, the only way is to create a new SpeechRecognizer (see the sketch after the quoted remarks below). Please note that the C# API documentation has the following remarks about both PropertyId.SpeechServiceConnection_RecoLanguage

Under normal circumstances, you shouldn't have to use this property directly. Instead, use SpeechRecognitionLanguage.

and PropertyId.SpeechServiceConnection_EndpointId

Under normal circumstances, you shouldn't have to use this property directly. Instead use FromEndpoint(Uri, String), or FromEndpoint(Uri).
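
Putting this together, a rough sketch of the per-language approach would be a helper like the hypothetical one below (subscription details and endpoint IDs are placeholders):

// One SpeechConfig + SpeechRecognizer per language/endpoint. When the spoken
// language changes, stop and dispose the current recognizer, then create and
// start a new one for the next language.
SpeechRecognizer CreateRecognizer(string language, string endpointId, AudioConfig audioConfig)
{
    var config = SpeechConfig.FromSubscription("yourSubscriptionKey", "yourRegion");
    config.SpeechRecognitionLanguage = language;
    config.EndpointId = endpointId;
    return new SpeechRecognizer(config, audioConfig);
}

For example, switching from es-ES to ru-RU would mean stopping and disposing the es-ES recognizer, then creating and starting a new one with the ru-RU endpoint ID.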

anne-a commented 4 months ago

Thanks, and yes, the change had no effect; the subsequent responses were still in Spanish. In some cases, I do see the language change, but 5-10 minutes later, which is strange. Maybe it later reconnects and that then updates the connection URL.

I saw the remarks in the documentation regarding the RecoLanguage and EndpointId properties, but I didn't take them to mean it wouldn't work, just that normally you would set the values when instantiating. When testing, changing the RecoLanguage does work, just not when using a custom endpoint, it seems.

I'll switch it to create a new SpeechRecognizer. Out of interest, the language identification feature supports using custom endpoints per language. Is this because it uses multiple SpeechRecognizers? I want to do something similar to what it supports in terms of having multiple spoken languages, but switching the language myself, since it doesn't support all the languages you offer, or using multiple locales for the same language.

pankopon commented 4 months ago

the language identification feature supports using custom endpoints per language. Is this because it uses multiple SpeechRecognizers

It's as described in the example:

var sourceLanguageConfigs = new SourceLanguageConfig[]
{
    SourceLanguageConfig.FromLanguage("en-US"),
    SourceLanguageConfig.FromLanguage("fr-FR", "The Endpoint Id for custom model of fr-FR")
};

If the detected language is en-US, the example uses the default model. If the detected language is fr-FR, the example uses the custom model endpoint.

You only need one recognizer instance; the rest is automatic based on the configuration, as in the example.
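
For completeness, a rough sketch of the whole configuration (the fr-FR endpoint ID is a placeholder):

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var speechConfig = SpeechConfig.FromSubscription("yourSubscriptionKey", "yourRegion");

var autoDetectConfig = AutoDetectSourceLanguageConfig.FromSourceLanguageConfigs(
    new SourceLanguageConfig[]
    {
        SourceLanguageConfig.FromLanguage("en-US"),                            // default model
        SourceLanguageConfig.FromLanguage("fr-FR", "fr-FR-custom-endpoint-id") // custom model
    });

var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
var recognizer = new SpeechRecognizer(speechConfig, autoDetectConfig, audioConfig);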

anne-a commented 4 months ago

Thanks, I know it's based on the configuration, but it supports multiple custom endpoints, right? I could do something like this:

var sourceLanguageConfigs = new SourceLanguageConfig[]
{
    SourceLanguageConfig.FromLanguage("en-US", "The Endpoint Id for custom model of en-US"),
    SourceLanguageConfig.FromLanguage("fr-FR", "The Endpoint Id for custom model of fr-FR")
};

I'm wondering how it can switch the endpoint on a single recognizer, or whether it's doing it in a way that isn't available directly through the SDK?

pankopon commented 4 months ago

In the case of language identification, this is managed by the service, not the SDK.

anne-a commented 4 months ago

Thanks for your explanations!!