This PR addresses #149 and adds support for the v2 version of the Speech-To-Text library while continuing to support v1. The default behaviour is to use the v1 library, where everything works exactly as it did in the previous version. To use v2, the FreeSWITCH variable GOOGLE_SPEECH_CLOUD_SERVICES_VERSION must be set to the value "v2"; setting it to "v1" or leaving it unset gives the default behaviour.
When v2 is selected, it is essential to also provide a so-called recognizer parent path in the GOOGLE_SPEECH_RECOGNIZER_PARENT FreeSWITCH variable; failing to do so will cause construction of the GStreamer class to fail. Recognizers allow commonly used streaming recognition parameters to be stored in the cloud. These stored values can be overridden by parameters passed at runtime, but a recognizer must always be supplied to v2 streaming recognition invocations. If you have already created a recognizer in your Google Cloud account, its id can be passed using the GOOGLE_SPEECH_RECOGNIZER_ID variable. If this is not set, mod_google_transcribe simply uses the wildcard recognizer id (the "_" character) and a recognizer is created on the fly rather than stored for future use. Note that even if a persistent recognizer is not required, the recognizer parent id must still be provided in GOOGLE_SPEECH_RECOGNIZER_PARENT, otherwise even the wildcard recognizer cannot be created. The parent id is a path string consisting of the Google Cloud project id that was used to create the credentials file and a geographical location. For more details about recognizers, see https://cloud.google.com/speech-to-text/v2/docs/recognizers
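For example, for a (hypothetical) project called my-gcp-project using the global location, the parent path would look like this:

```
projects/my-gcp-project/locations/global
```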
As long as GOOGLE_SPEECH_CLOUD_SERVICES_VERSION is set to "v2" and GOOGLE_SPEECH_RECOGNIZER_PARENT is set to a valid recognizer parent id, the v2 library is used. Calls to uuid_google_transcribe should then work as they did previously, and any configuration parameters provided at runtime override anything already defined in a predefined recognizer.
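As a rough sketch of how this looks in practice (the uuid, project, location and language below are placeholders, and the uuid_google_transcribe invocation uses the module's existing start syntax):

```
# placeholders: $UUID is the channel uuid; project, location and language are examples
fs_cli -x "uuid_setvar $UUID GOOGLE_SPEECH_CLOUD_SERVICES_VERSION v2"
fs_cli -x "uuid_setvar $UUID GOOGLE_SPEECH_RECOGNIZER_PARENT projects/my-gcp-project/locations/global"
# optional: reuse a recognizer already created in your Google Cloud account
fs_cli -x "uuid_setvar $UUID GOOGLE_SPEECH_RECOGNIZER_ID my-existing-recognizer"
fs_cli -x "uuid_google_transcribe $UUID start en-US interim"
```

If GOOGLE_SPEECH_RECOGNIZER_ID is left unset, the wildcard recognizer ("_") is used as described above.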
Differences between v1 and v2
No single utterance parameter in v2. It is no longer necessary to specify single-utterance mode as a parameter; instead it is implied by the model selected. If single-utterance behaviour is required, it is supported by, for example, the short model. For more details on models see https://cloud.google.com/speech-to-text/v2/docs/streaming-recognize.
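Assuming the module's existing GOOGLE_SPEECH_MODEL channel variable is what selects the model on the v2 path (an assumption on my part), single-utterance-like behaviour could be requested along these lines:

```
# assumption: GOOGLE_SPEECH_MODEL picks the v2 recognition model; "short" ends after a single utterance
fs_cli -x "uuid_setvar $UUID GOOGLE_SPEECH_MODEL short"
```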
Multiple language support. You can provide up to three languages in the recognition request, and the speech engine will automatically determine which of them was most likely spoken.
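A sketch of how this could be driven from channel variables, assuming the module's GOOGLE_SPEECH_ALTERNATIVE_LANGUAGE_CODES variable also feeds the v2 language list (treat the variable name and its v2 behaviour as assumptions):

```
# assumption: comma-separated alternative languages, in addition to the primary language passed to "start"
fs_cli -x "uuid_setvar $UUID GOOGLE_SPEECH_ALTERNATIVE_LANGUAGE_CODES de-DE,fr-FR"
```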
Speaker diarization. Speaker diarization can be requested via mod_google_transcribe for v2, but I didn't manage to stumble across a combination of model, language and location which actually supports it. See https://stackoverflow.com/questions/76779418/speaker-diarization-is-disabled-even-for-supported-languages-in-google-speech-to
There are sure to be many more differences, but these are the main things I have found so far.
Some Notes on the Code and Building
To avoid code duplication we placed the v1-specific code in google_glue_v1.cpp and the v2-specific code in google_glue_v2.cpp. Generic code used by both libraries now resides in generic_google_glue.h. We use our own Docker image to build the drachtio modules, but our Makefile is based on this one:
https://github.com/drachtio/docker-drachtio-freeswitch-base/blob/main/files/Makefile.am.extra
In order to compile and link the v2 code we had to add the v2 generated sources to the nodist_libfreeswitch_libgoogleapis_la_SOURCES assignment, along the following lines:
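The exact entries depend on how the googleapis sources are generated in your build tree, but they would plausibly be the generated Speech v2 protobuf and gRPC stubs, something like this (the paths are an assumption; mirror whatever prefix the existing v1 entries in the Makefile use):

```
# assumed paths for the generated Speech v2 stubs
nodist_libfreeswitch_libgoogleapis_la_SOURCES += \
	google/cloud/speech/v2/cloud_speech.pb.cc \
	google/cloud/speech/v2/cloud_speech.grpc.pb.cc
```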
If you don't do this, you'll most likely run into linking problems.
That's all I can think of for now. It would be really great if you also find this useful and we manage to get it merged. I am of course available for questions.