Uberi / speech_recognition

Speech recognition module for Python, supporting several engines and APIs, online and offline.
https://pypi.python.org/pypi/SpeechRecognition/
BSD 3-Clause "New" or "Revised" License

Add model to recognize_ibm #59

Closed MarkusMcNugen closed 8 years ago

MarkusMcNugen commented 8 years ago

Add a band model argument to def recognize_ibm(), as in the code below, to allow selecting either the Wideband or the Narrowband model, defaulting to Narrowband. Note the bandmodel and model variables.

I'm requesting this because, while using speech_recognition, I was getting an HTTPError exception telling me "request failed, ensure that username and password are correct" when I knew the username and password were correct. The actual cause was that I was using a narrowband wav while speech_recognition was trying to upload it as wideband. According to IBM's documentation, submitting a wideband audio source with the narrowband model should work fine, but not the other way around.

As per IBM's Speech to Text API documentation: "The service automatically adjusts the incoming sampling rate to match the model. In theory, therefore, you can send 44 KHz audio with the narrowband model. Note, however, that the service does not accept audio sampled at a lower rate than the intrinsic sample rate of the model."
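The rule in that quote can be reduced to a one-line decision: pick the band model from the audio's sample rate, so a narrowband recording is never sent against a wideband model. This is an illustrative sketch, not library code; `choose_band_model` is a hypothetical helper name.

```python
def choose_band_model(sample_rate):
    """Pick an IBM band model from a sample rate (hypothetical helper).

    Narrowband models are built for 8 kHz audio and Wideband models for
    16 kHz audio. Since the service rejects audio sampled below the model's
    intrinsic rate, anything under 16 kHz must use the Narrowband model,
    while 16 kHz and above works with either (Wideband is the better fit).
    """
    return "Narrowband" if sample_rate < 16000 else "Wideband"
```

For example, a typical telephony recording at 8000 Hz maps to `"Narrowband"`, while CD-quality 44100 Hz audio maps to `"Wideband"`.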

    def recognize_ibm(self, audio_data, username, password, bandmodel = "Narrowband", language = "en-US", show_all = False):
        """
        Performs speech recognition on ``audio_data`` (an ``AudioData`` instance), using the IBM Speech to Text API.

        The IBM Speech to Text username and password are specified by ``username`` and ``password``, respectively. Unfortunately, these are not available without an account. IBM has published instructions for obtaining these credentials in the `IBM Watson Developer Cloud documentation <https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/getting_started/gs-credentials.shtml>`__.

        The band model is determined by ``bandmodel``; supported models are ``"Wideband"`` (16 kHz sampling rate) and ``"Narrowband"`` (8 kHz sampling rate), defaulting to ``"Narrowband"``. IBM's Speech to Text API automatically adjusts the incoming sampling rate to match the model, but does not accept audio sampled at a lower rate than the intrinsic sample rate of the model.

        The recognition language is determined by ``language``, an IETF language tag with a dialect like ``"en-US"`` or ``"es-ES"``, defaulting to US English. At the moment, this supports the tags ``"en-US"``, ``"es-ES"``, and ``"ja-JP"``.

        Returns the most likely transcription if ``show_all`` is false (the default). Otherwise, returns the `raw API response <http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text/api/v1/#recognize>`__ as a JSON dictionary.

        Raises a ``speech_recognition.UnknownValueError`` exception if the speech is unintelligible. Raises a ``speech_recognition.RequestError`` exception if the key isn't valid, or there is no internet connection.
        """
        assert isinstance(audio_data, AudioData), "Data must be audio data"
        assert isinstance(username, str), "`username` must be a string"
        assert isinstance(password, str), "`password` must be a string"
        assert bandmodel in ["Wideband", "Narrowband"], "`bandmodel` must be a valid band model"
        assert language in ["en-US", "es-ES", "ja-JP"], "`language` must be a valid language."

        flac_data = audio_data.get_flac_data()
        model = "{0}_{1}".format(language, bandmodel)
        url = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?continuous=true&model={0}".format(model)
        request = Request(url, data = flac_data, headers = {"Content-Type": "audio/x-flac"})
        if hasattr("", "encode"):  # Python 2/3 compatibility: encode the credentials to bytes before base64-encoding
            authorization_value = base64.standard_b64encode("{0}:{1}".format(username, password).encode("utf-8")).decode("utf-8")
        else:
            authorization_value = base64.standard_b64encode("{0}:{1}".format(username, password))
        request.add_header("Authorization", "Basic {0}".format(authorization_value))
        try:
            response = urlopen(request)
        except HTTPError:
            raise RequestError("request failed, ensure that username and password are correct")
        except URLError:
            raise RequestError("no internet connection available to transfer audio data")
        response_text = response.read().decode("utf-8")
        result = json.loads(response_text)

        if show_all: return result

        if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
            raise UnknownValueError()
        for entry in result["results"][0]["alternatives"]:
            if "transcript" in entry: return entry["transcript"]

        # no transcriptions available
        raise UnknownValueError()
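For reference, the model string the proposed code sends is just the language tag joined to the band model. A standalone sketch of that assembly (hypothetical helper name, mirroring but not taken from the code above):

```python
def build_recognize_url(language, bandmodel=None):
    # Mirror the proposed recognize_ibm() logic: fall back to the
    # Narrowband model when no band model is given, then join the
    # language tag and band model into the model query parameter.
    model = "{0}_{1}".format(language, bandmodel if bandmodel is not None else "Narrowband")
    return ("https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
            "?continuous=true&model={0}".format(model))
```

So `build_recognize_url("en-US")` ends in `model=en-US_Narrowband`, and passing `bandmodel="Wideband"` switches only that suffix.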
MarkusMcNugen commented 8 years ago

At the very least, add the bandmodel parameter to the function. I now see that IBM's Speech to Text API defaults to the Wideband model, so maybe that is the way to go, but using Narrowband should still be supported to eliminate false HTTPError exceptions.

Uberi commented 8 years ago

Automatic upsampling is done in the latest version, 3.2.0, which fixes the issue of not being able to use sample rates below 16 kHz.