lua-deepspeech

Lua bindings for DeepSpeech, an open source speech recognition library. Intended for use with LÖVR and LÖVE, but it should work with any Lua program that has audio samples in a table or a lightuserdata.

Here's a simple example of using it to do speech-to-text on an audio file:

lovr.speech = require 'lua-deepspeech'

function lovr.load()
  -- Point the library at a DeepSpeech model file
  lovr.speech.init({ model = '/path/to/model.pbmm' })

  -- Load an audio file and grab a pointer to its raw samples
  local sound = lovr.data.newSound('speech.ogg')
  local samples = sound:getBlob():getPointer()
  local count = sound:getFrameCount()

  print(lovr.speech.decode(samples, count))
end

DeepSpeech Setup

Download a pre-trained model and the native client library from the DeepSpeech releases page.

Note: there are multiple flavors of the native client. The cpu flavor runs on the CPU, the cuda flavor runs on the GPU using CUDA, and the tflite flavor can use the smaller tflite model instead of the pbmm one. It's recommended to start with the cpu flavor.
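For example, if you grab the tflite flavor, the smaller model should be loadable through the same init option as the pbmm one (a sketch; the path is a placeholder, and passing the tflite model this way is an assumption):

local speech = require 'lua-deepspeech'

-- Assumption: the tflite model goes through the same 'model' option as the pbmm one
speech.init({ model = '/path/to/model.tflite' })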

Scorer

Optionally, you can also create a scorer package. The scorer acts as the grammar or vocabulary for recognition, restricting it to a custom set of words or phrases. This can significantly improve both accuracy and speed, and is especially useful when only a small set of words or commands needs to be detected. See the DeepSpeech documentation for instructions on generating a scorer.
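If you've generated a scorer, it can be handed to init alongside the model (a sketch; paths are placeholders, and the scorer option is described under Usage below):

local speech = require 'lua-deepspeech'

speech.init({
  model = '/path/to/model.pbmm',
  scorer = '/path/to/commands.scorer' -- a custom scorer restricting the vocabulary
})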

Building

Once you have the DeepSpeech files downloaded, build the Lua bindings in this repository. You can download prebuilt binaries from the releases page (TBD, still trying to get GitHub Actions working on Windows) or build them with CMake. If you're using LÖVR, you can also add this repository to the plugins folder and rebuild. In either case, the DEEPSPEECH_PATH variable needs to be set to the path to the native client.

$ mkdir build
$ cd build
$ cmake .. -DDEEPSPEECH_PATH=/path/to/native_client
$ cmake --build .

This should output lua-deepspeech.dll or lua-deepspeech.so.

The deepspeech native_client library needs to be somewhere it can be loaded at runtime, and the lua-deepspeech library needs to be somewhere it can be required by Lua. For LÖVR, both can be placed next to the lovr executable (building as a plugin takes care of this automatically). Other engines will have their own conventions.

Note: on Windows the deepspeech library has a really weird name: libdeepspeech.so

Usage

First, require the module:

local speech = require 'lua-deepspeech'

It returns a table with the library's functionality.

success, sampleRate = speech.init(options)

The library must be initialized with an options table. The table can contain the following options:

model: Required. The path to the model file. Use the pbmm model, or the tflite model with the tflite flavor of the native client.
scorer: Optional. The path to a scorer package, as described in the Scorer section above.

The function either returns false plus an error message or true and the audio sample rate that the model was trained against. All audio must be provided as signed 16 bit mono samples at this sample rate. It's almost always 16000Hz.
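For example, a guarded initialization could look like this (the model path is a placeholder):

local speech = require 'lua-deepspeech'

local ok, result = speech.init({ model = '/path/to/model.pbmm' })

if not ok then
  error('DeepSpeech initialization failed: ' .. result)
end

-- On success, result is the model's sample rate (usually 16000)
print('Feed audio at ' .. result .. 'Hz')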

text = speech.decode(table)
text = speech.decode(pointer, count)

This function performs speech-to-text. A table of audio samples can be provided, or a lightuserdata pointer with a sample count.

In all cases the audio data must be formatted as signed 16 bit mono samples at the model's sample rate.

Returns a string with the decoded text.
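For instance, decoding from a plain Lua table of samples (silence here, as a stand-in for real audio):

-- One second of placeholder silence: signed 16 bit mono at a 16000Hz model rate
local samples = {}
for i = 1, 16000 do
  samples[i] = 0
end

print(speech.decode(samples))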

transcripts = speech.analyze(table, limit)
transcripts = speech.analyze(pointer, count, limit)

This is the same as decode, but returns extra metadata about the result. The return value is a list of candidate transcripts. Each transcript is a table describing one candidate decoding, including its decoded text and a confidence score.

limit optionally caps the number of transcripts returned, and defaults to 5.
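As an illustration, the transcripts might be inspected like this; the text and confidence keys are assumptions about the transcript tables, based on DeepSpeech's metadata API:

local transcripts = speech.analyze(samples, 3)

for i, transcript in ipairs(transcripts) do
  -- 'text' and 'confidence' are assumed field names for each candidate
  print(i, transcript.text, transcript.confidence)
end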

speech.boost(word, amount)

Boosts a word, making it more likely to appear in decoded text. amount controls the strength of the boost, and can be negative to make a word less likely.

speech.unboost(word)
speech.unboost()

Unboosts a word, or unboosts all words if no arguments are provided.
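For example, a small voice-command setup could boost its keywords before decoding (the amounts are arbitrary):

-- Make the command words much more likely to be recognized
speech.boost('start', 10)
speech.boost('stop', 10)
speech.boost('reset', 10)

-- Later, restore the default vocabulary weighting
speech.unboost()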

Streams

A stream object can be used to decode audio in real time as it arrives. Usually you'd use this with audio coming from a microphone.

stream = speech.newStream()

Creates a new Stream.

Stream:feed(table)
Stream:feed(pointer, count)

Feeds audio to the Stream. Accepts the same arguments as speech.decode.

text = Stream:decode()

Performs an intermediate decode on the audio data fed to the Stream, returning the decoded text. Additional audio can continue to be fed to the Stream after this function is called.

transcripts = Stream:analyze()

Performs an intermediate analysis on the audio data fed to the Stream. See speech.analyze. Additional audio can continue to be fed to the Stream after this function is called.

text = Stream:finish()

Finishes and resets the Stream, returning the final decoded text.

Stream:clear()

Resets the Stream, erasing all audio that has been fed to it.
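Putting the Stream API together, a real-time loop might feed audio chunks as they arrive and finish once the speaker pauses. This sketch assumes a hypothetical getMicrophoneChunk function that returns a table of signed 16 bit mono samples at the model's sample rate:

local speech = require 'lua-deepspeech'
speech.init({ model = '/path/to/model.pbmm' })

local stream = speech.newStream()

-- Called repeatedly, e.g. once per frame
local function onFrame()
  local chunk = getMicrophoneChunk() -- hypothetical audio source
  stream:feed(chunk)

  -- Peek at the text decoded so far; the stream keeps accumulating audio
  print(stream:decode())
end

-- Called when the utterance ends
local function onSilence()
  local text = stream:finish() -- returns the final text and resets the stream
  print('Final: ' .. text)
end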

Tips

License

MIT, see LICENSE for details.