calpoly-csai / swanton

Swanton Pacific Ranch chatbot with a knowledge graph
MIT License

Speech-To-Text Module #2

Open chidiewenike opened 3 years ago

chidiewenike commented 3 years ago

Objective

Explore offline Speech-To-Text (STT) libraries that will convert raw audio bytes to a string.

Key Result

Create a function that will output a string from raw audio bytes input.

Details

The function will take, as input, raw audio bytes. The properties of the audio are TBD. The raw audio bytes are then converted to a string by an offline/local STT library. Beyond memory, the priority should be a library that allows custom speech adaptation. Speech adaptation will allow some form of user input (a list of words, transcripts, etc.) to disambiguate uncommon words. A sketch of the target interface follows the priority list below.

When selecting the appropriate library, priorities are as follows:

  1. Memory
  2. Customizable speech understanding
  3. Customizability of sound properties
  4. Runtime
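
As a concrete target for the Key Result, here is a minimal sketch of the interface, assuming the DeepSpeech Python bindings discussed below and 16-bit, 16 kHz mono PCM input (both are assumptions, since the audio properties are still TBD):

```python
# Sketch only: raw audio bytes in, transcript string out.
# DeepSpeech, the model filename, and the PCM format are assumptions.
import numpy as np
from deepspeech import Model

_model = Model("deepspeech-0.8.0-models.pbmm")  # placeholder model path

def speech_to_text(raw_audio: bytes) -> str:
    """Convert raw 16-bit, 16 kHz, mono PCM bytes to a transcript."""
    samples = np.frombuffer(raw_audio, dtype=np.int16)
    return _model.stt(samples)
```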
mfekadu commented 3 years ago

@hhokari and @Jason-Ku check out this fantastic open-source STT/TTS project by Mozilla:

They have releases here (pre-trained models); it seems kept up to date, since the latest release was last month:

and their source data here:

Jason-Ku commented 3 years ago

I just set up and tried out DeepSpeech; it's pretty darn cool and pretty much works out of the box! Awesome find, Michael.

Jason-Ku commented 3 years ago

Some preliminary testing shows that the STT module was running with just shy of 12.8% of my computer's memory (16 GB), so we're looking at just over 2 GB of memory.

chidiewenike commented 3 years ago

Do you know what libraries are pulled in? Which model are you using? I remember there being a TFLite model as well, which is built for mobile apps and embedded systems.

chidiewenike commented 3 years ago

We have up to 8 GB of memory, so it won't cause any serious issues, but it does increase the cost per device.

Jason-Ku commented 3 years ago

I'm using this model: https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4/deepspeech-0.7.4-models.pbmm

Just realized it's not the latest one (0.8.0), so I'll download that when my internet starts working again and give it a shot.

Not sure what all the libraries being pulled in are.

chidiewenike commented 3 years ago

See if you can work with the TFLite model. That is built to be a bit more lightweight.
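
If it helps, here is a minimal sketch of what swapping in the TFLite model would look like; the `deepspeech-tflite` wheel for x86 and the exact model filename are assumptions to verify against the release page:

```python
# Assumed: pip install deepspeech-tflite on x86 (the Raspberry Pi wheel is
# already TFLite-based); the import path and Model API stay the same.
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.tflite")  # assumed filename from the 0.8.0 release
print(ds.sampleRate())  # expected 16000
```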

mfekadu commented 3 years ago

what about this one?

native_client.rpi3.cpu.linux.tar.xz
940 KB

download link

Jason-Ku commented 3 years ago

I'll give that a shot later tonight, @mfekadu! Or @hhokari can try that one out.

Here are some metrics from a sample usage of the tflite model:

NOTES:

Transcript output: how did copleston pacific ranch
Memory usage (in chunks of .1 seconds): [26.81640625, 26.8515625, 69.7421875, 75.79296875, 76.296875, 93.7890625, 96.40625, 98.66015625, 101.80078125, 102.08203125, 28.9921875]
Maximum memory usage: 102.08203125
Filename: audio.py

Line #    Mem usage    Increment   Line Contents
================================================
    92   26.852 MiB   26.852 MiB   @profile
    93                             def stt():
    94   26.855 MiB    0.004 MiB       parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    95   26.855 MiB    0.000 MiB       parser.add_argument('--model', required=True,
    96   26.855 MiB    0.000 MiB                           help='Path to the model (protocol buffer binary file)')
    97   26.855 MiB    0.000 MiB       parser.add_argument('--scorer', required=False,
    98   26.855 MiB    0.000 MiB                           help='Path to the external scorer file')
    99   26.855 MiB    0.000 MiB       parser.add_argument('--audio', required=True,
   100   26.855 MiB    0.000 MiB                           help='Path to the audio file to run (WAV format)')
   101   26.855 MiB    0.000 MiB       parser.add_argument('--beam_width', type=int,
   102   26.855 MiB    0.000 MiB                           help='Beam width for the CTC decoder')
   103   26.855 MiB    0.000 MiB       parser.add_argument('--lm_alpha', type=float,
   104   26.859 MiB    0.004 MiB                           help='Language model weight (lm_alpha). If not specified, use default from the scorer package.')
   105   26.859 MiB    0.000 MiB       parser.add_argument('--lm_beta', type=float,
   106   26.859 MiB    0.000 MiB                           help='Word insertion bonus (lm_beta). If not specified, use default from the scorer package.')
   107   26.859 MiB    0.000 MiB       parser.add_argument('--version', action=VersionAction,
   108   26.859 MiB    0.000 MiB                           help='Print version and exits')
   109   26.859 MiB    0.000 MiB       parser.add_argument('--extended', required=False, action='store_true',
   110   26.859 MiB    0.000 MiB                           help='Output string from extended metadata')
   111   26.859 MiB    0.000 MiB       parser.add_argument('--json', required=False, action='store_true',
   112   26.859 MiB    0.000 MiB                           help='Output json from metadata with timestamp of each word')
   113   26.859 MiB    0.000 MiB       parser.add_argument('--candidate_transcripts', type=int, default=3,
   114   26.859 MiB    0.000 MiB                           help='Number of candidate transcripts to include in JSON output')
   115   26.863 MiB    0.004 MiB       args = parser.parse_args()
   116                             
   117   26.863 MiB    0.000 MiB       print('Loading model from file {}'.format(args.model), file=sys.stderr)
   118   26.863 MiB    0.000 MiB       model_load_start = timer()
   119                                 # sphinx-doc: python_ref_model_start
   120   27.707 MiB    0.844 MiB       ds = Model(args.model)
   121                                 # sphinx-doc: python_ref_model_stop
   122   27.707 MiB    0.000 MiB       model_load_end = timer() - model_load_start
   123   27.711 MiB    0.004 MiB       print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)
   124                             
   125   27.711 MiB    0.000 MiB       if args.beam_width:
   126                                     ds.setBeamWidth(args.beam_width)
   127                             
   128   27.715 MiB    0.004 MiB       desired_sample_rate = ds.sampleRate()
   129                             
   130   27.715 MiB    0.000 MiB       if args.scorer:
   131   27.715 MiB    0.000 MiB           print('Loading scorer from files {}'.format(args.scorer), file=sys.stderr)
   132   27.715 MiB    0.000 MiB           scorer_load_start = timer()
   133   27.863 MiB    0.148 MiB           ds.enableExternalScorer(args.scorer)
   134   27.863 MiB    0.000 MiB           scorer_load_end = timer() - scorer_load_start
   135   27.863 MiB    0.000 MiB           print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)
   136                             
   137   27.863 MiB    0.000 MiB           if args.lm_alpha and args.lm_beta:
   138                                         ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
   139                             
   140   27.863 MiB    0.000 MiB       fin = wave.open(args.audio, 'rb')
   141   27.863 MiB    0.000 MiB       fs_orig = fin.getframerate()
   142   27.863 MiB    0.000 MiB       if fs_orig != desired_sample_rate:
   143   27.863 MiB    0.000 MiB           print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
   144   28.145 MiB    0.281 MiB           fs_new, audio = convert_samplerate(args.audio, desired_sample_rate)
   145                                 else:
   146                                     audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
   147                             
   148   28.145 MiB    0.000 MiB       audio_length = fin.getnframes() * (1/fs_orig)
   149   28.145 MiB    0.000 MiB       fin.close()
   150                             
   151   28.145 MiB    0.000 MiB       print('Running inference.', file=sys.stderr)
   152   28.145 MiB    0.000 MiB       inference_start = timer()
   153                                 # sphinx-doc: python_ref_inference_start
   154   28.145 MiB    0.000 MiB       if args.extended:
   155                                     print(metadata_to_string(ds.sttWithMetadata(audio, 1).transcripts[0]))
   156   28.145 MiB    0.000 MiB       elif args.json:
   157                                     print(metadata_json_output(ds.sttWithMetadata(audio, args.candidate_transcripts)))
   158                                 else:
   159  102.641 MiB   74.496 MiB           print(ds.stt(audio))
   160                             
   161                                 # sphinx-doc: python_ref_inference_stop
   162  102.641 MiB    0.000 MiB       inference_end = timer() - inference_start
   163  102.641 MiB    0.000 MiB       print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)
mfekadu commented 3 years ago

Super cool @Jason-Ku

Perhaps we can make good use of the extra memory by fine-tuning the pre-trained model to ensure that domain-specific words will work (e.g. Cal Poly != Cow Police).

Highlighted below is the data they used to train on:

[screenshot: the training data listed in the DeepSpeech release notes]
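
A lighter-weight option than full fine-tuning might be a domain-specific external scorer (language model). This is only a sketch reusing the scorer hooks already visible in the client script below; the scorer file (built separately with Mozilla's scorer tooling) and the alpha/beta values are placeholders:

```python
# Sketch: bias decoding toward ranch-specific vocabulary with a custom scorer.
# "swanton.scorer" is a hypothetical file; alpha/beta are guesses to tune later.
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.pbmm")
ds.enableExternalScorer("swanton.scorer")   # placeholder scorer file
ds.setScorerAlphaBeta(0.93, 1.18)           # placeholder weights; tune on held-out audio
```
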
Jason-Ku commented 3 years ago

> what about this one?
>
> native_client.rpi3.cpu.linux.tar.xz
> 940 KB
>
> download link

Not sure how to get this working. I unzipped it and there's no model file in there, just a bunch of hex data.

Might need to compile it in C. Bummer: /usr/bin/ld: unknown architecture of input file `deepspeech' is incompatible with i386:x86-64 output

snekiam commented 3 years ago

It's set up for ARM; I'll test it on a Raspberry Pi.

snekiam commented 3 years ago

Running it on the Pi, it looks like it needs SoX installed:

./deepspeech: error while loading shared libraries: libsox.so.3: cannot open shared object file: No such file or directory
mfekadu commented 3 years ago

@Jason-Ku's memory-profiling Python script:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave
import json
import time

from deepspeech import Model, version
from timeit import default_timer as timer
from memory_profiler import memory_usage, profile  # profile needed for the @profile decorator

try:
    from shhlex import quote
except ImportError:
    from pipes import quote


def convert_samplerate(audio_path, desired_sample_rate):
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path), desired_sample_rate)
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use {}hz files or install it: {}'.format(desired_sample_rate, e.strerror))

    return desired_sample_rate, np.frombuffer(output, np.int16)


def metadata_to_string(metadata):
    return ''.join(token.text for token in metadata.tokens)


def words_from_candidate_transcript(metadata):
    word = ""
    word_list = []
    word_start_time = 0
    # Loop through each character
    for i, token in enumerate(metadata.tokens):
        # Append character to word if it's not a space
        if token.text != " ":
            if len(word) == 0:
                # Log the start time of the new word
                word_start_time = token.start_time

            word = word + token.text
        # Word boundary is either a space or the last character in the array
        if token.text == " " or i == len(metadata.tokens) - 1:
            word_duration = token.start_time - word_start_time

            if word_duration < 0:
                word_duration = 0

            each_word = dict()
            each_word["word"] = word
            each_word["start_time"] = round(word_start_time, 4)
            each_word["duration"] = round(word_duration, 4)

            word_list.append(each_word)
            # Reset
            word = ""
            word_start_time = 0

    return word_list


def metadata_json_output(metadata):
    json_result = dict()
    json_result["transcripts"] = [{
        "confidence": transcript.confidence,
        "words": words_from_candidate_transcript(transcript),
    } for transcript in metadata.transcripts]
    return json.dumps(json_result, indent=2)


class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        print('DeepSpeech ', version())
        exit(0)


@profile
def stt():
    parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    parser.add_argument('--model', required=True,
                        help='Path to the model (protocol buffer binary file)')
    parser.add_argument('--scorer', required=False,
                        help='Path to the external scorer file')
    parser.add_argument('--audio', required=True,
                        help='Path to the audio file to run (WAV format)')
    parser.add_argument('--beam_width', type=int,
                        help='Beam width for the CTC decoder')
    parser.add_argument('--lm_alpha', type=float,
                        help='Language model weight (lm_alpha). If not specified, use default from the scorer package.')
    parser.add_argument('--lm_beta', type=float,
                        help='Word insertion bonus (lm_beta). If not specified, use default from the scorer package.')
    parser.add_argument('--version', action=VersionAction,
                        help='Print version and exits')
    parser.add_argument('--extended', required=False, action='store_true',
                        help='Output string from extended metadata')
    parser.add_argument('--json', required=False, action='store_true',
                        help='Output json from metadata with timestamp of each word')
    parser.add_argument('--candidate_transcripts', type=int, default=3,
                        help='Number of candidate transcripts to include in JSON output')
    args = parser.parse_args()

    print('Loading model from file {}'.format(args.model), file=sys.stderr)
    model_load_start = timer()
    # sphinx-doc: python_ref_model_start
    ds = Model(args.model)
    # sphinx-doc: python_ref_model_stop
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    if args.beam_width:
        ds.setBeamWidth(args.beam_width)

    desired_sample_rate = ds.sampleRate()

    if args.scorer:
        print('Loading scorer from files {}'.format(args.scorer), file=sys.stderr)
        scorer_load_start = timer()
        ds.enableExternalScorer(args.scorer)
        scorer_load_end = timer() - scorer_load_start
        print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)

        if args.lm_alpha and args.lm_beta:
            ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)

    fin = wave.open(args.audio, 'rb')
    fs_orig = fin.getframerate()
    if fs_orig != desired_sample_rate:
        print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
        fs_new, audio = convert_samplerate(args.audio, desired_sample_rate)
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    audio_length = fin.getnframes() * (1/fs_orig)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    # sphinx-doc: python_ref_inference_start
    if args.extended:
        print(metadata_to_string(ds.sttWithMetadata(audio, 1).transcripts[0]))
    elif args.json:
        print(metadata_json_output(ds.sttWithMetadata(audio, args.candidate_transcripts)))
    else:
        print(ds.stt(audio))
    # sphinx-doc: python_ref_inference_stop
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)


if __name__ == '__main__':
    mem_usage = memory_usage(stt)
    print('Memory usage (in chunks of .1 seconds): %s' % mem_usage)
    print('Maximum memory usage: %s' % max(mem_usage))
```
hhokari commented 3 years ago

I was just able to get DeepSpeech running; really cool!

snekiam commented 3 years ago

Ran on a Raspberry Pi 4b with 1 GB of RAM:

[screenshot: benchmark output]

It's possible that benchmarking slows things down, but we're very much CPU-bound on this, not RAM-bound. Ran on a ~10-second audio file, found here (preamble10.wav).

mfekadu commented 3 years ago

That's great, @hhokari!

Thanks for the analysis and sound file, @snekiam!

mfekadu commented 3 years ago

For some reason, that audio file (preamble10.wav) does not work nicely with my deepspeech executable on my Mac.

@snekiam

WAVE: RIFF header not found:

```
➜ native_client.amd64.cpu.osx ./deepspeech --model model/deepspeech-0.8.0-models.pbmm --audio audio/preamble10.wav -t
TensorFlow: v2.2.0-17-g0854bb5188
DeepSpeech: v0.8.0-0-gf56b07da
2020-08-09 17:45:05.920602: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
formats: can't open input file `audio/preamble10.wav': WAVE: RIFF header not found
Assertion failed: (input), function GetAudioBuffer, file client.cc, line 228.
[1]  39132 abort  ./deepspeech --model model/deepspeech-0.8.0-models.pbmm --audio -t
```
mfekadu commented 3 years ago

I realized that my audio file was corrupted during the download. I re-downloaded it and that fixed it.

New issue: I found some interesting mistakes (highlighted below) that occur on my CPU but not in your screenshot, @snekiam.

[screenshot: transcription output with the mistakes highlighted]
mfekadu commented 3 years ago

The screenshot above is also using the pbmm model (deepspeech-0.8.0-models.pbmm) rather than the TFLite model.

Here is a link to the docs about the pre-trained models.

snekiam commented 3 years ago

Some more interesting info on preamble10.wav, potentially to do with why it took so long to process:

[screenshot: file info for preamble10.wav]

22.05 kHz is potentially a higher sample rate than we're going to use. We might also want to consider a USB accelerator, like the Coral, if things don't perform well, but I'm not 100% convinced that we'll need it; DeepSpeech should be able to do real-time audio on a Pi 4b, according to Mozilla. I'm going to try a Pi 4b-specific compilation rather than the Pi 3b version I ran earlier.

snekiam commented 3 years ago

Some more interesting data: Mozilla claims DeepSpeech is real-time on the Pi 4, and we're not constrained by the 1 GB of memory on this specific Pi 4b. I wonder if we're limited by the SD card. I'll try a quick flash drive tomorrow; I'm kinda interested in trying to boot from USB rather than SD anyway.

[screenshot: timing output]

This shows the difference between the DeepSpeech-reported time and the system's: 2.164s for a 1.975s audio file is much better than the 22 seconds for the preamble file above! The above file has a lower bitrate and sampling rate, which might have an effect as well:

[screenshot: file info for the shorter clip]

Ideally, we'll want to process audio as it comes in rather than reading from disk anyway, which may make the read/write speed of our medium irrelevant.

chidiewenike commented 3 years ago

Audio data should be read from the audio stream buffer and stored in RAM. That is what we do for the wake word on NIMBUS and with the GCP STT API.
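
A hedged sketch of that approach using DeepSpeech's streaming API, feeding microphone chunks straight from the stream buffer; PyAudio, the chunk size, model filename, and the fixed capture window are all assumptions, not what NIMBUS currently uses:

```python
# Sketch only: stream mic audio into DeepSpeech without touching disk.
import numpy as np
import pyaudio
from deepspeech import Model

CHUNK = 1024    # frames per read from the audio buffer (placeholder)
RATE = 16000    # must match ds.sampleRate()
SECONDS = 5     # placeholder capture window

ds = Model("deepspeech-0.8.0-models.tflite")  # assumed model file

pa = pyaudio.PyAudio()
mic = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
              input=True, frames_per_buffer=CHUNK)

stream = ds.createStream()
for _ in range(int(RATE / CHUNK * SECONDS)):
    chunk = mic.read(CHUNK)
    # Feed raw PCM from RAM directly into the decoder
    stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))

print(stream.finishStream())  # final transcript

mic.stop_stream()
mic.close()
pa.terminate()
```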

chidiewenike commented 3 years ago

Based on their documentation, they seem to use 16 kHz, although the Baidu paper suggests that both 16 kHz and 8 kHz datasets were used. They seem to use SoX to resample their data. That process might add some time.
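
If shelling out to SoX turns out to be too slow, one hedged alternative is resampling the buffer in-process; scipy here is an assumption, not something the DeepSpeech client uses:

```python
# Sketch: resample a mono int16 buffer to the model's 16 kHz in memory,
# avoiding the SoX subprocess used by the client script above.
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(samples: np.ndarray, orig_rate: int) -> np.ndarray:
    """samples: int16 mono PCM at orig_rate; returns int16 PCM at 16 kHz."""
    resampled = resample_poly(samples.astype(np.float32), up=16000, down=orig_rate)
    return resampled.astype(np.int16)
```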