chidiewenike opened this issue 3 years ago
@hhokari and @Jason-Ku check out this fantastic open-source STT/TTS project by Mozilla:
They have releases here (pre-trained models), and it seems to be kept up to date, since the latest release was last month:
and their source data here:
I just set up and tried out DeepSpeech; it's pretty darn cool and pretty much works out of the box! Awesome find, Michael.
Some preliminary testing shows that the STT module was running with just shy of 12.8% of my computer's memory (16 GB), so we're looking at just over 2 GB of memory.
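For anyone who wants to reproduce a measurement like that from inside the script, here's a rough sketch using psutil (psutil is an assumption on my end, not something the DeepSpeech client pulls in):

```python
# rough sketch: sample the DeepSpeech process's own memory footprint
# (assumes `pip install psutil`; not part of the DeepSpeech client itself)
import psutil

proc = psutil.Process()  # the current process, i.e. the one running DeepSpeech
rss_mib = proc.memory_info().rss / 2**20
print(f"RSS: {rss_mib:.1f} MiB ({proc.memory_percent():.1f}% of system RAM)")
```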
Do you know what libraries are pulled in? Which model are you using? I remember there being a TFLite model as well, which is built for mobile apps and embedded systems.
We have up to 8 GB of memory, so it won't cause any serious issues, but it does increase the cost per device.
I'm using this model: https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4/deepspeech-0.7.4-models.pbmm
Just realized it's not the latest one (0.8.0), so I'll download that when my internet starts working again and give it a shot.
Not sure what all of the libraries being pulled in are.
See if you can work with the TFLite model. That is built to be a bit more lightweight.
I'll give that model a shot later tonight, @mfekadu! Or @hhokari can try that one out.
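For whoever tries it first: swapping models should just mean pointing the same Python API at the .tflite file. A minimal sketch, assuming the TFLite-enabled build of the deepspeech package is installed and the model/audio paths are placeholders:

```python
# minimal sketch: same inference API, TFLite graph instead of the .pbmm
# (assumes the TFLite-enabled deepspeech package; file paths are placeholders)
import wave
import numpy as np
from deepspeech import Model

ds = Model('deepspeech-0.8.0-models.tflite')

with wave.open('audio.wav', 'rb') as fin:  # expects 16-bit mono at ds.sampleRate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

print(ds.stt(audio))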
Here are some metrics from a sample usage of the tflite model:
how did copleston pacific ranch
Memory usage (in chunks of .1 seconds): [26.81640625, 26.8515625, 69.7421875, 75.79296875, 76.296875, 93.7890625, 96.40625, 98.66015625, 101.80078125, 102.08203125, 28.9921875]
Maximum memory usage: 102.08203125
Filename: audio.py
Line # Mem usage Increment Line Contents
================================================
92 26.852 MiB 26.852 MiB @profile
93 def stt():
94 26.855 MiB 0.004 MiB parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
95 26.855 MiB 0.000 MiB parser.add_argument('--model', required=True,
96 26.855 MiB 0.000 MiB help='Path to the model (protocol buffer binary file)')
97 26.855 MiB 0.000 MiB parser.add_argument('--scorer', required=False,
98 26.855 MiB 0.000 MiB help='Path to the external scorer file')
99 26.855 MiB 0.000 MiB parser.add_argument('--audio', required=True,
100 26.855 MiB 0.000 MiB help='Path to the audio file to run (WAV format)')
101 26.855 MiB 0.000 MiB parser.add_argument('--beam_width', type=int,
102 26.855 MiB 0.000 MiB help='Beam width for the CTC decoder')
103 26.855 MiB 0.000 MiB parser.add_argument('--lm_alpha', type=float,
104 26.859 MiB 0.004 MiB help='Language model weight (lm_alpha). If not specified, use default from the scorer package.')
105 26.859 MiB 0.000 MiB parser.add_argument('--lm_beta', type=float,
106 26.859 MiB 0.000 MiB help='Word insertion bonus (lm_beta). If not specified, use default from the scorer package.')
107 26.859 MiB 0.000 MiB parser.add_argument('--version', action=VersionAction,
108 26.859 MiB 0.000 MiB help='Print version and exits')
109 26.859 MiB 0.000 MiB parser.add_argument('--extended', required=False, action='store_true',
110 26.859 MiB 0.000 MiB help='Output string from extended metadata')
111 26.859 MiB 0.000 MiB parser.add_argument('--json', required=False, action='store_true',
112 26.859 MiB 0.000 MiB help='Output json from metadata with timestamp of each word')
113 26.859 MiB 0.000 MiB parser.add_argument('--candidate_transcripts', type=int, default=3,
114 26.859 MiB 0.000 MiB help='Number of candidate transcripts to include in JSON output')
115 26.863 MiB 0.004 MiB args = parser.parse_args()
116
117 26.863 MiB 0.000 MiB print('Loading model from file {}'.format(args.model), file=sys.stderr)
118 26.863 MiB 0.000 MiB model_load_start = timer()
119 # sphinx-doc: python_ref_model_start
120 27.707 MiB 0.844 MiB ds = Model(args.model)
121 # sphinx-doc: python_ref_model_stop
122 27.707 MiB 0.000 MiB model_load_end = timer() - model_load_start
123 27.711 MiB 0.004 MiB print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)
124
125 27.711 MiB 0.000 MiB if args.beam_width:
126 ds.setBeamWidth(args.beam_width)
127
128 27.715 MiB 0.004 MiB desired_sample_rate = ds.sampleRate()
129
130 27.715 MiB 0.000 MiB if args.scorer:
131 27.715 MiB 0.000 MiB print('Loading scorer from files {}'.format(args.scorer), file=sys.stderr)
132 27.715 MiB 0.000 MiB scorer_load_start = timer()
133 27.863 MiB 0.148 MiB ds.enableExternalScorer(args.scorer)
134 27.863 MiB 0.000 MiB scorer_load_end = timer() - scorer_load_start
135 27.863 MiB 0.000 MiB print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)
136
137 27.863 MiB 0.000 MiB if args.lm_alpha and args.lm_beta:
138 ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
139
140 27.863 MiB 0.000 MiB fin = wave.open(args.audio, 'rb')
141 27.863 MiB 0.000 MiB fs_orig = fin.getframerate()
142 27.863 MiB 0.000 MiB if fs_orig != desired_sample_rate:
143 27.863 MiB 0.000 MiB print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
144 28.145 MiB 0.281 MiB fs_new, audio = convert_samplerate(args.audio, desired_sample_rate)
145 else:
146 audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
147
148 28.145 MiB 0.000 MiB audio_length = fin.getnframes() * (1/fs_orig)
149 28.145 MiB 0.000 MiB fin.close()
150
151 28.145 MiB 0.000 MiB print('Running inference.', file=sys.stderr)
152 28.145 MiB 0.000 MiB inference_start = timer()
153 # sphinx-doc: python_ref_inference_start
154 28.145 MiB 0.000 MiB if args.extended:
155 print(metadata_to_string(ds.sttWithMetadata(audio, 1).transcripts[0]))
156 28.145 MiB 0.000 MiB elif args.json:
157 print(metadata_json_output(ds.sttWithMetadata(audio, args.candidate_transcripts)))
158 else:
159 102.641 MiB 74.496 MiB print(ds.stt(audio))
160
161 # sphinx-doc: python_ref_inference_stop
162 102.641 MiB 0.000 MiB inference_end = timer() - inference_start
163 102.641 MiB 0.000 MiB print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)
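For anyone wanting to reproduce these numbers: the per-0.1-second list and the line-by-line table above both look like memory_profiler output, so the wiring is roughly this (a sketch of the shape, not the script verbatim):

```python
# sketch of how a report like the one above is produced with memory_profiler
# (`pip install memory_profiler`; rough shape only, not the actual script)
from memory_profiler import memory_usage, profile

@profile            # emits the per-line "Mem usage / Increment" table
def stt():
    ...             # argument parsing, model load, and inference as shown above

# sample total process memory every 0.1 s while stt() runs
usage = memory_usage((stt, (), {}), interval=0.1)
print('Memory usage (in chunks of .1 seconds):', usage)
print('Maximum memory usage:', max(usage))
```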
Super cool, @Jason-Ku!
Perhaps we can make good use of the extra memory by fine-tuning the pre-trained model to ensure that domain-specific words will work (e.g. Cal Poly != Cow Police).
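One lighter-weight option than full fine-tuning: the external scorer is swappable, so a scorer built from our own transcripts could bias decoding toward words like "Cal Poly". A hedged sketch of the inference side; calpoly.scorer is a hypothetical file we'd have to build first with DeepSpeech's scorer-generation tooling (generate_lm.py etc.), and the alpha/beta values are just example knob settings:

```python
# sketch: bias decoding with a domain-specific scorer instead of retraining
# ("calpoly.scorer" is hypothetical; it would be built from our transcripts
# with DeepSpeech's language-model / scorer-package tooling)
import wave
import numpy as np
from deepspeech import Model

ds = Model('deepspeech-0.8.0-models.tflite')
ds.enableExternalScorer('calpoly.scorer')
ds.setScorerAlphaBeta(0.93, 1.18)  # lm_alpha / lm_beta; defaults vary by scorer

with wave.open('audio.wav', 'rb') as fin:
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
print(ds.stt(audio))
```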
Highlighted below is the data they used to train on:
what about this one?
native_client.rpi3.cpu.linux.tar.xz 940 KB
Not sure how to get this working. I unzipped it and there are no model files here, just a bunch of hex data.
Might need to compile it in C. Bummer: /usr/bin/ld: unknown architecture of input file `deepspeech' is incompatible with i386:x86-64 output
It's set up for ARM. I'll test it on a Raspberry Pi.
Running on the Pi, it looks like it needs SoX installed:
./deepspeech: error while loading shared libraries: libsox.so.3: cannot open shared object file: No such file or directory
@Jason-Ku's memory-profiling Python script
I was just able to get DeepSpeech running; really cool!
Ran on a Raspberry Pi 4B with 1 GB of RAM:
It's possible that the benchmarking slows things down, but we're very much CPU-bound on this, not RAM-bound. Ran on a ~10 second audio file, found here (preamble10.wav).
That's great, @hhokari!
Thanks for the analysis and the sound file, @snekiam!
For some reason that audio file (preamble10.wav) does not work nicely with my deepspeech executable on my Mac.
@snekiam I realized that my audio file was corrupted during the download. Re-downloading fixed it.
New issue: I found some interesting mistakes (highlighted below) that occur on my CPU but not in your screenshot, @snekiam.
The screenshot above is also using the pbmm model (deepspeech-0.8.0-models.pbmm) rather than the tflite model.
Here is a link to the docs about the pre-trained models
Some more interesting info on preamble10.wav, potentially related to why it took so long to process:
22.05 kHz is potentially a higher sample rate than we're going to use. We also might want to consider using a USB accelerator, like the Coral, if things don't perform well, but I'm not 100% convinced that we'll need it: according to Mozilla, DeepSpeech should be able to do real-time audio on a Pi 4B. I'm going to try a Pi 4B-specific compilation rather than the Pi 3B version I ran earlier.
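For anyone reproducing this, a quick standard-library way to check a clip's format before feeding it in (nothing here is DeepSpeech-specific):

```python
# quick check of a WAV file's format with the standard library
import wave

with wave.open('preamble10.wav', 'rb') as f:
    print(f'{f.getframerate()} Hz, {8 * f.getsampwidth()}-bit, '
          f'{f.getnchannels()} channel(s), '
          f'{f.getnframes() / f.getframerate():.2f} s')
```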
Some more interesting data: Mozilla claims DeepSpeech is real-time on the Pi 4, and we're not constrained by the 1 GB of memory on this specific Pi 4B. I wonder if we're limited by the SD card. I'll try a quick flash drive tomorrow; I'm kind of interested in trying to boot from USB rather than SD anyway.
This shows the difference between the DeepSpeech-reported time and the system's: 2.164 s for a 1.975 s audio file is much better than the 22 seconds for the preamble file above! This file also has a lower bitrate and sample rate than the preamble clip, which might have an effect as well:
Ideally, we'll want to process audio as it comes in rather than reading from disk anyway, which may make the read/write speed of our medium irrelevant.
Audio data should be read from the audio stream buffer and stored in RAM. That is what we do for the wake word on NIMBUS and with the GCP STT API.
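DeepSpeech exposes a streaming API that fits that model. Here's a hedged sketch of feeding it chunks from an in-memory mic buffer; the PyAudio capture parameters and the fixed chunk count are assumptions for illustration, not project decisions:

```python
# sketch: feed DeepSpeech from an in-memory stream instead of a file on disk
# (PyAudio parameters and the fixed loop count are placeholder assumptions)
import numpy as np
import pyaudio
from deepspeech import Model

ds = Model('deepspeech-0.8.0-models.tflite')
stream = ds.createStream()

pa = pyaudio.PyAudio()
mic = pa.open(format=pyaudio.paInt16, channels=1,
              rate=ds.sampleRate(), input=True, frames_per_buffer=1024)

for _ in range(100):  # ~6 s of audio at 16 kHz; the stop condition is up to us
    chunk = np.frombuffer(mic.read(1024), np.int16)
    stream.feedAudioContent(chunk)

print(stream.finishStream())  # final transcript
mic.stop_stream(); mic.close(); pa.terminate()
```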
Based on their documentation, they seem to use 16 kHz, although the Baidu paper suggests that both 16 kHz and 8 kHz datasets were used. They seem to use SoX to resample their data. That process might add some time.
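For reference, the client's convert_samplerate() helper (visible in the profiler trace above) appears to shell out to SoX, so resampling costs an extra subprocess per file. A rough approximation of that call; the exact flag set here is my guess, not copied from the client:

```python
# rough approximation of shelling out to SoX for resampling, as the client's
# convert_samplerate() does (flags are my guess, not copied verbatim)
import subprocess
import numpy as np

def resample(path: str, rate: int = 16000) -> np.ndarray:
    cmd = ['sox', path, '-t', 'raw', '-b', '16', '-e', 'signed-integer',
           '-r', str(rate), '-c', '1', '-']  # write raw 16-bit mono to stdout
    raw = subprocess.run(cmd, capture_output=True, check=True).stdout
    return np.frombuffer(raw, np.int16)
```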
Objective
Explore offline Speech-To-Text (STT) libraries that will convert raw audio bytes to a string.
Key Result
Create a function that will output a string from raw audio bytes input.
Details
The function will take, as input, raw audio bytes. The properties of the audio are TBD. The raw audio bytes are then converted to a string by an offline/local STT library. Beyond memory, the priority should be a library that allows custom speech adaptation. Speech adaptation will allow some sort of user input (a list of words, transcripts, etc.) to disambiguate uncommon words. A minimal sketch of this interface is shown below.
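A minimal sketch of that function, with DeepSpeech standing in as the example library; the 16-bit / 16 kHz / mono PCM assumption and the model path are placeholders, since the audio properties are still TBD:

```python
# minimal sketch of the key result: raw audio bytes in, transcript string out
# (16-bit / 16 kHz / mono PCM and the model path are placeholder assumptions)
import numpy as np
from deepspeech import Model

ds = Model('deepspeech-0.8.0-models.tflite')  # load once, reuse per request

def stt_from_bytes(raw_audio: bytes) -> str:
    audio = np.frombuffer(raw_audio, dtype=np.int16)
    return ds.stt(audio)
```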
When selecting the appropriate library, priorities are as follows: