flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

PSA: Realtime audio frontend demo for macOS #327

Closed lunixbochs closed 4 years ago

lunixbochs commented 5 years ago

This is a working-out-of-the-box demo for realtime speech recognition on macOS with wav2letter++.

This is based on my C API in https://github.com/facebookresearch/wav2letter/issues/326. There's a src dir in the w2l_cli tarball with the frontend source (w2l_cli.cpp) and scripts/instructions for building it all from scratch.

to install:

wget https://talonvoice.com/research/w2l_cli.tar.gz
tar -xf w2l_cli.tar.gz && rm w2l_cli.tar.gz
cd w2l_cli
wget https://talonvoice.com/research/epoch186-ls3_14.tar.gz
tar -xf epoch186-ls3_14.tar.gz && rm epoch186-ls3_14.tar.gz

to run: ./bin/w2l emit epoch186-ls3_14/model.bin epoch186-ls3_14/tokens.txt

Then speak, and you should see emissions (letter predictions) in the terminal output after each utterance (the | token marks word boundaries), for example:

$ ./bin/w2l emit epoch186-ls3_14/model.bin epoch186-ls3_14/tokens.txt 
helow|world
this|is|a|test|of|wave|to|leter

Language model decoding is also wired up via ./bin/w2l decode am tokens lm lexicon, but as per #326 it segfaults right now when setting up the Trie.

There are more pretrained English acoustic models at https://talonvoice.com/research/ that you can try as well.

timdoug commented 5 years ago

The w2l binary points to a dynamic library that isn't present:

$ ./bin/w2l 
dyld: Library not loaded: @rpath/libclang_rt.asan_osx_dynamic.dylib
  Referenced from: .../bin/w2l
  Reason: image not found
Abort trap: 6
$ otool -L ./bin/w2l | grep rpath
    @rpath/libaf.3.dylib (compatibility version 3.0.0, current version 3.6.2)
    @rpath/libmkldnn.0.dylib (compatibility version 0.0.0, current version 0.18.1)
    @rpath/libiomp5.dylib (compatibility version 5.0.0, current version 5.0.0)
$ ls bin/
libaf.3.dylib     libafcpu.3.dylib  libiomp5.dylib    libmkldnn.0.dylib libmklml.dylib    w2l
$

Fixed with install_name_tool:

$ install_name_tool -change @rpath/libclang_rt.asan_osx_dynamic.dylib /Library/Developer/CommandLineTools/usr/lib/clang/10.0.1/lib/darwin/libclang_rt.asan_osx_dynamic.dylib bin/w2l 
$ ./bin/w2l 
Usage: ./bin/w2l emit   <acoustic model> <tokens.txt>
Usage: ./bin/w2l decode <acoustic model> <tokens.txt> <language model> <lexicon>
$

This is on 10.14.5 with the command-line dev tools installed, not full Xcode; with full Xcode or different tool versions the library path may differ.

lunixbochs commented 5 years ago

Yep, I accidentally released the version compiled with -fsanitize=address because I was trying to debug the segfault.

I’ll just upload one without that when I’m at my computer next.

lunixbochs commented 5 years ago

OK, I uploaded a new version at the same URL that wasn't compiled with -fsanitize=address, and with a small improvement to capture more audio at the start of an utterance.

cogmeta commented 5 years ago

Will it build successfully on Ubuntu?

lunixbochs commented 5 years ago

No, it uses a macOS framework for audio capture.

cogmeta commented 5 years ago

As a favor, can you please provide an example that takes input from a WAV file (and doesn't use any of the macOS audio capture code) and prints the result? I was able to build everything, including libw2l.a, on Ubuntu 16.04, but I'm now stuck on a sample example.

lunixbochs commented 5 years ago

No, wav2letter already has code for loading an audio file in the right format using libsndfile in the featurization path, and the Test/Decode binaries can already run against sound files if you put them in the right dataset format.
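
For reference, a dataset list file is plain text, one sample per line. As a rough sketch (the paths are placeholders and the column details vary by version, so check the wiki, but there's a sample id, an audio path, a duration, and the transcription):

sample001 /path/to/sample001.wav 2500 hello world
sample002 /path/to/sample002.wav 3100 this is a test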

cogmeta commented 5 years ago

Thanks, I did try ./Decoder but it shows no results.

./Decoder -test ./data/ -am ../../../epoch186-ls3_14/model.bin -lm ../../wav2letter/src/decoder/test/lm.arpa -showletters -show
Loading the LM will be faster if you build a binary file.
Reading ../../wav2letter/src/decoder/test/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
|T|:
|P|:
|t|: my|internet|is|not|working|man
|p|:
[sample: 0, WER: 100%, LER: 100%, slice WER: 100%, slice LER: 100%, progress: 100%]

lunixbochs commented 5 years ago

I think you should open a new issue for that. It's possible your sound file is in the wrong format. Make sure it's 16-bit, 16 kHz (16000 Hz), 1 channel.
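
For example, to check and convert with sox (a sketch; sox isn't part of wav2letter, and the filenames are placeholders):

soxi input.wav                             # inspect rate/bits/channels
sox input.wav -b 16 -r 16000 -c 1 out.wav  # convert to 16-bit, 16 kHz, mono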

maiasmith commented 5 years ago

FYI

wget https://talonvoice.com/research/w2l_cli.tar.gz
...
ERROR: cannot verify talonvoice.com's certificate, issued by ‘CN=Let's Encrypt Authority X3,O=Let's Encrypt,C=US’:
  Issued certificate has expired.

lunixbochs commented 5 years ago

Thanks, the cron job to restart nginx on cert auto renewal wasn’t working. Fixed.

sapphire-arches commented 5 years ago

I hacked together a Linux port of this here. If you don't want to lose your mind trying to get it to build, I recommend using the provided Nix derivations. I had to make several changes to the upstream w2l C API to get it working against the latest wav2letter, which are captured in this fork of wav2letter. It "works" in the sense that it'll pass data to and from the Wav2Letter system, though using epoch186, kindly provided by @lunixbochs, results in somewhat underwhelming performance. That might be due to my sketchy normalization, bad microphone, and weak grasp of many of the deep technical details involved in actually deploying an accurate speech-to-text solution.

lunixbochs commented 5 years ago

Oh, sorry for the duplicate work. I ported w2l.cpp to the newer wav2letter APIs here: https://github.com/talonvoice/wav2letter/commits/w2lapi-mac

See if the emission output is a bit better for you. I haven’t tuned the decoder output of w2l_cli because I hadn’t patched the crash yet.

lunixbochs commented 5 years ago

What are you using for lexicon and language model? That has a huge impact on decoder quality.

sapphire-arches commented 5 years ago

I'm using the KenLM Wikipedia model from Talon Research, applied to the word-based decoder. What is the expected range of values for the input vector?

lunixbochs commented 5 years ago

Try the emit example. You can copy the updated emit code from the branch I linked. If saying "hello world" slowly and clearly doesn't result in something remotely like "helow|world" or "helo|world", your audio input pipe is probably the culprit.

lunixbochs commented 5 years ago

You shouldn't be "normalizing" at all. You should just divide by INT16_MAX with no fabs() to directly convert between the two ranges.

PCM signed int16 has a range from -32768 to 32767. PCM float is a signed floating-point number from -1.0 to 1.0. The mapping between the two ranges is linear.

Also, if you can ask Pulse for floating-point samples, that's even better: you only need to multiply by INT16_MAX for FVAD, and you can carry the correct wav2letter format through the whole pipeline without modifying it yourself.
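
For illustration, a minimal sketch of that conversion in C++ (the function name is hypothetical, not part of w2l_cli):

#include <cstddef>
#include <cstdint>
#include <vector>

// Linear int16 -> float PCM conversion: divide by INT16_MAX, no fabs().
std::vector<float> pcmS16ToFloat(const int16_t *samples, size_t n) {
    std::vector<float> out(n);
    for (size_t i = 0; i < n; i++) {
        // maps [-32768, 32767] onto approximately [-1.0, 1.0]
        out[i] = samples[i] / float(INT16_MAX);
    }
    return out;
}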

sapphire-arches commented 5 years ago

Dividing by INT16_MAX and using the w2l-mac branch (with some minor tweaks to the CMakeLists.txt to get it building in my environment) seems to be working well -- the letter decoder is very accurate now. Trying to use the word decoder crashes for some reason I have yet to fully debug. I'll push my fixes shortly if other people want to try it out on Linux. Thanks for the help @lunixbochs =)

lunixbochs commented 5 years ago

That’s great! Are you using w2l.cpp unmodified from that branch? It crashes for me on decoding too after hanging for a while. We aren’t setting any of the decoding flags, and I know they are important.

cogmeta commented 5 years ago

@bobtwinkles thank you.

sapphire-arches commented 5 years ago

It wasn't quite unmodified; you can find my changes in this branch. I'm not convinced it's working correctly though, as it seems like the beam search (I assume that's what runs after the emission -> letter conversion?) isn't modifying its inputs at all. Sorry about the slow responses, life has gotten in the way a bit.

jdunruh commented 5 years ago

Interesting demo.

lunixbochs commented 5 years ago

For anyone who's been playing with this, I've uploaded a new acoustic model, trained on around 3000 hours of audio (compared to my previous best models, which were LibriSpeech-only, i.e. ~960 hours): 400MB - Epoch 125 at https://talonvoice.com/research/

With these decoding parameters, it beats even my 1.6GB TER 2.64 LibriSpeech model at decoding my real-world user speech test set:

-lmweight 1.50 -wordscore 3.060 -beamsize 483 -beamthreshold 25
-silweight .170 -smearing max
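
For example, a hypothetical full invocation with those parameters (the paths and exact flag names are assumptions; check them against your wav2letter build):

./Decoder -test ./data/ -am model.bin -tokens tokens.txt -lexicon lexicon.txt -lm lm.bin -lmweight 1.50 -wordscore 3.060 -beamsize 483 -beamthreshold 25 -silweight .170 -smearing max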

Please give it a try and let me know how it works for you.

sapphire-arches commented 5 years ago

That model seems to be noticeably better, at least in the short amount of time I've played with it. Thanks @lunixbochs.

vunder-kind commented 5 years ago

Thanks @lunixbochs for providing ready acoustic models; they are very useful and work perfectly! I am also working on training models for Russian. In my current tests on one Tesla V100 GPU I get a throughput of 40-60 sec/sec, the same throughput you mentioned in issue #259. At that rate it should take 3000 hours * 125 epochs / (40 sec/sec) = 9375 hours = 390 days to train your latest model, for example. That looks very long to me. Could you tell me if you found a way to speed up training? Also, how many GPUs are you training on, and what approximate throughput do you get at the moment?

Thank you!

lunixbochs commented 5 years ago

I’m training a smaller model (400MB instead of 1.6GB), on two 32GB V100s, with a batchsize of 64. My current throughput is around 670 sec/sec, and it was even higher with a smaller dataset (~750 s/s).

I also run the 1.6GB model with a batchsize of 16, which gives me around 160 sec/sec with librispeech.

lunixbochs commented 5 years ago

Also, while I don’t have the resources to train non-English models myself yet, if anyone wants to donate models with a liberal license (CC-0 or something), I’m happy to host them and attribute you on the page.

vineelpratap commented 5 years ago

@lunixbochs - Great work! I was looking at https://github.com/talonvoice/wav2letter/commit/6671ad5f8311af99316b72f13a6f583e82d2ecbd - I don't see any code that takes care of padding, which could be tricky for realtime inference. I'm wondering how you handle padding. Did you train the models with "SAME" convolutions, or did you put all the padding at the initial layer?

lunixbochs commented 5 years ago

As I mentioned in another issue, I’m not doing block streaming. I use a voice activity detector to group an entire utterance’s audio together, then feed that into the network in one chunk. The output of this is the same as in training. I do nothing with the padding frames. Do you suggest I trim them?

I find that, because of the parallel convolution, it's faster overall to run the network on all of the audio in an utterance at once, rather than streaming, say, 500ms blocks into the network at a time.

For example, the forward pass on my network takes around 90ms on CPU on a 6-core Mac Mini for a typical utterance. This is much lower "final" latency than waiting on average 250ms for the next 500ms block to get put into the network to get results. So I don't get streaming hypotheses, but the final latency (time between end of speech and fully decoded text) is very good. I think if I wanted to show hypotheses I would use a very inaccurate but very fast (like 20% WER) network for the first pass rather than trying to stream against the high-quality network.
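
Roughly, the utterance grouping looks like this (a sketch assuming the libfvad API; readFrame and emitUtterance are hypothetical stand-ins for the audio capture and inference calls):

#include <fvad.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: pull one 10ms frame from the mic, and run the
// network + decoder on one complete utterance.
bool readFrame(int16_t *frame, size_t n);
void emitUtterance(const std::vector<float> &samples);

void vadLoop() {
    const size_t kFrame = 160;          // 10ms at 16kHz
    const int kEndSilenceFrames = 50;   // ~500ms of silence ends an utterance
    Fvad *vad = fvad_new();
    fvad_set_sample_rate(vad, 16000);
    fvad_set_mode(vad, 2);              // aggressiveness 0..3

    std::vector<float> utterance;
    bool inSpeech = false;
    int silentFrames = 0;
    int16_t frame[kFrame];
    while (readFrame(frame, kFrame)) {
        bool voiced = fvad_process(vad, frame, kFrame) == 1;
        if (voiced || inSpeech) {
            // store in wav2letter's float format as we go
            for (size_t i = 0; i < kFrame; i++)
                utterance.push_back(frame[i] / float(INT16_MAX));
        }
        if (voiced) {
            inSpeech = true;
            silentFrames = 0;
        } else if (inSpeech && ++silentFrames >= kEndSilenceFrames) {
            emitUtterance(utterance);   // one chunk through the network
            utterance.clear();
            inSpeech = false;
            silentFrames = 0;
        }
    }
    fvad_free(vad);
}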

vineelpratap commented 5 years ago

> I’m not doing block streaming. I use a voice activity detector to group an entire utterance’s audio together, then feed that into the network in one chunk.

That makes sense. Thanks for the clarification!

> the forward pass on my network takes around 90ms on CPU on a 6-core Mac Mini for a typical utterance. This is much lower “final” latency than waiting on average 250ms for the next 500ms block to get put into the network to get results.

As long as it works for your use case, it's fine. Although, note that by doing this you are not using the LM as one would like, since you are not using any information from the previous state - https://github.com/talonvoice/wav2letter/commit/6671ad5f8311af99316b72f13a6f583e82d2ecbd#diff-1720a2fc67091af436b08b1a16f44d96R175. You might also want to look at the decodeContinue function so that you can use the LMState from the previous hypothesis.

lunixbochs commented 5 years ago

I’m providing input to a computer, so I can just feed the LM from the user’s cursor context where appropriate.

yasha02 commented 4 years ago

I get Segmentation fault: 11 when I try running "./bin/w2l emit epoch186-ls3_14/model.bin epoch186-ls3_14/tokens.txt".

jacobkahn commented 4 years ago

@yasha02 — I'd also encourage you to take a look at the recently-released inference framework to see if that meets some of your needs.

Mic92 commented 4 years ago

I upgraded the linux port w2l-linux: https://github.com/bobtwinkles/w2l-linux/pull/2

I downloaded a lexicon and language model from here: https://talonvoice.com/research/enwiki-4gram-2019-04.tar.gz, as well as the token.txt from here: e127-specaug-cb5_89-cv12_63-lsc3_00-lso9_28-sc2_16-tal9_41-tat0_45-ted8_42-tts2_12.bin

./w2l_cli
hello, world
Usage:
    ./w2l_cli <am.bin> <tokens.txt> <lm.bin> <lexicon.bin>

However, the command-line interface wants both a token.txt and an acoustic model (am.bin). Where do I get the acoustic model?

Update: OK, I figured out that some models are .tar.gz archives while others only provide a .bin. Why is this the case? How is a model without a .txt file usable? Can I just use any other token.txt?

Update 2: It seems to accept other token.txt files too.

jacobkahn commented 4 years ago

@Mic92 — this is really cool. Thanks for sharing!

In general, we provide tokens files in the recipes where we also provide model downloads. If there are models for which you can't find the corresponding tokens file, let us know. One should be careful to use the tokens/lexicon file that was used to train the model at inference time.
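
For illustration, here's the rough shape of these files for a letter-based model (a sketch; consult the recipe that shipped your model for the exact contents). tokens.txt lists one token per line:

|
'
a
b
...

And the lexicon maps each word to its token spelling:

hello h e l l o |
world w o r l d |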

Looks like you resolved most of those things - are you still having any issues we can help with?

Mic92 commented 4 years ago

> @Mic92 — this is really cool. Thanks for sharing!
>
> In general, we provide tokens files in the recipes where we also provide model downloads. If there are models for which you can't find the corresponding tokens file, let us know. One should be careful to use the tokens/lexicon file that was used to train the model at inference time.
>
> Looks like you resolved most of those things - are you still having any issues we can help with?

No, it was working after that. But it couldn't really understand me. I think LibriSpeech recordings worked.

MiroFurtado commented 4 years ago

As best as I can tell, wav2letter cannot be compiled on macOS anymore. Does that mean this demo can't be used?

jacobkahn commented 4 years ago

@MiroFurtado - the inference pipeline can be built on macOS and doesn't have any external dependencies - it should run. Let us know if there are any issues.

MiroFurtado commented 4 years ago

> @MiroFurtado - the inference pipeline can be built on macOS and doesn't have any external dependencies - it should run. Let us know if there are any issues.

But this demo requires flashlight to build, flashlight can't be built without gloo, and gloo can't be built on macOS? Maybe I'm missing something - but this is the only realtime decoding demo I had seen.

tlikhomanenko commented 4 years ago

@MiroFurtado

Regarding dependencies for the inference pipeline, you can simply check the Dockerfile with all dependencies: https://github.com/facebookresearch/wav2letter/blob/master/Dockerfile-Inference-Base. There is no flashlight or ArrayFire there; we made the inference pipeline standalone from flashlight.
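
For example, building an inference image from that Dockerfile (a sketch; the tag name is arbitrary, and this assumes you run it from the wav2letter repo root):

docker build -f Dockerfile-Inference-Base -t w2l-inference .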

MiroFurtado commented 4 years ago

@tlikhomanenko OK - I'll give building off of the inference pipeline a look.

My comments were merely about this specific audio demo, which, since it requires all of wav2letter to build, can't be built on macOS anymore because of the gloo requirement.

ariefsaferman commented 2 years ago

Can I run it on Linux 18.04? I really need it on Linux. Do you have a demo for Linux?

ariefsaferman commented 2 years ago

> Can I run it on Linux 18.04? I really need it on Linux. Do you have a demo for Linux?

Especially in Python.