flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Online decoder frame by frame #274

Closed dambaquyen96 closed 5 years ago

dambaquyen96 commented 5 years ago

I'm currently working on streaming decoding, but I got stuck when feeding the data in small chunks: it returns a very different result compared with decoding the full audio. In eda528d, the commit message mentions a module named "ASR Decoder wrapper (wav2letter/src/fb)", with a unit test "Make sure we have same results by feeding emissions frame by frame". But I can't find any module like that in the source code. Has the code been removed? How can I feed audio frame by frame and get the correct sentence?

xuqiantong commented 5 years ago

Hi @dambaquyen96, as you can see the code is in the src/fb folder, which means it is used internally at Facebook and cannot be open-sourced with the rest of the w2l code. But I will try to see if I can bring the tests outside.

Btw, just to make it clear, "feeding frame by frame" means we need to feed the emission matrix frame by frame to the decoder rather than the raw audio.
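To illustrate the difference, here is a minimal sketch (with a hypothetical decoder object; `decode_begin`, `decode_step`, `decode_end`, and `get_best_hypothesis` are illustrative names, not the actual wav2letter API):

```python
# Minimal sketch of offline vs. frame-by-frame decoding. `decoder` is a
# hypothetical object that keeps its beam state between calls; `emissions`
# is the [T x N] matrix produced by the acoustic model, not raw audio.

def decode_offline(decoder, emissions):
    decoder.decode_begin()
    decoder.decode_step(emissions)                   # all T frames at once
    decoder.decode_end()
    return decoder.get_best_hypothesis()

def decode_streaming(decoder, emissions, chunk=10):
    decoder.decode_begin()
    for t in range(0, len(emissions), chunk):
        decoder.decode_step(emissions[t:t + chunk])  # a few frames at a time
    decoder.decode_end()
    return decoder.get_best_hypothesis()

# If the decoder is stateful across decode_step calls, both functions should
# return the same hypothesis -- which is what such a unit test would check.
```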

lunixbochs commented 5 years ago

I have streaming recognition working against a live mic using a VAD, and talked about it briefly here: https://github.com/facebookresearch/wav2letter/issues/266

@trishume shared with me a brief analysis drawn from the network architecture in the librispeech conv_glu recipe and the wav2letter whitepapers:

the receptive field of the wav2letter net in the repo is 170 frames on each side; at a stride of 10ms between frames, that's 1700ms of past and future context

since they keep increasing the number of channels and the kernel size in each layer, with no pooling or strides, most of the computational work is at the top of the network, which is where streaming can help the least because of the amount of future context needed

He also said the "low-dropout" network described in https://arxiv.org/pdf/1712.09444.pdf has two fewer layers, which reduces the context by 280ms on each side. I assume these acoustic models were selected by the Facebook Research team for their overall accuracy when decoding entire sentences, not for their ability to decode very short utterances at low latency without context. You might need a different architecture for that.
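For reference, the arithmetic behind those numbers is simple (a sketch: a 10ms hop between emission frames is assumed, and the kernel widths below are placeholders, not the exact conv_glu recipe):

```python
# Back-of-the-envelope context arithmetic for the numbers quoted above.

HOP_MS = 10  # stride between emission frames

def one_sided_context_frames(kernel_sizes):
    # For a stack of unit-stride 1-D convolutions, each layer with kernel
    # width k adds (k - 1) / 2 frames of past and future context.
    return sum((k - 1) // 2 for k in kernel_sizes)

# The repo network works out to ~170 frames of context per side:
print(170 * HOP_MS)                                   # -> 1700 ms on each side

# Two fewer layers saving 280 ms per side means 28 fewer frames, e.g.
# two layers of kernel width 29 (purely illustrative widths):
print(one_sided_context_frames([29, 29]) * HOP_MS)    # -> 280 ms
```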

The latency when doing CPU inference on my 2015 MacBook is around 650ms with my VAD approach, which honestly feels fine for speaking a sentence at a time, especially because my VAD will trigger multiple times during the sentence and still get reasonable emissions.
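The loop is roughly this shape (a simplified sketch assuming the `webrtcvad` package and 16kHz 16-bit mono audio; `run_acoustic_model` and `run_decoder` are placeholders for wherever the w2l forward pass and decoder live, not real API):

```python
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                     # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

vad = webrtcvad.Vad(2)                            # aggressiveness 0 (loose) to 3 (strict)

def stream_decode(mic_frames, run_acoustic_model, run_decoder, max_silence=10):
    """Buffer speech between pauses, then decode each chunk as an utterance."""
    buffered, silent = [], 0
    for frame in mic_frames:                      # each frame is FRAME_BYTES of PCM
        if vad.is_speech(frame, SAMPLE_RATE):
            buffered.append(frame)
            silent = 0
        elif buffered:
            silent += 1
            if silent >= max_silence:             # ~300 ms of silence ends the utterance
                emissions = run_acoustic_model(b"".join(buffered))
                yield run_decoder(emissions)
                buffered, silent = [], 0
```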

One data point: I tested a librispeech network where each of the layers was 25% smaller. I only got test-TER-clean down to around 5% (versus the 2.64% I was able to get with the recipe's high-dropout network), but it was 2.5x faster at inference on my CPU: about 250ms to run network->forward instead of 650ms.

I have two plans in mind for faster streaming recognition:

  1. Continue using a VAD to pre-chunk audio, but use two separate trained models on the backend (roughly sketched after this list).
    • The first model would be much smaller and possibly trained slightly differently, and I would treat the output as a hypothesis to show the user with lower latency.
    • The second model would be the fully trained, more expensive model, and I would run it less frequently.
  2. Stick with one large model, but compress/quantize it (https://github.com/facebookresearch/wav2letter/issues/267) to make it smaller/faster. Possibly shrink the original network arch slightly as a compromise.
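Plan 1 would look roughly like this (a hypothetical sketch; `fast_model`, `big_model`, and `decode` are placeholders, not anything in w2l, and the refresh interval is arbitrary):

```python
import time

def two_tier_decode(utterance_chunks, fast_model, big_model, decode, refine_every_s=2.0):
    """Cheap provisional hypotheses at low latency, expensive refinements less often."""
    audio = b""
    last_refine = time.monotonic()
    for chunk in utterance_chunks:                       # e.g. VAD-gated audio from the mic
        audio += chunk
        yield ("partial", decode(fast_model(audio)))     # small model: show the user quickly
        now = time.monotonic()
        if now - last_refine >= refine_every_s:          # big model: run less frequently
            yield ("refined", decode(big_model(audio)))
            last_refine = now
```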