espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

Is there a plan for Online Decoding? #324

Closed: Ina299 closed this issue 5 years ago

Ina299 commented 6 years ago

Hi, is it possible to use ESPnet for real-time ASR? I see that Kaldi can do online decoding using https://github.com/alumae/kaldi-gstreamer-server and
https://github.com/alumae/gst-kaldi-nnet2-online. I guess that if I modify the gst-kaldi-nnet2-online plugin appropriately, we could do online decoding with ESPnet. Do you have a plan for an online decoding function?

b-flo commented 6 years ago

Hi,

You don't really need the Kaldi GStreamer server and alumae's wrapper of the Nnet2 decoder to do online decoding with Kaldi models; for nnet2 models, online2-wav-nnet2-latgen-faster is all you need. As for chain models, use online2-wav-nnet3-latgen-faster, but you have to set the acoustic scale to 1.0 and the frame shift to 0.03 with the default settings.

I don't think you should use any of these binaries to build an online decoder for ESPnet, as they are designed for specific Kaldi-style models. But their source code could help you if you want to try online decoding using lattices and/or in the Kaldi style.

End-to-end approaches are not really designed for online decoding, in my opinion, but it can be done. One caveat: be careful with any batch-wise or global operations, such as CMVN, when training your models; they are difficult to compute for new data during online decoding!
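To make the CMVN caveat concrete: global CMVN needs statistics over a whole utterance or corpus, which are not available mid-stream. One common workaround is to accumulate running statistics as audio arrives. Below is a minimal sketch of that idea, assuming 80-dimensional features; the class and interface are illustrative, not ESPnet code.

```python
import numpy as np

class StreamingCMVN:
    """Running mean/variance normalization as a streaming stand-in
    for global CMVN (illustrative sketch, not ESPnet's implementation)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.sum = np.zeros(dim)
        self.sum_sq = np.zeros(dim)
        self.eps = eps

    def __call__(self, frames):
        # frames: (T, dim) chunk of feature frames from the stream
        self.count += frames.shape[0]
        self.sum += frames.sum(axis=0)
        self.sum_sq += (frames ** 2).sum(axis=0)
        mean = self.sum / self.count
        var = self.sum_sq / self.count - mean ** 2
        return (frames - mean) / np.sqrt(np.maximum(var, self.eps))

cmvn = StreamingCMVN(dim=80)
# normalized = cmvn(chunk)  # call once per incoming feature chunk
```

Early frames are normalized with poor statistics; seeding the accumulators with training-set statistics is a common mitigation.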

sw005320 commented 6 years ago

@b-flo is right that the ESPnet model (especially the BLSTM encoder and the attention mechanism) is not designed for online processing, and supporting it requires several research-level efforts. For now, we're focusing on improving performance and on more applications. Online decoding is not in our immediate scope, but we'll raise its priority given the requests.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Ina299 commented 6 years ago

Thank you for your reply. OK, I understand that ESPnet is not focusing on online decoding for now. I need GStreamer to use it from a WebSocket, and that works with a chain model. Maybe I need to write a decoder for ESPnet myself in both cases (online2-wav-nnet3-latgen-faster and GStreamer). At the research level, I guess we need to get rid of the autoregressive component, like Parallel WaveNet did for speech synthesis.

b-flo commented 6 years ago

> At the research level, I guess we need to get rid of the autoregressive component, like Parallel WaveNet did for speech synthesis.

I'm not familiar with Parallel WaveNet, so I can't say.

The main problem is that, for a standard attention-based model, decoding integrates all the encoder outputs h into the context via the attention mechanism, so you need to see the entire input before decoding. W. Chan & I. Lane showed an online approach using a sliding window when computing the context, defined as w_j = {h_{m_j - p}, ..., h_{m_j + q}}, with m_j the median of the previous alignment and p, q the hyperparameters controlling the window size. However, it can obviously only work with a unidirectional RNN, which is detrimental for the attention mechanism compared to a bidirectional RNN, as demonstrated in several papers (e.g., [1]). Furthermore, as stated by the authors, this method has difficulty predicting the next token when there is a gap or silence of more than q frames between characters.
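For concreteness, here is a minimal sketch of that windowed context computation; score_fn, p, and q are assumed names for illustration, not ESPnet code or the paper's code.

```python
import torch

def windowed_context(enc, prev_align, score_fn, p, q):
    """Attention context restricted to a window around the median of
    the previous alignment (sliding-window sketch, illustrative only).

    enc:        (T, D) encoder outputs h_1 .. h_T
    prev_align: (T,) previous attention weights (sums to 1)
    score_fn:   maps a window of frames to unnormalized energies
    """
    # m_j: median of the previous alignment, i.e. first frame with CDF >= 0.5
    cdf = torch.cumsum(prev_align, dim=0)
    m = int((cdf >= 0.5).nonzero()[0])
    # w_j = {h_{m-p}, ..., h_{m+q}}: attention only sees this slice
    lo, hi = max(0, m - p), min(enc.size(0), m + q + 1)
    window = enc[lo:hi]
    weights = torch.softmax(score_fn(window), dim=0)
    # weighted sum over the window yields the context vector
    return (weights.unsqueeze(-1) * window).sum(dim=0)
```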

> Maybe I need to write a decoder for ESPnet myself in both cases (online2-wav-nnet3-latgen-faster and GStreamer).

I don't really think you should focus on online2-wav-nnet[2,3]-latgen-faster to begin with. The main bridge between Kaldi and Kaldi's GStreamer server is the wrapper around SingleUtterance[Gmm, Nnet2, Nnet3]Decoder, which is model-dependent. Leaving aside the online model and decoder themselves, it would be more logical to write a similar wrapper for ESPnet models from scratch.
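To sketch what such a wrapper might look like, here is a hypothetical SingleUtterance*-style interface around an ESPnet model. Every name in it (the incremental feature extractor, encode_chunk, the beam-search step API) is an assumption for illustration, not an existing ESPnet API.

```python
class SingleUtteranceESPnetDecoder:
    """Hypothetical streaming wrapper mirroring Kaldi's
    SingleUtterance*Decoder interface (illustrative sketch)."""

    def __init__(self, feature_extractor, model, beam_search):
        self.fe = feature_extractor  # incremental feature computation
        self.model = model           # must support chunk-wise encoding
        self.bs = beam_search        # keeps partial hypotheses across calls

    def accept_waveform(self, samples):
        """Feed a new chunk of audio and advance decoding."""
        feats = self.fe.accept(samples)       # newly completed frames only
        enc = self.model.encode_chunk(feats)  # unidirectional encoder step
        self.bs.step(enc)                     # extend partial hypotheses

    def partial_result(self):
        return self.bs.best_hypothesis(final=False)

    def finalize(self):
        self.fe.flush()  # emit any buffered frames
        return self.bs.best_hypothesis(final=True)
```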

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue is closed. Please re-open if needed.

ahmedalbahnasawy commented 4 years ago

@b-flo Is there a plan for online decoding now? The decoding stage with an RNN-T model is very slow; I tried beam search and greedy search and both of them are slow. I used 1 GPU for decoding, but the GPU utilization is around 4%. Which model do you think could be fast, support online decoding, and stay accurate? Thanks.

b-flo commented 4 years ago

Hi,

It was delayed because of some other projects, but I started working on the streaming decoder for RNN-T yesterday. I'll implement several architectures and techniques before proposing a version to ESPnet, so it won't be available for several weeks.

About GPU decoding, I don't use it, and it was disabled until recently (for a reason I can't recall, and I still can't access my development notes. Thanks, HP). Maybe @rai4 can comment on that part!

rai4 commented 4 years ago

@ahmedalbahnasawy Hi,

> The decoding stage with an RNN-T model is very slow; I tried beam search and greedy search and both of them are slow. I used 1 GPU for decoding, but the GPU utilization is around 4%.

I also think that greedy search and beam search are slow, but we need exact numbers. If possible, can you tell me how long it takes? For example, we can talk about the real-time factor (RTF); see this paper, which you have probably already seen: https://arxiv.org/pdf/1811.06621.pdf. As a side note from my results, greedy search is much faster on GPU, and beam search is faster on CPU. With greedy search, the real-time factor on CPU is less than 0.7 (Korean, word-piece units).
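For reference, RTF is decoding time divided by audio duration, so RTF < 1 means faster than real time. A minimal helper for measuring it; decode_fn is a placeholder for whichever recognizer call is being benchmarked, not an ESPnet function.

```python
import time

def real_time_factor(decode_fn, waveform, sample_rate=16000):
    """RTF = decoding time / audio duration (RTF < 1: faster than
    real time). decode_fn is a placeholder, not an ESPnet API."""
    start = time.perf_counter()
    decode_fn(waveform)
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)
```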

ahmedalbahnasawy commented 4 years ago

@rai4 Hi, sorry for the delay. I tested multiple audio files with beam search and greedy search decoding using an RNN-T model. Greedy search is faster on GPU, as you mentioned: the RTF after loading the model once (all audio files are shorter than 8 seconds) is 0.5. For beam search on CPU, it took more than 5 minutes to decode 8 seconds of audio. @b-flo @rai4 @sw005320 I would like to share this paper: https://arxiv.org/abs/1909.08723. They claim that ESPRESSO decodes 4-11x faster than ESPnet.

b-flo commented 4 years ago

Hi,

> For beam search on CPU, it took more than 5 minutes to decode 8 seconds of audio.

I didn't observe such a high RTF in my experiments (more than 5 minutes for 8 seconds of audio is an RTF above 37); however, you should be aware that the current RNN-T implementation was not designed to be fast. Beyond architecture choices and optimization, several parts will be redesigned for streaming/online decoding purposes.

sw005320 commented 4 years ago

I, @ShigekiKarita, and @takaaki-hori are now working on faster decoding. Actually, @takaaki-hori's batch decoding (https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/rnn/decoders.py#L497) is quite fast (e.g., RTF is 0.1), but it only supports the old beam search (API v1) and RNN. We're now trying to:
1) implement @takaaki-hori's batch decoding on top of API v2,
2) support both RNN and Transformer,
3) (pending some discussion with @b-flo) follow similar steps for RNN-T.

Of course, 3) is challenging since RNN/Transformer decoding is label-synchronous while RNN-T decoding is input-synchronous, but I think we can manage it. With this speed-up, we could easily extend it to online streaming decoding. @ahmedalbahnasawy, if you're interested in the project, please let me know; we'd welcome some help with it.
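The core idea behind batch decoding is to score all active beam hypotheses in one forward pass instead of looping over them in Python. A minimal sketch of one such step; the step-wise decoder signature is assumed here, not ESPnet's actual API.

```python
def batched_beam_step(decoder, states, prev_tokens, beam_size):
    """One label-synchronous beam-search step scoring every active
    hypothesis in a single batched call (illustrative sketch; the
    decoder(prev_tokens, states) signature is assumed).

    states:      batched decoder states for the current hypotheses
    prev_tokens: (n_hyp,) last emitted token of each hypothesis
    """
    # One forward pass for all hypotheses instead of a Python loop.
    logp, new_states = decoder(prev_tokens, states)  # (n_hyp, vocab) log-probs
    # Per-hypothesis top-k candidates; pruning across hypotheses follows.
    scores, tokens = logp.topk(beam_size, dim=-1)
    return scores, tokens, new_states
```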

b-flo commented 4 years ago

Hi,

Help on batch decoding for RNN-T would be appreciated, notably to implement it in the manner @sw005320 described (extending it to a standard beam search would be quite straightforward in contrast). Right now it's not a priority on my to-do list; I'm focused on optimizing/refining the current implementation, incorporating new features, and enabling streaming/online recognition on low-resource devices.