burchim / EfficientConformer

[ASRU 2021] Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition
https://arxiv.org/abs/2109.01163
Apache License 2.0

Can this architecture support streaming ASR? #12

Closed kafan1986 closed 2 years ago

kafan1986 commented 2 years ago

Hello @burchim

What changes are required to support streaming inference?

harisgulzar1 commented 1 year ago

@kafan1986 did you mean whether the CTC decoder supports streaming ASR? I have been working with the Conformer-CTC decoder and want to ask whether it is possible to do streaming ASR with a CTC decoder. @burchim, I would appreciate your comments on this.

burchim commented 1 year ago

Hi @harisgulzar1,

Streaming ASR is possible using CTC. There is currently no implementation of a streaming decoding function, but it is absolutely possible!

Decoding can be performed chunk by chunk, using the CTC encoder in a convolutional manner with the context size and step size (in audio frames) given as hyper-parameters.
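A minimal sketch of this chunk-by-chunk scheme. Everything here is illustrative, not the repository's code: `DummyCTCEncoder` is a hypothetical stand-in for a trained CTC model (any module mapping frames to per-frame log-probabilities would do), and the 1:1 time mapping ignores the real encoder's progressive downsampling.

```python
import torch

class DummyCTCEncoder(torch.nn.Module):
    # Hypothetical stand-in for a trained CTC encoder:
    # (batch, time, feat) -> (batch, time, vocab) log-probs.
    def __init__(self, feat_dim=80, vocab_size=32):
        super().__init__()
        self.proj = torch.nn.Linear(feat_dim, vocab_size)

    def forward(self, x):
        return self.proj(x).log_softmax(dim=-1)

def ctc_greedy_collapse(ids, blank=0):
    # Standard CTC post-processing: merge repeats, drop blanks.
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

def stream_decode(encoder, frames, chunk_size=40, left_context=60, blank=0):
    # Decode chunk by chunk, re-feeding `left_context` past frames so the
    # encoder sees some history; only the new chunk's outputs are kept.
    # Simplification: repeats straddling a chunk boundary are not merged.
    tokens = []
    for start in range(0, frames.size(1), chunk_size):
        ctx_start = max(0, start - left_context)
        window = frames[:, ctx_start:start + chunk_size]
        logp = encoder(window)
        # Keep only outputs for the new frames (assumes a 1:1 time mapping;
        # a real subsampling encoder needs the downsampling ratio here).
        new = logp[:, start - ctx_start:]
        ids = new.argmax(dim=-1)[0].tolist()
        tokens.extend(ctc_greedy_collapse(ids, blank))
    return tokens
```

The trade-off is latency versus accuracy: a larger `left_context` gives the encoder more history per chunk at the cost of recomputation, while `chunk_size` sets the step size between emissions.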

It is also possible to decode streaming audio frame by frame with a model trained with causal context (masked future context in the attention and convolutions). But this would require small changes to the implementation and configs. For now, all provided configs train full-context models.
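The two causal ingredients mentioned above can be sketched as follows. This is a generic illustration of masking future context, not the repository's implementation: an upper-triangular attention mask plus a left-padded depthwise convolution.

```python
import torch
import torch.nn.functional as F

def causal_attention_mask(T):
    # True above the diagonal marks positions to mask: frame t may only
    # attend to frames <= t (no future context).
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

def causal_conv1d(x, weight, kernel_size):
    # Depthwise conv padded only on the left, so the output at frame t
    # never sees frames > t. x: (batch, channels, time),
    # weight: (channels, 1, kernel_size).
    x = F.pad(x, (kernel_size - 1, 0))
    return F.conv1d(x, weight, groups=weight.size(0))
```

With both applied during training, no layer ever looks ahead, so at inference time each new frame can be processed as it arrives.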

harisgulzar1 commented 1 year ago

Hi @burchim, thanks for your comment. That clarifies my doubt; I will try to implement it and see how it goes.

debasish-mihup commented 1 year ago

@harisgulzar1 did you get the time to implement and test the streaming ASR? How is the performance?

harisgulzar1 commented 1 year ago

@debasish-mihup I haven't implemented it yet, but I will do it soon. In the meantime, I found this tutorial for building a streaming ASR pipeline. You may find it helpful. https://colab.research.google.com/github/pytorch/audio/blob/gh-pages/main/_downloads/bd34dff0656a1aa627d444a8d1a5957f/online_asr_tutorial.ipynb#scrollTo=joQ2X3uYAnfC

debasish-mihup commented 1 year ago

@harisgulzar1 I did take a look at the shared notebook. I have some doubts: are they only maintaining the context and using it during the decoder part of the pipeline, or are they using context during the encoder phase as well?

harisgulzar1 commented 1 year ago

I have implemented the inference code based on the above Google Colaboratory example, but the accuracy is very poor for small iterative chunks of audio. I think that for streaming applications we need to retrain the models with zero future context, by setting the causal parameter to True in the encoder.py file. Is my intuition correct?

kafan1986 commented 1 year ago

@harisgulzar1 Did you get it to work reliably in streaming mode?