flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

How to calculate 500ms_context from am_500ms_future_context.arch? #985

Open yuseungwoo opened 2 years ago

yuseungwoo commented 2 years ago

Question

Thank you in advance for reading my question.

I have a question.

I am referring to the paper https://research.fb.com/wp-content/uploads/2020/01/Scaling-up-online-speech-recognition-using-ConvNets.pdf and the example recipe at https://github.com/flashlight/wav2letter/tree/master/recipes/streaming_convnets/librispeech

Both the paper and the example say that the model defined by am_500ms_future_context.arch has 500 ms of future context, but I don't understand why.

Could you explain how the architecture in am_500ms_future_context.arch results in 500 ms of future context?

Best regards,

Seung Woo

tlikhomanenko commented 2 years ago

Hey,

You need to calculate the receptive field of your convolutional network, that is, determine which future and past tokens are used in the computation of a particular output frame.

I believe this is done automatically in our code: we define a function that computes a convolution's receptive field from the conv parameters and then propagate the result to the next layer. cc @vineelpratap in case I am wrong.
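
To illustrate the idea of propagating the receptive field layer by layer, here is a minimal sketch. It is not the wav2letter implementation: the function name `context_frames`, the `(kernel_size, stride, left_padding)` layer format, the example layer values, and the 10 ms frame shift are all assumptions made for illustration. It assumes plain 1-D convolutions without dilation and `left_padding <= kernel_size - 1`.

```python
from typing import Iterable, Tuple

# Hypothetical layer description: (kernel_size, stride, left_padding).
# Right padding only changes the output length, not which inputs a frame sees.
Layer = Tuple[int, int, int]


def context_frames(layers: Iterable[Layer]) -> Tuple[int, int]:
    """Return (past, future) context of one output frame, in input frames."""
    past, future = 0, 0
    jump = 1  # distance in input frames between neighbouring frames at this depth
    for kernel, stride, left_pad in layers:
        # Each output frame looks left_pad frames back and
        # (kernel - 1 - left_pad) frames ahead at the current depth.
        past += left_pad * jump
        future += (kernel - 1 - left_pad) * jump
        jump *= stride
    return past, future


# Illustrative values only, NOT the contents of am_500ms_future_context.arch.
example_layers = [(5, 2, 2), (3, 1, 1)]
FRAME_SHIFT_MS = 10  # assumed feature frame shift

past, future = context_frames(example_layers)
print(f"past: {past * FRAME_SHIFT_MS} ms, future: {future * FRAME_SHIFT_MS} ms")
print(f"total receptive field: {past + future + 1} frames")
```

Note that the total receptive field is past + future + 1 frames, so with asymmetric padding the future context can be much smaller than the total receptive field.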

airlab-byeol commented 2 years ago

@tlikhomanenko Thank you for bringing up a good point. I calculated the receptive field for one particular output frame. [image] Based on my math, the receptive field is about 1.5 s. Is this related to the 500 ms in any way?

yuseungwoo commented 2 years ago

Dear @tlikhomanenko

I appreciate your work on this paper and your answer to my question.

I'm very impressed by your work and am studying your model, am_500ms_future_context.arch.

I'm especially interested in reducing model size and improving inference speed.

I'd like to ask you one more thing.

According to your paper, the model with 250 ms of future context is nearly as good as the 500 ms architecture. However, I can't find it.

Where can I find this model, or could you provide it?

Sincerely,

Seung Woo

nguyenhuy1209 commented 2 years ago

Hi @airlab-byeol, would you mind explaining how you arrived at that figure? I feel it is really close to the answer, but I don't understand why you used 100 frames as input. Thank you.