From the paper, page 9, Table 4 (https://arxiv.org/pdf/1512.02595.pdf), the filter dimensions and strides are described as below (the first dimension is frequency and the second is time):

(Architecture) (Channels) (Filter dimension) (Stride) ... (2-layer 2D) (32, 32) (41x11, 21x11) (2x2, 2x1) ...

But in the code, deepspeech.torch/DeepSpeechModel.lua, lines 25 to 28:

conv:add(nn.SpatialConvolution(1, 32, 11, 41, 2, 2))
conv:add(nn.SpatialBatchNormalization(32))
conv:add(nn.Clamp(0, 20))
conv:add(nn.SpatialConvolution(32, 32, 11, 21, 2, 1))

the stride scheme seems to differ, because the last line translated into the paper's notation would be:

(Architecture) (Channels) (Filter dimension) (Stride) ... (2-layer 2D) (32, 32) (41x11, 21x11) (2x2, 1x2) ...
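To make the translation explicit, here is a small sketch in Python (the helper name is hypothetical; the Lua code itself is just argument bookkeeping). It assumes Torch's nn.SpatialConvolution takes (nInputPlane, nOutputPlane, kW, kH, dW, dH), and that in this model the input layout is (batch, channel, frequency, time), i.e. height = frequency and width = time:

```python
# Hypothetical helper: convert Torch nn.SpatialConvolution arguments into the
# paper's "frequency x time" notation, assuming height = frequency, width = time.
def to_paper_notation(n_in, n_out, kW, kH, dW=1, dH=1):
    filt = f"{kH}x{kW}"    # the paper lists the frequency dimension first
    stride = f"{dH}x{dW}"
    return filt, stride

# The two convolution layers from DeepSpeechModel.lua:
print(to_paper_notation(1, 32, 11, 41, 2, 2))   # -> ('41x11', '2x2'), matches the paper
print(to_paper_notation(32, 32, 11, 21, 2, 1))  # -> ('21x11', '1x2'), vs. the paper's 2x1
```

Under this reading, the first layer agrees with Table 4, while the second layer's stride of 2 falls on the time axis instead of the frequency axis.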
I'm wondering whether this is a misunderstanding on my part, or a different scheme chosen for better performance. Thank you in advance for answering.