SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

Question for different scheme between the code and the paper: the filter dimension and stride #81

Closed Soonhwan-Kwon closed 7 years ago

Soonhwan-Kwon commented 7 years ago

Table 4 on page 9 of the paper (https://arxiv.org/pdf/1512.02595.pdf) describes the filter dimensions and strides as below (the first dimension is frequency and the second dimension is time):

| Architecture | Channels | Filter dimension | Stride |
|---|---|---|---|
| ... | | | |
| 2-layer 2D | 32, 32 | 41x11, 21x11 | 2x2, 2x1 |
| ... | | | |

But the code in deepspeech.torch/DeepSpeechModel.lua, lines 25 to 28, reads:

    conv:add(nn.SpatialConvolution(1, 32, 11, 41, 2, 2))
    conv:add(nn.SpatialBatchNormalization(32))
    conv:add(nn.Clamp(0, 20))
    conv:add(nn.SpatialConvolution(32, 32, 11, 21, 2, 1))

This seems to use a different stride scheme, because the last line, translated into the paper's notation, would be:

| Architecture | Channels | Filter dimension | Stride |
|---|---|---|---|
| ... | | | |
| 2-layer 2D | 32, 32 | 41x11, 21x11 | 2x2, 1x2 |
| ... | | | |
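The translation above can be sketched in a few lines. This is only an illustration in plain Python, not code from the repository. It relies on the Torch7 argument order `nn.SpatialConvolution(nInputPlane, nOutputPlane, kW, kH, dW, dH)`, where width comes before height, and it assumes (as the reading above does) that the input is laid out with frequency along the height axis and time along the width axis. The helper name `torch_conv_to_paper` and the `ConvSpec` type are hypothetical.

```python
from typing import NamedTuple, Tuple


class ConvSpec(NamedTuple):
    """A conv layer in the paper's notation (frequency x time)."""
    channels: int
    filter_freq_time: Tuple[int, int]
    stride_freq_time: Tuple[int, int]


def torch_conv_to_paper(n_in: int, n_out: int, kW: int, kH: int,
                        dW: int = 1, dH: int = 1) -> ConvSpec:
    """Reorder Torch7 SpatialConvolution arguments into (freq x time).

    Torch7 lists width before height. Assumption: height is the
    frequency axis and width is the time axis, so the paper's
    (freq x time) pairs are (kH x kW) and (dH x dW).
    """
    return ConvSpec(
        channels=n_out,
        filter_freq_time=(kH, kW),
        stride_freq_time=(dH, dW),
    )


# The two convolution layers quoted from DeepSpeechModel.lua.
layer1 = torch_conv_to_paper(1, 32, 11, 41, 2, 2)
layer2 = torch_conv_to_paper(32, 32, 11, 21, 2, 1)

for name, spec in (("layer 1", layer1), ("layer 2", layer2)):
    f, s = spec.filter_freq_time, spec.stride_freq_time
    print(f"{name}: channels={spec.channels} "
          f"filter={f[0]}x{f[1]} stride={s[0]}x{s[1]}")
```

Under these assumptions the second layer comes out as filter 21x11 with stride 1x2, which is the row shown above. If the input were instead laid out with time along the height axis, the pairs would swap and the comparison with the paper would change.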

I'm wondering whether this is a misunderstanding on my part or a deliberately different scheme chosen to get better performance. Thank you in advance for your answer.