Open Howe-Young opened 5 years ago
Consider residual blocks as in Deep Speaker, but with fewer channels.
Use mean pooling over the time axis instead of taking the last frame of the LSTM output.
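
To make the pooling suggestion concrete, here is a rough NumPy sketch comparing the two options. It is not code from this repo: the function names, the `(batch, time, features)` layout, and the `lengths` array for padded sequences are all assumptions for illustration.

```python
import numpy as np


def last_frame(outputs, lengths):
    """Take the output at the last valid frame of each sequence.

    outputs: (batch, time, features) array of per-frame LSTM outputs (assumed layout).
    lengths: (batch,) int array with the number of valid frames per sequence.
    """
    batch_idx = np.arange(outputs.shape[0])
    return outputs[batch_idx, lengths - 1]


def mean_pool(outputs, lengths):
    """Average the outputs over the valid frames (time axis) of each sequence."""
    num_frames = outputs.shape[1]
    # mask[b, t] is 1 for real frames and 0 for padding.
    mask = (np.arange(num_frames)[None, :] < lengths[:, None]).astype(outputs.dtype)
    summed = (outputs * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]


# Hypothetical LSTM outputs: 2 utterances, 3 frames, 2 features.
# The second utterance has only 2 valid frames; its last frame is padding.
outputs = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [0.0, 0.0]],
])
lengths = np.array([3, 2])

print(last_frame(outputs, lengths))  # one vector per utterance, from a single frame
print(mean_pool(outputs, lengths))   # one vector per utterance, averaged over frames
```

With mean pooling every valid frame contributes to the utterance embedding, whereas the last-frame variant depends on a single time step. The mask keeps padded frames out of the average.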