freewym / espresso

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Batchnorm and masking #46

Open · danpovey opened this issue 3 years ago

danpovey commented 3 years ago

It looks like the batchnorm doesn't take into account the masking:

https://github.com/freewym/espresso/blob/6fca6cacd9d475d2676c527999e2d1bde08e7cbb/espresso/models/speech_tdnn.py#L170

Surely this isn't right? However, I don't know how to take the masking into account here.

freewym commented 3 years ago

I think Batchnorm is per dimension, so the masked part will not affect the unmasked part?

danpovey commented 3 years ago

The masking is over the time index, not the feature dimension. Within each dimension, batchnorm averages over batch and time, so the padded frames still enter the statistics.
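
A minimal sketch of the point, assuming (B, C, T) inputs and zero padding (the shapes and the mask construction below are illustrative, not taken from speech_tdnn.py): nn.BatchNorm1d computes each channel's statistics over the batch and time axes together, so the padded frames shift the per-channel mean and variance.

```python
import torch
import torch.nn as nn

# Two utterances zero-padded to a common length (illustrative shapes only).
B, C, T = 2, 4, 10                                   # batch, channels, padded frames
lengths = torch.tensor([10, 6])                      # true lengths; 4 pad frames in the 2nd
x = torch.randn(B, C, T)
mask = torch.arange(T)[None, :] < lengths[:, None]   # (B, T), True on real frames
x = x * mask.unsqueeze(1).float()                    # zero out the padded frames

# BatchNorm1d normalizes each channel over batch *and* time together,
# so the statistics it uses include the padded positions:
bn = nn.BatchNorm1d(C)
y = bn(x)

padded_mean = x.mean(dim=(0, 2))                     # per-channel mean BN effectively uses
real_mean = x.transpose(1, 2)[mask].mean(dim=0)      # per-channel mean over real frames only
print(padded_mean)                                   # shrunk toward zero by the pad frames
print(real_mean)
```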

freewym commented 3 years ago

Oh OK. How does Kaldi deal with it? Does it guarantee the same length within a batch so there is no padding?

danpovey commented 3 years ago

The issue doesn't arise in Kaldi; we use regular minibatches. In Lhotse, our plan would be to use lhotse itself to do the padding (which would add silence or noise), so we wouldn't be messing about with zeros throughout the propagation; it would be a real signal. But I expect batchnorm may support a mask matrix.
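
As far as I know, PyTorch's nn.BatchNorm1d does not take a mask argument, but a masked variant can be sketched by accumulating the per-channel statistics over real frames only. This is a minimal sketch, not Espresso's or Kaldi's implementation; the name MaskedBatchNorm1d and the (B, C, T) input plus (B, T) mask interface are assumptions:

```python
import torch
import torch.nn as nn


class MaskedBatchNorm1d(nn.Module):
    """Batch norm over (B, C, T) inputs that ignores padded frames.

    `mask` is a (B, T) boolean tensor, True on real frames. A hypothetical
    drop-in for nn.BatchNorm1d in models that carry a padding mask.
    """

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x, mask):
        if self.training:
            m = mask.unsqueeze(1).to(x.dtype)        # (B, 1, T)
            n = m.sum()                              # number of real frames
            mean = (x * m).sum(dim=(0, 2)) / n
            var = ((x - mean.view(1, -1, 1)) ** 2 * m).sum(dim=(0, 2)) / n
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)
        else:
            mean, var = self.running_mean, self.running_var
        x = (x - mean.view(1, -1, 1)) / torch.sqrt(var.view(1, -1, 1) + self.eps)
        return x * self.weight.view(1, -1, 1) + self.bias.view(1, -1, 1)
```

The padded positions are still normalized with the same statistics, so a caller would typically re-apply the mask afterwards, e.g. `y = bn(x, mask) * mask.unsqueeze(1)`.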
