flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Details about the network in the paper #265

Closed lucgeo closed 5 years ago

lucgeo commented 5 years ago

Hi,

I'm looking for more details about the learnable front-end and the language-modelling network presented in this paper [1]. Is there a config file corresponding to them? I'm interested in the types of layers, their dimensionality, how many there are, etc.

Thank you!

sanjaykasturia commented 5 years ago

Looking through the architecture files in the recipes for LibriSpeech, TIMIT, and WSJ (for example recipes/wsj/configs/conv_glu/network.arch), I see that they all begin with the same first layer:

V -1 1 NFEAT 0

From this, I infer that the first dimension is the sample dimension, i.e., incrementing the index in this dimension moves you to the next sample in time. The second is 1, and so trivial, and the third (NFEAT) corresponds to the number of filter banks and hence to the MFCC coefficients. Is this correct?

What is not clear is what is in the 4th dimension, and what its value is. A 0 indicates that the value is taken from the incoming tensor, but that value isn't easy to see. Does this dimension represent the delta and delta-delta of the DCT coefficients? Or are the delta and delta-delta stacked on the DCT coefficients in the 3rd dimension, and this dimension holds user information or something else?
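
To make my reading explicit, here is that first layer annotated under my assumptions (that V is a view/reshape, that -1 means the dimension is inferred from the input size, and that 0 means it is copied from the incoming tensor):

# my assumed layout: time x 1 x NFEAT x (the 4th dimension in question)
# -1 = inferred from the input size, 0 = copied from the incoming tensor
V -1 1 NFEAT 0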

Another commonality in the provided architectures is the reorder layer towards the end:

RO 2 0 3 1

I'm having difficulty understanding the rationale for the reordering, since it is followed only by a WN and a GLU, and both take DIM as an index. Couldn't we skip the reordering and get the same effect by changing the DIM parameter in the GLU and WN layers that follow it?
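
Spelling out my assumption about RO (that its arguments list, for each output dimension, the input dimension it is taken from):

# assumed permutation: (d0, d1, d2, d3) -> (d2, d0, d3, d1)
RO 2 0 3 1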

jacobkahn commented 5 years ago

@lucgeo: we haven't open-sourced the learnable front-end yet; we'll revisit doing so soon.

@sanjaykasturia: I'll answer your questions in order:

The format of the featurized data is based on how we compute MFCCs, which is why this initial reshaping is required.

The second RO you point out, towards the end of the network before the linear layers, actually serves another purpose: the criterion expects its input in that ordering, and that arrangement lets us run the criterion computations more efficiently. We have this reordering even in architectures that don't have linear layers.
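
To make that concrete, here is a sketch of the end of such a network, assuming activations enter the reorder as time x 1 x channels x batch and that RO lists, for each output dimension, the input dimension it takes; the shapes are illustrative, not copied from a specific recipe:

# before the reorder: T x 1 x C x B
RO 2 0 3 1
# after the reorder: C x T x B x 1
# a final linear layer mapping C -> NLABEL then produces NLABEL x T x B x 1,
# which is the layout the criterion consumes directly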

sanjaykasturia commented 5 years ago

Thanks Jacob!