Closed lucgeo closed 5 years ago
Looking through the architecture files in the recipes for LibriSpeech, TIMIT, and WSJ (for example, recipes/wsj/configs/conv_glu/network.arch), I see that all begin with the same first layer: `V -1 1 NFEAT 0`. From this, I'm inferring that the first dimension is the sample dimension, i.e. incrementing the index along this dimension moves you to the next sample in time. The second is "1", so it is trivial, and the third (NFEAT) corresponds to the number of filter banks and hence the MFCC coefficients. Is this correct?
What is not clear is what the 4th dimension holds, and what its value is. A "0" indicates that the value is taken from the incoming tensor, but that value isn't easy to see. Does this dimension represent the delta and delta-delta of the DCT coefficients? Or are the delta and delta-delta stacked onto the DCT coefficients in the 3rd dimension, with this dimension holding user information or something else?
Another commonality in the provided architectures is the reordering layer towards the end: `RO 2 0 3 1`. I'm having difficulty understanding the rationale for the reordering, since it is followed only by a WN and a GLU, and both take DIM as a parameter. Couldn't we skip the reordering and get the same effect by changing the DIM parameter in the GLU and WN layers that follow it?
@lucgeo — we haven't open-sourced the learnable frontend yet; we'll revisit doing so soon.
@sanjaykasturia — I'll answer your questions in order:
The input layout is W[width] H[height] C[channels] N[batch]. Breaking this down further:

- W[width]: the temporal dimension of the data. Each entry in the tensor along this axis is a particular frame in time.
- H[height]: in the case of audio frames over time, this has no meaning; height is trivially 1 for our 1D convolutions, as you note.
- C[channels]: in the case of log-mel filterbanks, filter values for a given frame are emplaced along the third dimension. Each entry represents the magnitude of a particular transformed frequency.
- N[batch]: entries along this axis correspond to multiple samples in a batch.

The format for featurized data is based on how we compute MFCCs, which is why this ordering is required.
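To make the layout above concrete, here is a minimal NumPy sketch. This is illustrative only: wav2letter's actual tensors are ArrayFire arrays, and the `T`, `NFEAT`, and `BATCH` names are my own placeholders, not values from the recipes.

```python
import numpy as np

# Illustrative values for the four axes described above.
T, NFEAT, BATCH = 100, 40, 4  # time frames, filterbank bins, batch size

# `V -1 1 NFEAT 0` views the featurized input with the layout
# (W=time, H=1, C=NFEAT, N=batch); -1 infers time, 0 keeps the batch size.
features = np.zeros((T, 1, NFEAT, BATCH))

print(features.shape)  # (100, 1, 40, 4)
```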
The second RO you point out towards the end of the network, before the linear layers, actually serves another purpose: the criterion input expects that ordering; there is an arrangement that allows us to run the criterion computations more efficiently. We actually have this reordering in architectures that don't have linear layers as well.
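Assuming RO takes a zero-indexed permutation of the input axes (which is how the arch-file convention reads), `RO 2 0 3 1` can be sketched as a transpose. The activation shape below is a made-up example, not taken from any particular recipe:

```python
import numpy as np

# Hypothetical post-convolution activations in (W=time, H=1, C=channels, N=batch).
x = np.zeros((100, 1, 512, 4))

# `RO 2 0 3 1` picks input axis 2 first, then 0, 3, 1:
# (W, H, C, N) -> (C, W, N, H), the ordering the criterion consumes.
y = x.transpose(2, 0, 3, 1)

print(y.shape)  # (512, 100, 4, 1)
```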
Thanks Jacob!
Hi,
I'm looking for more details about the learnable front-end and also about the language-modelling network presented in this paper 1. Is there a config file corresponding to it? I'm interested in the types of layers, their dimensionality, how many there are, etc.
Thank you!