ghost commented 3 years ago

I'm still having trouble reconciling the parameter count claimed in the Arik et al. CRNN paper with the count in my implementation. I did realize I had mixed up which dimension is time vs. frequency when defining the stride/kernel size; after correcting this, the model went up to ~143k parameters. However, the paper says their model, with the same configuration, has ~229k parameters.

I've tried everything I can think of to identify the error, but there's nothing in the paper that I've found that indicates the model I've built is missing anything. If any of you have a moment to take a look, another pair of eyes looking at the model would be very helpful.

ghost commented 3 years ago

For reference, the summary is:

**Encoder piece:**
- Single convolutional layer with 32 filters; input is (batch, freq/features, time, channels). Kernel size is (5, 20) and stride is (2, 8), in (freq, time) order. The 3,232 conv parameters in the summary correspond to a 5×20 kernel (5·20·32 + 32).
- Output is reshaped to (batch, time, features), where the "i-th feature vector is the concatenation of the i-th columns of all the maps" (quoted from the original CRNN paper by Shi et al., which had a much more useful description of how to reshape!).
- Two bidirectional GRU layers, each with 64 output units (32 per direction, concatenated).
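The permute-then-reshape step can be checked in isolation. The following is a minimal NumPy sketch, not the code from this repo: it assumes a single example with the conv output shape from the summary, (freq, time, channels) = (18, 17, 32), and the variable names are made up for illustration. It shows that after the reshape, time step `i` holds column `i` of every feature map. The elements are interleaved by frequency bin rather than concatenated map by map, which only changes their order.

```python
import numpy as np

# One example's feature maps, laid out as (freq, time, channels) = (18, 17, 32).
# This mirrors the conv2d output shape in the summary, minus the batch axis.
freq, time, channels = 18, 17, 32
maps = np.arange(freq * time * channels, dtype=np.float32).reshape(freq, time, channels)

# Permute to (time, freq, channels), then flatten the last two axes into features.
sequence = maps.transpose(1, 0, 2).reshape(time, freq * channels)

# Time step i contains column i of every feature map.
i = 3
column_of_all_maps = maps[:, i, :].reshape(-1)
assert sequence.shape == (17, 576)
assert np.array_equal(sequence[i], column_of_all_maps)

# Concatenating the columns map by map gives the same values in a different order.
per_map = np.concatenate([maps[:, i, c] for c in range(channels)])
assert np.array_equal(np.sort(per_map), np.sort(sequence[i]))
```

Either ordering feeds the same 576 values per time step into the first GRU layer, so the parameter count is unaffected by which one is used.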

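```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 18, 17, 32)        3232      
_________________________________________________________________
permute (Permute)            (None, 17, 18, 32)        0         
_________________________________________________________________
reshape (Reshape)            (None, 17, 576)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 17, 64)            117120    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                18816     
=================================================================
Total params: 139,168
Trainable params: 139,168
Non-trainable params: 0
```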
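The encoder counts above can be reproduced by hand, which helps rule out a miscounted layer. This is a rough sketch with hypothetical helper names. It assumes a single input channel and the Keras GRU bias layout used when `reset_after=True` (two bias vectors per gate). Both assumptions are consistent with the numbers in the summary, but they are inferred from the counts rather than from the model code.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def gru_params(input_dim, units):
    """Keras-style GRU with reset_after=True: 3 gates, each with an input
    kernel, a recurrent kernel, and two bias vectors."""
    return 3 * (input_dim * units + units * units + 2 * units)


def bidirectional_gru_params(input_dim, units):
    """Forward and backward GRUs have identical shapes."""
    return 2 * gru_params(input_dim, units)


# 5x20 kernel on an (assumed) single input channel, 32 filters.
conv = conv2d_params(5, 20, 1, 32)

# After the reshape, each time step has 18 * 32 = 576 features.
features = 18 * 32

# 32 units per direction gives the 64-wide outputs shown in the summary.
bigru_1 = bidirectional_gru_params(features, 32)
bigru_2 = bidirectional_gru_params(2 * 32, 32)

encoder_total = conv + bigru_1 + bigru_2
print(conv, bigru_1, bigru_2, encoder_total)  # 3232 117120 18816 139168
```

The match suggests that each bidirectional layer uses 32 units per direction, and that the first GRU layer, driven by the 576-wide input, accounts for most of the encoder's parameters.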


**Detect piece:**
- Single fully-connected layer, 64 hidden units.
- Sigmoidal layer for output.

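```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 4,225
Trainable params: 4,225
Non-trainable params: 0
```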
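As a quick sanity check on the detect piece and the overall gap, the arithmetic can be sketched as follows. The helper name is hypothetical, the encoder total is copied from the encoder summary, and the paper figure is the approximate ~229k quoted earlier in this thread, so the gap is only a rough estimate.

```python
def dense_params(input_dim, units):
    """Fully-connected layer: weight matrix plus one bias per unit."""
    return input_dim * units + units


hidden = dense_params(64, 64)  # 64-wide encoder output into 64 hidden units
output = dense_params(64, 1)   # single sigmoid output unit
detect_total = hidden + output

encoder_total = 139_168        # copied from the encoder summary
model_total = encoder_total + detect_total

paper_total = 229_000          # approximate figure quoted from the paper
gap = paper_total - model_total

print(hidden, output, detect_total)  # 4160 65 4225
print(model_total, gap)              # 143393 85607
```

The detect piece contributes only about 4k parameters, so the missing ~86k would have to come from somewhere other than these two dense layers as specified.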


**All together:**
```
_________________________________________________________________
Model: "arik_crnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
sequential (Sequential)      (None, 64)                139168    
_________________________________________________________________
sequential_1 (Sequential)    (None, 1)                 4225      
=================================================================
Total params: 143,393
Trainable params: 143,393
Non-trainable params: 0
```

ghost commented 3 years ago

I ended up shooting an email to the author in hopes he could provide more details on the architecture than are in the paper. Will post here if I hear back!

MerlinPCarson / WakeWord-Detection

Parameters for CRNN #11
