Closed piyush-das closed 3 years ago
IIRC, this was because of how dimensions are represented in pointwise filters. It may have been a design decision because it was difficult to make mobilenets work properly. Technically, benefits from orthogonality (as shown by work on mean field theory) arise when the network is overparameterized, a restriction mobilenets don't satisfy. You can find several details in the appendix of the paper.
The appendix states the following:
MobileNet-V1 has depth-separable filters that use M depthwise filters of dimensions 3×3×1 to process an input with M channels (see Figure 5). Each filter processes its corresponding channel, resulting in an output with M channels as well. This output is processed by N pointwise filters of dimensions 1×1×N filters
However, if the preceding depthwise convolution has M output channels, each pointwise filter should have dimensions 1×1×M, and there would be N such pointwise filters. I think the current implementation assumes the kernel dimension is 1×1×N, as stated in the appendix, and hence `weight.shape[1] == N`. As I understand it, though, the kernel dimension should be 1×1×M, so `weight.shape[1] != N` (rather, `weight.shape[0] == N`). Is my understanding correct, or am I missing something?
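To make the shapes in this argument concrete, here is a minimal NumPy sketch of a depthwise-separable block. It assumes the weight layout `(out_channels, in_channels_per_group, kH, kW)`, which is the layout PyTorch's `Conv2d` uses and which the `weight.shape[0]` / `weight.shape[1]` indexing in this thread appears to imply. The function names, the values of `M` and `N`, and the input size are illustrative only and are not taken from the repository.

```python
import numpy as np

def depthwise_conv(x, w):
    """x: (M, H, W); w: (M, 1, 3, 3). Each filter sees only its own channel (no padding)."""
    M, H, W = x.shape
    out = np.zeros((M, H - 2, W - 2))
    for c in range(M):
        for i in range(H - 2):
            for j in range(W - 2):
                out[c, i, j] = np.sum(x[c, i:i + 3, j:j + 3] * w[c, 0])
    return out

def pointwise_conv(x, w):
    """x: (M, H, W); w: (N, M, 1, 1). Each of the N filters mixes all M input channels."""
    return np.einsum("nm,mhw->nhw", w[:, :, 0, 0], x)

M, N = 4, 6  # illustrative channel counts
rng = np.random.default_rng(0)
x = rng.standard_normal((M, 8, 8))
dw_weight = rng.standard_normal((M, 1, 3, 3))  # M depthwise filters of size 3x3x1
pw_weight = rng.standard_normal((N, M, 1, 1))  # N pointwise filters of size 1x1xM

dw_out = depthwise_conv(x, dw_weight)       # still M channels
pw_out = pointwise_conv(dw_out, pw_weight)  # now N channels

print(dw_out.shape, pw_out.shape)
print(pw_weight.shape[0] == N, pw_weight.shape[1] == M)
```

Under this layout the number of pointwise filters is `pw_weight.shape[0]`, while `pw_weight.shape[1]` is the number of input channels M. A kernel of depth N could not be applied to an M-channel input unless M happens to equal N.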
I see the point you are making and believe you are correct. In case you try it out and find it works better, please let me know and I will update the code.
Hi,
According to the paper, when calculating α(l), M_l is the number of filters in the layer. However, in the implementation for pointwise convolutions it appears that we are using `weight.shape[1]`, which is C_in, not the number of filters, which should be `weight.shape[0]`. Is this by design? Thanks