JLREx / PAtt-Lite

Official implementation for PAtt-Lite: Lightweight Patch and Attention MobileNet for Challenging Facial Expression Recognition

Question about Patch Extraction module in paper #5

Closed · zerooooone closed this 2 months ago

zerooooone commented 7 months ago

Thanks for your great work! I have trouble understanding the Patch Extraction module in your paper. The output feature of MobileNet_v1 is padded to 16x16x512, so why do the width and height of the feature become 4 after a separable convolution? How are patches partitioned in the separable convolution? Could you elaborate on the details of the Patch Extraction module?
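For reference, the spatial shapes line up if the first separable convolutional layer uses a 4x4 kernel with stride 4: a 16x16 input is tiled into non-overlapping 4x4 windows, giving a 4x4 output. A minimal Keras sketch, assuming these (unconfirmed) kernel, stride, and filter values rather than the authors' actual configuration:

```python
import tensorflow as tf

# Hypothetical patch-extraction step: kernel_size=4 with strides=4 and
# 'valid' padding tiles a 16x16 map into non-overlapping 4x4 windows,
# so the spatial size drops to 16 / 4 = 4. The 256 filters are assumed.
inputs = tf.keras.Input(shape=(16, 16, 512))
patches = tf.keras.layers.SeparableConv2D(
    filters=256, kernel_size=4, strides=4, padding="valid", activation="relu"
)(inputs)
print(patches.shape)  # (None, 4, 4, 256)
```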

robertomazo commented 7 months ago

They say:

> the first separable convolutional layer is responsible for splitting the feature maps into four patches while learning higher-level features from its input.

It is possible to set the depthwise and pointwise layers of the separable convolution so that they reduce the dimensions. I believe the terms "depthwise separable convolution" and "depthwise convolution" are a bit confusing.

Also, I have trouble understanding the padding, since I've truncated MobileNet v1 and the output is already 16x16 and not 14x14.
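On the terminology: in Keras, `DepthwiseConv2D` applies one spatial filter per input channel and leaves the channel count unchanged, while `SeparableConv2D` follows the same depthwise step with a 1x1 pointwise convolution that mixes channels and sets the output depth. A quick illustration (the shapes here are illustrative, not the paper's exact configuration):

```python
import tensorflow as tf

x = tf.keras.Input(shape=(16, 16, 512))

# Depthwise convolution alone: one filter per channel, so the channel
# count is unchanged (512 in, 512 out); only the spatial dims shrink.
dw = tf.keras.layers.DepthwiseConv2D(kernel_size=4, strides=4)(x)

# Depthwise *separable* convolution: the depthwise step followed by a
# 1x1 pointwise convolution, which mixes channels and sets the output
# depth (here 256).
sep = tf.keras.layers.SeparableConv2D(filters=256, kernel_size=4, strides=4)(x)

print(dw.shape)   # (None, 4, 4, 512)
print(sep.shape)  # (None, 4, 4, 256)
```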

zerooooone commented 6 months ago

> Also, I have trouble understanding the padding, since I've truncated MobileNet v1 and the output is already 16x16 and not 14x14.

The input image size is 224x224, so the output of the truncated MobileNet_v1 is 14x14.
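This is easy to verify with the stock Keras MobileNet v1: the network downsamples by a factor of 2^4 = 16 up to its 14x14x512 blocks, so a 224x224 input gives 14x14 while a 256x256 input gives 16x16, which would explain the observation above. A sketch, assuming `conv_pw_11_relu` as the truncation point (the exact cut layer used by PAtt-Lite is not confirmed here):

```python
import tensorflow as tf

# Sanity check of the shape arithmetic with the stock Keras MobileNet v1.
# 'conv_pw_11_relu' is one of the last 14x14x512 layers; the truncation
# point used by PAtt-Lite is an assumption.
for size in (224, 256):
    base = tf.keras.applications.MobileNet(
        input_shape=(size, size, 3), include_top=False, weights=None
    )
    trunc = tf.keras.Model(base.input, base.get_layer("conv_pw_11_relu").output)
    print(size, trunc.output_shape)
# 224 -> (None, 14, 14, 512)   (224 / 2**4 = 14)
# 256 -> (None, 16, 16, 512)   (256 / 2**4 = 16, matching the 16x16 above)
```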

robertomazo commented 6 months ago

> The input image size is 224x224, so the output of the truncated MobileNet_v1 is 14x14.

I tried to replicate this, and the output is not 14x14 (in my case; I am probably doing something wrong).

JLREx commented 5 months ago

Hi, we have just uploaded the training notebook for your reference. Regarding this:

> The input image size is 224x224, so the output of the truncated MobileNet_v1 is 14x14.

Yes, the output is 14x14. We apply the padding before the patch extraction block, hence the input to the patch extraction block is 16x16.
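Putting the pieces together, a `ZeroPadding2D(padding=1)` takes the 14x14 backbone output to 16x16 before the patch extraction block. A minimal sketch; the separable-convolution hyperparameters after the padding are assumptions, so see the uploaded training notebook for the actual values:

```python
import tensorflow as tf

x = tf.keras.Input(shape=(14, 14, 512))          # truncated MobileNet_v1 output
p = tf.keras.layers.ZeroPadding2D(padding=1)(x)  # 14x14 -> 16x16
y = tf.keras.layers.SeparableConv2D(             # assumed kernel/stride/filters
    filters=256, kernel_size=4, strides=4, activation="relu"
)(p)                                             # 16x16 -> 4x4
print(y.shape)  # (None, 4, 4, 256)
```

Padding to 16x16 first means the spatial size is divisible by the stride, so the strided separable convolution covers the whole feature map with no cropped border.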