feevos / resuneta

mxnet source code for the resuneta semantic segmentation models

Question about PSP Pooling initial input #1

Closed se7enXF closed 4 years ago

se7enXF commented 4 years ago

I want to implement your work in TensorFlow but have run into a problem. In your paper, 'the initial input is split in channel (feature) space in 4 equal partitions' describes the initial input of each branch of PSP Pooling. I found that this is different from PSPNet, and I do not understand why you split the feature map while otherwise following that idea. Here is the problem.
Let a feature map have shape F(batch, width, height, channel). I thought your idea was to split the feature map like F[b, w, h, c/4], but your code does something different:

b  = F.split(_a,axis=2,num_outputs=2)   # split along axis 2
c1 = F.split(b[0],axis=3,num_outputs=2) # split along axis 3
c2 = F.split(b[1],axis=3,num_outputs=2) # split along axis 3

d11 = c1[0]
d12 = c1[1]
d21 = c2[0]
d22 = c2[1]

If I am right that data in mxnet has the shape (batch, channel, width, height), my question is why you split the feature map along width and height. Or am I missing something?
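To make the axis question concrete, here is a small numpy sketch (not the repo's code) of what the split snippet above does on an NCHW tensor; the array shape is hypothetical:

```python
import numpy as np

# Hypothetical NCHW feature map (mxnet layout: batch, channels, height, width)
F = np.arange(1 * 4 * 8 * 8).reshape(1, 4, 8, 8)

# axis=2 is height and axis=3 is width, so the splits above cut the
# feature map into four spatial quadrants, not channel groups.
top, bottom = np.split(F, 2, axis=2)  # split along H
d11, d12 = np.split(top, 2, axis=3)   # split along W
d21, d22 = np.split(bottom, 2, axis=3)

print(d11.shape)  # (1, 4, 4, 4): channels stay at 4, H and W are halved
```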

Sorry to trouble you and thanks for your work!

se7enXF commented 4 years ago

OK! I have found the answer.
You split the feature map into tiles at different scales and apply global pooling to each tile to implement PSP Pooling. This has the same effect as pooling the full feature map with different kernel sizes.
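The equivalence claimed above can be checked with a small numpy sketch (values and sizes are hypothetical): max pooling an 8x8 map with a 4x4 kernel and stride 4 gives the same result as splitting it into a 2x2 grid of tiles and taking the global max of each tile.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 8))

# (a) fixed-kernel max pooling: kernel 4, stride 4 -> 2x2 output
pooled = x.reshape(2, 4, 2, 4).max(axis=(1, 3))

# (b) split into a 2x2 grid of 4x4 tiles and global-max-pool each tile
tiles = [np.split(row, 2, axis=1) for row in np.split(x, 2, axis=0)]
tiled = np.array([[t.max() for t in row] for row in tiles])

print(np.array_equal(pooled, tiled))  # True
```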

feevos commented 4 years ago

Hi @se7enXF, apologies for the late reply, I just saw this. You can find a simple explanation of PSPPooling in the psp_pooling_understanding_nonHybrid.py file.

In mxnet, the dimensions are (Batch, NChannels, Height, Width). The input is subsampled in (H, W) space with max pooling at scales of 1/1, 1/2, 1/4, etc. per dimension (this pooling leaves the number of channels unchanged):

 x = F.Pooling(_input,kernel=[pool_size,pool_size],stride=[pool_size,pool_size],pool_type='max')

Then the result is upsampled back to the original resolution:

 x = F.UpSampling(x,sample_type='nearest',scale=pool_size) 
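The pool-then-upsample step can be sketched in numpy (not mxnet's operators, just the same arithmetic); the input array is hypothetical:

```python
import numpy as np

def max_pool_then_upsample(x, pool_size):
    """Max-pool an (H, W) array with kernel = stride = pool_size,
    then nearest-neighbour upsample back to the original resolution."""
    h, w = x.shape
    pooled = x.reshape(h // pool_size, pool_size,
                       w // pool_size, pool_size).max(axis=(1, 3))
    # nearest-neighbour upsampling repeats each pooled value
    # pool_size times along each spatial axis
    return pooled.repeat(pool_size, axis=0).repeat(pool_size, axis=1)

x = np.arange(16.0).reshape(4, 4)
y = max_pool_then_upsample(x, 2)
print(y.shape)  # (4, 4): same resolution, each 2x2 block holds its max
```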

and then this output is passed through a convolution layer that reduces the number of channels to 1/4 of their initial number.

These 4 outputs are then concatenated with the initial input, resulting in twice as many channels as the initial number of filters:

 out = F.concat(p[0],p[1],p[2],p[3],p[4],dim=1)

Finally, a last convolution brings the total number of channels back down to the initial number.
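A 1x1 convolution is just a per-pixel linear map over channels, so the final channel reduction can be sketched in numpy (random weights, hypothetical sizes, no bias or activation):

```python
import numpy as np

C = 32
x = np.random.default_rng(1).random((1, 2 * C, 16, 16))  # (N, 2C, H, W)
w = np.random.default_rng(2).random((C, 2 * C))          # (C_out, C_in)

# 1x1 convolution: contract the channel axis at every spatial location,
# taking 2C channels back down to C
out = np.einsum('oc,nchw->nohw', w, x)
print(out.shape)  # (1, 32, 16, 16)
```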

The actual implementation in mxnet is a bit more involved in order to support the hybridization feature, but the philosophy is the same.

Hope this helps.