The convolution that predicts the offsets and modulation usually has 27 output channels when the kernel size is 3. But according to my understanding of the paper, for each input channel we need 3 predictions, i.e. the shift in x, the shift in y, and the modulation factor. So the number of output channels should be 3 times the number of input channels. However, that is not what the code does.
The paper says: "The output is of 3K channels, where the first 2K channels correspond to the learned offsets and the remaining K channels are further fed to a sigmoid layer to obtain the modulation scalars."
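To make the arithmetic in that quote concrete, here is a minimal sketch. It assumes that K is the number of sampling locations in the kernel (kernel height times kernel width, so 9 for a 3x3 kernel), not the number of input channels. The function name `split_offset_mask` and the flat list layout are hypothetical and only for illustration. The order of the x and y components within the first 2K values depends on the implementation and is not modelled here.

```python
import math


def split_offset_mask(pred, kernel_h=3, kernel_w=3):
    """Split the 3K predictions at one spatial location.

    Assumption: K = kernel_h * kernel_w (number of kernel sampling points).
    Returns the 2K raw offset values and the K modulation scalars in (0, 1).
    """
    k = kernel_h * kernel_w
    if len(pred) != 3 * k:
        raise ValueError(f"expected {3 * k} values, got {len(pred)}")
    offsets = list(pred[: 2 * k])  # first 2K channels: learned offsets
    # remaining K channels go through a sigmoid to give modulation scalars
    mask = [1.0 / (1.0 + math.exp(-v)) for v in pred[2 * k:]]
    return offsets, mask


# For a 3x3 kernel: K = 9, so 3K = 27 values per spatial location.
pred = [0.0] * 27
offsets, mask = split_offset_mask(pred)
print(len(offsets), len(mask))
```

Under this reading the count depends only on the kernel size, which would explain why the layer has 27 output channels regardless of the number of input channels.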
Can someone clarify what I am missing? This layer outputs 27 channels no matter what the input dimension is, so how are they used for the x and y shifts and the modulation?