The output (decoder + axial transformer 1, 2, 3, 4) then gets a sigmoid activation to generate class probabilities.
Why modify the original linear output (from the partial decoder) with a nonlinear function that biases positive (PReLU)? Doesn't this make it more likely to saturate the sigmoid by feeding it a large input, or at the very least lead to exploding biases in the partial decoder?
My understanding is that residual learning is commonly done with no activation function prior to the summation, precisely to avoid exploding biases (continually adding a positive value to a positive value).
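The concern about a positively biased branch can be illustrated with a minimal numpy sketch (the fixed PReLU slope and the zero-mean random stand-in for the residual branch are assumptions for illustration, not CaraNet's actual values):

```python
import numpy as np

def prelu(x, a=0.25):
    # PReLU: identity for x > 0, slope a for x < 0 (a is learned in practice)
    return np.where(x > 0, x, a * x)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
f = rng.standard_normal(10_000)  # zero-mean stand-in for the residual branch

# Residual sum without an activation before the summation:
# with a zero-mean branch, the mean of the output stays near the mean of x.
plain = x + f

# Residual sum with PReLU applied to the branch before the summation:
# PReLU shrinks the negative half of f, so the branch's mean becomes positive
# and the summed output drifts positive.
biased = x + prelu(f)

print(f"mean of plain residual sum: {plain.mean():+.3f}")
print(f"mean of PReLU residual sum: {biased.mean():+.3f}")
```

For a standard-normal branch with slope 0.25, the PReLU'd branch has mean roughly 0.3, so each such summation shifts the output positive by about that much.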
Hi, thank you for your interest. We directly use the CFP module code from our previous work, which was a multi-class semantic segmentation task. In that setting, PReLU performed better than ReLU.
The output of the partial decoder (https://github.com/AngeLouCN/CaraNet/blob/main/lib/partial_decoder.py#L30) has a linear activation, while the output of each of the axial attention modules, which are designed for residual learning, goes through a BN-PReLU (https://github.com/AngeLouCN/CaraNet/blob/main/CaraNet.py#L47).
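The output path described above can be sketched with numpy stand-ins (the real modules are PyTorch layers; the `prelu` slope, the simplified `batchnorm`, and the random feature maps here are toy assumptions, not the actual CaraNet modules):

```python
import numpy as np

def prelu(x, a=0.25):
    # PReLU with a fixed slope; the real layer learns this slope
    return np.where(x > 0, x, a * x)

def batchnorm(x, eps=1e-5):
    # Per-feature normalization; a stand-in for nn.BatchNorm2d
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n, d = 64, 8  # toy batch of 64 "pixels" with 8 features

decoder_out = rng.standard_normal((n, d))           # linear, no activation
branches = [rng.standard_normal((n, d)) for _ in range(4)]

# Each axial-attention branch passes through BN-PReLU before the sum,
# so every branch contributes a positively biased term to the logits.
total = decoder_out + sum(prelu(batchnorm(b)) for b in branches)

probs = sigmoid(total)
print(f"mean pre-sigmoid logit: {total.mean():+.3f}")
print(f"mean probability:       {probs.mean():.3f}")
```

With four BN-PReLU branches each contributing a positive-mean term, the summed logits sit well above zero, which is the saturation pressure the question is pointing at.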