The output (decoder + axial transformer 1, 2, 3, 4) then gets a sigmoid activation to generate class probabilities.
Why modify the original linear output (from the partial decoder) with a nonlinear function that biases positive (PReLU)? Doesn't this make it more likely to saturate the sigmoid by feeding it a large input, or at the very least lead to exploding biases in the partial decoder?
My understanding is that residual learning is commonly done with no activation function prior to the summation, precisely to avoid exploding biases (continually adding a positive value to a positive value).
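The concern about a positively biased branch can be illustrated with a minimal numpy sketch (the fixed PReLU slope and the zero-mean random stand-in for the residual branch are assumptions for illustration, not CaraNet's actual values):

```python
import numpy as np

def prelu(x, a=0.25):
    # PReLU: identity for x > 0, slope a for x < 0 (a is learned in practice)
    return np.where(x > 0, x, a * x)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
f = rng.standard_normal(10_000)  # zero-mean stand-in for the residual branch

# Residual sum without an activation before the summation:
# with a zero-mean branch, the mean of the output stays near the mean of x.
plain = x + f

# Residual sum with PReLU applied to the branch before the summation:
# PReLU shrinks the negative half of f, so the branch's mean becomes positive
# and the summed output drifts positive.
biased = x + prelu(f)

print(f"mean of plain residual sum: {plain.mean():+.3f}")
print(f"mean of PReLU residual sum: {biased.mean():+.3f}")
```

For a standard-normal branch with slope 0.25, the PReLU'd branch has mean roughly 0.3, so each such summation shifts the output positive by about that much.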
Hi, thank you for your interest. We directly use the CFP module code from our previous work, which was a multi-class semantic segmentation task. In that setting, PReLU performed better than ReLU.
The output of the partial decoder (https://github.com/AngeLouCN/CaraNet/blob/main/lib/partial_decoder.py#L30) has a linear activation, while the output of each of the axial attention modules, which are designed for residual learning, goes through a BN-PReLU (https://github.com/AngeLouCN/CaraNet/blob/main/CaraNet.py#L47).
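The output path described above can be sketched with numpy stand-ins (the real modules are PyTorch layers; the `prelu` slope, the simplified `batchnorm`, and the random feature maps here are toy assumptions, not the actual CaraNet modules):

```python
import numpy as np

def prelu(x, a=0.25):
    # PReLU with a fixed slope; the real layer learns this slope
    return np.where(x > 0, x, a * x)

def batchnorm(x, eps=1e-5):
    # Per-feature normalization; a stand-in for nn.BatchNorm2d
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n, d = 64, 8  # toy batch of 64 "pixels" with 8 features

decoder_out = rng.standard_normal((n, d))           # linear, no activation
branches = [rng.standard_normal((n, d)) for _ in range(4)]

# Each axial-attention branch passes through BN-PReLU before the sum,
# so every branch contributes a positively biased term to the logits.
total = decoder_out + sum(prelu(batchnorm(b)) for b in branches)

probs = sigmoid(total)
print(f"mean pre-sigmoid logit: {total.mean():+.3f}")
print(f"mean probability:       {probs.mean():.3f}")
```

With four BN-PReLU branches each contributing a positive-mean term, the summed logits sit well above zero, which is the saturation pressure the question is pointing at.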