The paper says that the loss layer is combined with the sigmoid computation, not softmax. More specifically, this passage:
"Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability."
So isn't the author saying that we should use a sigmoid activation on the last layer? Using softmax instead might lead to lower accuracy.
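For context, here is a minimal sketch of what "combining the sigmoid operation with the loss computation" can look like, in plain Python. The function names are hypothetical and come from neither the paper nor any library. The fused version uses the common rewrite of binary cross-entropy in terms of the raw logit, which is an assumption about what the authors mean, not their confirmed implementation.

```python
import math

def sigmoid(x):
    # Branch on the sign so math.exp never sees a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bce_naive(x, y):
    # Two-step version: compute p = sigmoid(x), then the cross-entropy from p.
    # For large |x|, p rounds to exactly 0.0 or 1.0 and a log(0) can appear.
    p = sigmoid(x)
    try:
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))
    except ValueError:
        # math.log(0.0) raises ValueError; report the loss as infinite.
        return math.inf

def bce_with_logits(x, y):
    # Fused version: same loss written directly in terms of the logit x,
    #   max(x, 0) - x*y + log(1 + exp(-|x|)),
    # which never takes the log of a value rounded to zero.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

if __name__ == "__main__":
    for x, y in [(0.5, 1), (40.0, 0), (-800.0, 1)]:
        print(x, y, bce_naive(x, y), bce_with_logits(x, y))
```

For moderate logits the two functions agree; for a confident wrong prediction such as logit 40 with label 0, the two-step version breaks down while the fused one returns a finite loss. This only illustrates the numerical stability remark, it does not by itself settle the sigmoid versus softmax question.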