Hi team, in `fully_fused_mlp.cu`, the following is hard for me to understand:
```cpp
// If the output width is larger than 16 dims, we use cutlass to backpropagate through the last layer
// rather than fusing it with our kernel.
if (m_output_width > 16) {
	fc_multiply<FullLayer>(stream, output_weight_matrix(use_inference_params).transposed(), tmp_dL_doutput, forward.hidden.at(tmp_idx), backward_tmp.at(backward_tmp_idx), m_activation, true);
}
```
I suppose it's computing: forward.hidden.at(output_layer) = output_weight_matrix.T * tmp_dL_doutput. For a 2-hidden-layer MLP, that step should look something like:
$$ \delta^{l_2} = \frac{\partial L}{\partial a^{l_2}} = \left( (W^{l_3})^T \delta^{l_3} \right) \odot \mathrm{ReLU}'(a^{l_2}) $$
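To make my expectation concrete, here is a toy CUDA sketch (my own illustration, not tiny-cuda-nn's actual epilogue; the kernel name and parameters are made up) of what I think should happen after the matrix product:

```cpp
// Toy illustration of the expected epilogue: after the matrix product
// G = W^T * dL_doutput, each element should be scaled by the ReLU
// *derivative* evaluated at the pre-activations saved in the forward pass.
__global__ void relu_backward_epilogue(
	const float* __restrict__ matmul_result,  // element of (W^T * dL/doutput)
	const float* __restrict__ forward_values, // pre-activations a^{l2} from the forward pass
	float* __restrict__ dL_dprev,             // resulting delta for the previous layer
	int n
) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		// ReLU'(a) = 1 if a > 0 else 0, i.e. the derivative, not ReLU itself.
		dL_dprev[i] = forward_values[i] > 0.0f ? matmul_result[i] : 0.0f;
	}
}
```

In other words, I would expect the epilogue to multiply by ReLU'(a), not to apply ReLU(a) again.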
So for this `fc_multiply` call, shouldn't the epilogue be the derivative of ReLU (the activation function), rather than ReLU itself?

Thanks for the guidance,
ZJ