Hi team, in `fully_fused_mlp.cu`, the following is hard for me to understand:
```cpp
// If the output width is larger than 16 dims, we use cutlass to backpropagate through the last layer
// rather than fusing it with our kernel.
if (m_output_width > 16) {
	fc_multiply<FullLayer>(stream, output_weight_matrix(use_inference_params).transposed(), tmp_dL_doutput, forward.hidden.at(tmp_idx), backward_tmp.at(backward_tmp_idx), m_activation, true);
}
```
I suppose it's computing: forward.hidden.at(output_layer) = output_weight_matrix.T * tmp_dL_doutput. For a 2-hidden-layer MLP, that step should look something like:
$$ \delta^{l_2} = \frac{\partial L}{\partial a^{l_2}} = \left( (W^{l_3})^T \delta^{l_3} \right) \odot \mathrm{ReLU}'(a^{l_2}) $$
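To make my expectation concrete, here is a toy CUDA sketch (my own illustration, not tiny-cuda-nn's actual epilogue; the kernel name and parameters are made up) of what I think should happen after the matrix product:

```cpp
// Toy illustration of the expected epilogue: after the matrix product
// G = W^T * dL_doutput, each element should be scaled by the ReLU
// *derivative* evaluated at the pre-activations saved in the forward pass.
__global__ void relu_backward_epilogue(
	const float* __restrict__ matmul_result,  // element of (W^T * dL/doutput)
	const float* __restrict__ forward_values, // pre-activations a^{l2} from the forward pass
	float* __restrict__ dL_dprev,             // resulting delta for the previous layer
	int n
) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		// ReLU'(a) = 1 if a > 0 else 0, i.e. the derivative, not ReLU itself.
		dL_dprev[i] = forward_values[i] > 0.0f ? matmul_result[i] : 0.0f;
	}
}
```

In other words, I would expect the epilogue to multiply by ReLU'(a), not to apply ReLU(a) again.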
So for this `fc_multiply` call, shouldn't the epilogue be the derivative of ReLU (the activation function), rather than ReLU itself?

Thanks for the guidance,
ZJ