marxav opened this issue 4 years ago (status: Open)
Not necessarily @marxav. If the derivative of the cost function with respect to the activation of the output layer already takes `m` into account, i.e. 1) d(cost_fn)/d(activation) = (1/m) * ((1-y)/(1-a) - y/a), then there is no need to divide the parameter gradients by `m` again: the 1/m factor introduced in 1) propagates through backpropagation to all the parameters and gradients.
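A minimal sketch of this equivalence (the variable names are illustrative, not taken from the repo): for logistic regression with binary cross-entropy, folding 1/m into d(cost)/d(activation) yields exactly the same parameter gradients as accumulating un-normalised gradients and dividing by m at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
X = rng.normal(size=(m, 3))                      # m examples, 3 features
y = rng.integers(0, 2, size=(m, 1)).astype(float)
W = rng.normal(size=(3, 1))
b = np.zeros((1, 1))

a = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # sigmoid activation

# Option 1: put 1/m in the cost derivative and let it propagate.
dJ_da = (1.0 / m) * ((1 - y) / (1 - a) - y / a)
da_dz = a * (1 - a)                               # sigmoid derivative
dz = dJ_da * da_dz                                # simplifies to (a - y) / m
dW1 = X.T @ dz
db1 = dz.sum(axis=0, keepdims=True)

# Option 2: keep raw gradients, divide by m only at the end.
dz2 = a - y
dW2 = (X.T @ dz2) / m
db2 = dz2.sum(axis=0, keepdims=True) / m

print(np.allclose(dW1, dW2), np.allclose(db1, db2))  # → True True
```

Algebraically, dJ_da * da_dz = (1/m) * ((1-y)*a - y*(1-a)) = (a - y)/m, which is why the two options agree to floating-point precision.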
Thank you for this wonderful example, which helped me understand the gradient descent implementation. I just noticed a minor mistake:
should be:
In addition:
should also be:
Otherwise, the code will not work if one wants to extend it, for instance to implement a regression use case instead of a classification use case (i.e. "none" instead of "softmax" in the final layer, plus short-circuiting the final activation function in the code).
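A sketch of what such an extension could look like (the function and option names here are assumptions, not the repo's actual API): a final layer whose activation is "softmax" for classification or "none" for regression, where the backward pass short-circuits the activation derivative in the "none" case.

```python
import numpy as np

def forward_final(z, activation="softmax"):
    """Forward pass through a hypothetical final layer."""
    if activation == "softmax":
        e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable softmax
        return e / e.sum(axis=1, keepdims=True)
    elif activation == "none":
        return z                                      # identity: raw outputs for regression
    raise ValueError(f"unknown activation: {activation}")

def backward_final(a, y):
    """Gradient of the cost w.r.t. the pre-activation z.

    For softmax + cross-entropy, and for identity + mean-squared error
    J = (1/(2m)) * sum((a - y)**2), the gradient collapses to (a - y)/m
    in both cases, so the "none" path simply skips the softmax derivative.
    """
    m = y.shape[0]
    return (a - y) / m

z = np.array([[1.0, 2.0], [0.5, -0.5]])
y = np.array([[0.0, 1.0], [1.0, 0.0]])
a_reg = forward_final(z, activation="none")           # regression: a == z
print(np.allclose(backward_final(a_reg, y), (z - y) / 2))  # → True
```

The point of the sketch is that the same (a - y)/m gradient applies to both heads, so only the forward activation needs to branch.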