I'm currently working on enabling MNIST training for backends other than CPU. As part of that I'm adding CUDA support for cross entropy loss, which I'm spinning out into a separate PR since I think it will make reviewing easier. This PR makes the following changes:
Refactor the cross entropy CPU code to optimize out the logarithm in $\log(\exp(\mathrm{logit}_i - \mathrm{max}))$, which simplifies to $\mathrm{logit}_i - \mathrm{max}$. Because the log-softmax is then computed directly, this also removes the need for the workaround where the softmax is scaled from $[0, 1]$ to $[\epsilon, 1]$ to avoid taking the logarithm of zero, simplifying the code.
Increase `eps` for the test in `tests/test-grad0` to reduce the impact of machine precision on the numerical gradient calculation, and increase the range of the logits to ensure that cross entropy is sufficiently linear on the scale of `eps`. On master, cross entropy fails for 123/1000 iterations; with the tuned parameters it fails for 0/1000 iterations (for both the code on master and the code in this PR).
Add a CUDA implementation for cross entropy loss. Expose `sum_rows_f32_cuda` so that the code can be reused to combine the partial cross entropy results into a scalar value.