I'm currently working on enabling MNIST training for backends other than CPU. As part of that I'm adding CUDA support for cross entropy loss, which I'm spinning out into a separate PR since I think it will make reviewing easier. This PR makes the following changes:
Refactor the cross entropy CPU code to optimize out the logarithm in $\log(\exp(\mathrm{logit}_i - \mathrm{max}))$, which simplifies to $\mathrm{logit}_i - \mathrm{max}$. Because the log-softmax is then computed directly, this also removes the need for the workaround where the softmax is scaled from $[0, 1]$ to $[\epsilon, 1]$ to avoid taking the logarithm of zero, simplifying the code.
Increase `eps` for the test in `tests/test-grad0` to reduce the impact of machine precision on the numerical gradient calculation, and increase the range of the logits to ensure that cross entropy is sufficiently linear on the scale of `eps`. On master, cross entropy fails for 123/1000 iterations; with the tuned parameters it fails for 0/1000 iterations (for both the code on master and the code in this PR).
Add a CUDA implementation for cross entropy loss. Expose `sum_rows_f32_cuda` so that the code can be reused to combine the partial cross entropy results into a scalar value.