ggml/ex: calculate accuracy in graph, adapt MNIST

This PR extends the MNIST example so that the model accuracy is calculated within a GGML graph instead of in user code. This is particularly relevant when shuffling data, since it is then much easier to guarantee that the correct values are being compared. The accuracy is calculated by first taking the ARGMAX (CUDA implementation added) of both the logits and the labels and then counting the number of equal elements (new GGML op COUNT_EQUAL). Because the output is an integer, the result does not depend on the number of threads/CUDA blocks or on the order in which atomic adds are executed. I considered an implementation where the accuracy is calculated from the logits and labels in a single fused op, but I think that would be less reusable while making a negligible difference to overall performance. For training, all relevant statistics can now be obtained by just copying them from the output tensors.
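For reference, the two steps described above can be sketched on the CPU in plain C++. This is only an illustration of the semantics and not the code from this PR: the helper names argmax_rows and count_equal, the row-major layout, the one-hot label encoding, and the integer types are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper: index of the largest value in each row of a
// row-major [n_rows x n_cols] matrix (ties resolve to the lowest index).
std::vector<int32_t> argmax_rows(const std::vector<float> & data, size_t n_rows, size_t n_cols) {
    assert(n_cols > 0 && data.size() == n_rows * n_cols);
    std::vector<int32_t> out(n_rows);
    for (size_t r = 0; r < n_rows; ++r) {
        size_t best = 0;
        for (size_t c = 1; c < n_cols; ++c) {
            if (data[r * n_cols + c] > data[r * n_cols + best]) {
                best = c;
            }
        }
        out[r] = static_cast<int32_t>(best);
    }
    return out;
}

// Hypothetical reference helper: number of positions at which a and b agree.
// The sum is an integer, so the order of the additions cannot change it.
int64_t count_equal(const std::vector<int32_t> & a, const std::vector<int32_t> & b) {
    assert(a.size() == b.size());
    int64_t n = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        n += (a[i] == b[i]) ? 1 : 0;
    }
    return n;
}
```

With one-hot labels, the accuracy of a batch is then count_equal(argmax_rows(logits, n, k), argmax_rows(labels, n, k)) divided by n. Keeping the two steps separate mirrors the reusability argument above: each helper is useful on its own.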

I added a warp_reduce_sum(int) function for CUDA. This makes calls with a half argument ambiguous, so explicit casts to float have become necessary in some places.
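The ambiguity can be reproduced in plain C++ without CUDA. The sketch below is an assumption-laden illustration: fake_half is a hypothetical stand-in for a type that converts implicitly to both float and int, and reduce_sum stands in for the two overloads without performing any actual warp reduction.

```cpp
#include <cassert>

// Hypothetical stand-in for a half type that is implicitly convertible
// to both float and int.
struct fake_half {
    float value;
    operator float() const { return value; }
    operator int() const { return static_cast<int>(value); }
};

// Stand-ins for the float and int overloads; a real warp reduction would
// combine the values held by all lanes of a warp.
inline float reduce_sum(float x) { return x; }
inline int reduce_sum(int x) { return x; }

inline float reduce_half(fake_half h) {
    // reduce_sum(h) would fail to compile here: both overloads require one
    // user-defined conversion, so neither is a better match.
    // The explicit cast selects the float overload unambiguously.
    return reduce_sum(static_cast<float>(h));
}
```

With only the float overload present, the implicit conversion is unique and the call compiles without a cast, which is why the casts only became necessary once the int overload existed.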

ggerganov / ggml

ggml/ex: calculate accuracy in graph, adapt MNIST #980