This PR adds an explicit datatype for representing a dataset to the MNIST example. The dataset also supports shuffling the data. This works by separating the data into shards and creating a vector of integers that holds a permutation of those shards. The dataset is shuffled by shuffling the permutation, which then dictates which shards are retrieved with `get_batch`. With a shard size of 1 this requires many more individual data transfers per batch, making the CUDA training ~10x slower compared to transferring the entire batch in one contiguous block. With a shard size of 10 the CUDA training is still ~2x slower, but this is largely compensated by faster convergence per epoch. For actual use the overhead from shuffling is probably negligible, especially once asynchronous data loading is implemented.
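A minimal sketch of the shard/permutation idea, assuming a flat in-memory float dataset; the struct and member names (`mnist_dataset`, `shuffle`, `get_batch`) are illustrative and not necessarily the PR's actual API:

```cpp
#include <algorithm>
#include <cstring>
#include <random>
#include <vector>

// Illustrative dataset with shard-based shuffling (names are hypothetical).
struct mnist_dataset {
    std::vector<float>  data;        // flat storage, nex examples of ne floats each
    size_t              ne;          // floats per example
    size_t              shard_size;  // examples per shard
    std::vector<size_t> permutation; // shard indices, shuffled in place

    mnist_dataset(size_t nex, size_t ne, size_t shard_size)
        : data(nex * ne), ne(ne), shard_size(shard_size), permutation(nex / shard_size) {
        for (size_t i = 0; i < permutation.size(); ++i) {
            permutation[i] = i;
        }
    }

    // Shuffling only permutes the shard order; the underlying data is untouched.
    void shuffle(std::mt19937 & rng) {
        std::shuffle(permutation.begin(), permutation.end(), rng);
    }

    // Copy the shards of one batch into a contiguous buffer.
    // With shard_size == 1 this is one copy per example, hence the transfer overhead.
    void get_batch(float * dst, size_t ibatch, size_t shards_per_batch) const {
        for (size_t i = 0; i < shards_per_batch; ++i) {
            const size_t ishard = permutation[ibatch*shards_per_batch + i];
            std::memcpy(dst + i*shard_size*ne,
                        data.data() + ishard*shard_size*ne,
                        shard_size*ne*sizeof(float));
        }
    }
};
```

Larger shards trade shuffling granularity for fewer, larger copies per batch, which is why shard size 10 recovers most of the lost CUDA throughput.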
My next goals will be to add dataset support to GGML, create a more high-level API, add `ggml_backend_sched` support, and add more tests. I think after that it would make sense to start looking into actual applications such as finetuning in llama.cpp.