In particular, the implementation should be general enough that we can shard a large dataset across several machines, compute the partial gradient on each machine, and combine these partial gradients before making a leapfrog step, as in https://arxiv.org/pdf/2104.14421.pdf.
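A minimal sketch, assuming JAX, of what this could look like: each device holds one shard of the data, computes the gradient of its shard's log-density, and `jax.lax.psum` combines the partial gradients across devices before the leapfrog update. The model, function names, and shapes below are placeholders for illustration, not part of any existing implementation.

```python
import jax
import jax.numpy as jnp


def logdensity_per_datum(theta, x):
    # Hypothetical per-datum log-density; any differentiable function works here.
    return -0.5 * jnp.sum((x - theta) ** 2)


def sharded_grad(theta, local_shard):
    """Gradient over the local shard, summed across all devices."""

    def local_logdensity(theta):
        return jnp.sum(jax.vmap(lambda x: logdensity_per_datum(theta, x))(local_shard))

    local_grad = jax.grad(local_logdensity)(theta)
    # Combine the partial gradients computed on every device/machine.
    return jax.lax.psum(local_grad, axis_name="shard")


def leapfrog_step(theta, momentum, local_shard, step_size=1e-2):
    """One leapfrog step driven by the combined (full-data) gradient."""
    momentum = momentum + 0.5 * step_size * sharded_grad(theta, local_shard)
    theta = theta + step_size * momentum
    momentum = momentum + 0.5 * step_size * sharded_grad(theta, local_shard)
    return theta, momentum


# Shard a toy dataset across the available devices and run one step.
n_devices = jax.device_count()
data = jnp.arange(n_devices * 8, dtype=jnp.float32).reshape(n_devices, 8, 1)
theta0 = jnp.zeros(1)
momentum0 = jnp.ones(1)

step = jax.pmap(leapfrog_step, axis_name="shard", in_axes=(None, None, 0))
theta, momentum = step(theta0, momentum0, data)
```

The same pattern should carry over to a multi-host setup, where `psum` reduces across machines rather than local devices, so the leapfrog integrator itself never needs to know how the data is partitioned.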