getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License
1.03k stars · 65 forks

Multi node multi gpu example #264

Closed — parthe closed this issue 10 months ago

parthe commented 1 year ago

I am running an application on a Slurm cluster with 4 GPUs per node. Currently, I can utilize 4 GPUs easily, but I want to scale up to larger kernel matrices. In theory this seems doable, since I can get up to 10 nodes with 4 GPUs each at a time. However, I don't know how to set up KeOps to do this.

Could you share example code where KeOps uses n nodes with m GPUs each to store a large kernel matrix and to transfer arrays across the GPUs associated with different nodes?

jeanfeydy commented 1 year ago

Hi @parthe,

That's an interesting question! To be honest, we don't have much experience with multi-node computations: all the tasks that @joanglaunes, @bcharlier or myself are interested in are embarrassingly parallel over e.g. training samples in a batch, so we never investigated communications between nodes.

As a consequence, KeOps does not handle multi-GPU or multi-node parallelism natively: in order to process an extremely large kernel matrix, you will have to cut the input datasets into "slices" by hand.
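To illustrate the "slices" idea, here is a minimal NumPy sketch. A dense Gaussian kernel block stands in for a KeOps `LazyTensor` reduction, and the loop over row blocks stands in for dispatching each block to a different GPU or node; the function name, block size, and kernel choice are all illustrative, not KeOps API.

```python
import numpy as np

def kernel_matvec_sliced(x, y, b, block_size):
    """Compute exp(-|x_i - y_j|^2) @ b without ever materializing the
    full (N, M) kernel matrix: rows are processed block by block, which
    is exactly the slicing one would do by hand across GPUs or nodes."""
    out = np.empty((x.shape[0], b.shape[1]))
    for start in range(0, x.shape[0], block_size):
        xb = x[start:start + block_size]                        # (B, D) query slice
        sqd = ((xb[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # (B, M) squared distances
        out[start:start + block_size] = np.exp(-sqd) @ b        # Gaussian block @ b
    return out

rng = np.random.default_rng(0)
x, y = rng.standard_normal((100, 3)), rng.standard_normal((80, 3))
b = rng.standard_normal((80, 2))

full = np.exp(-((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)) @ b
sliced = kernel_matvec_sliced(x, y, b, block_size=32)
assert np.allclose(full, sliced)  # block-wise result matches the dense one
```

In a real setup, each block would live on its own device, and only the (much smaller) block results need to be gathered.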

I believe that the PyTorch DistributedDataParallel module addresses this issue. Good documentation seems to be provided by the PyTorch distributed overview as well as the documentation of the Jean Zay cluster (that we use to render the KeOps website).
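Independently of the `torch.distributed` machinery, the bookkeeping behind such a setup — deciding which contiguous row range of the kernel matrix each worker owns — can be sketched in plain Python. The helper name `rank_slice` is hypothetical, not a KeOps or PyTorch API.

```python
def rank_slice(n_rows, world_size, rank):
    """Split n_rows as evenly as possible across world_size workers and
    return the (start, stop) row range owned by `rank`. The first
    `n_rows % world_size` ranks each take one extra row."""
    base, extra = divmod(n_rows, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# e.g. 10 nodes x 4 GPUs = 40 workers over 1_000_003 rows:
slices = [rank_slice(1_000_003, 40, r) for r in range(40)]
assert slices[0][0] == 0 and slices[-1][1] == 1_000_003
# ranges are contiguous and cover every row exactly once:
assert all(a[1] == b[0] for a, b in zip(slices, slices[1:]))
```

Each rank would then run its usual single-node KeOps reduction on its own slice, with `torch.distributed` handling the cross-node gathers.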

What do you think? If you succeed, or if you encounter a KeOps-specific issue with such a multi-node computation, we'll be happy to hear about it.

Best regards,
Jean

parthe commented 1 year ago

Thanks Jean, I'll go through this in more detail.

Currently we slice the data across the multiple GPUs of a single node; PyTorch lets us do that easily. But most nodes on our cluster have only 4 GPUs, which restricts the level of parallelism we can achieve. We see a linear speed-up of our kernel machine with the number of GPUs, so a natural extension is to move to a multi-node setup.

I will try to see if we can do that easily and integrate KeOps into this infrastructure. If we succeed, I'm happy to share a gist.

Best,
Parthe