cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

Multi Process GPU Training [Feature Request] #2150

Open s769 opened 2 years ago

s769 commented 2 years ago

Is there a way to train on multiple GPUs across multiple processes (i.e. through torch.nn.parallel.DistributedDataParallel)?

jacobrgardner commented 2 years ago

There is support for this via https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/multi_device_kernel.py (see the multi-GPU regression tutorial notebook in examples/02_Scalable_Exact_GPs).

Note that if your kernel matrix is large enough to require checkpointing, you may be better off using KeOps on a single GPU, simply because of the multi-GPU overhead: https://github.com/cornellius-gp/gpytorch/blob/master/examples/02_Scalable_Exact_GPs/KeOps_GP_Regression.ipynb
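For reference, here is a minimal sketch along the lines of that tutorial (`train_x` and `train_y` are assumed to be defined elsewhere, and the RBF base kernel is just an example):

```python
import torch
import gpytorch

output_device = torch.device("cuda:0")
n_devices = torch.cuda.device_count()

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # MultiDeviceKernel splits the kernel computation across the listed
        # GPUs and gathers the result on output_device.
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar, device_ids=range(n_devices), output_device=output_device
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
model = ExactGPModel(train_x, train_y, likelihood).to(output_device)
```

For the KeOps route, the single-GPU model looks the same except that the base kernel comes from gpytorch.kernels.keops (e.g. gpytorch.kernels.keops.RBFKernel).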

jacobrgardner commented 2 years ago

Oh, I see -- I missed the "multiple processes" bit, my bad! That's currently not supported. It might, however, be possible to extend DistributedDataParallel in the same way that MultiDeviceKernel extends DataParallel.
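To make that concrete, here is a toy illustration (not GPyTorch code) of the scatter/compute/gather pattern that MultiDeviceKernel automates via DataParallel:

```python
import torch

def rbf_kernel(x1, x2, lengthscale=1.0):
    # Plain RBF kernel matrix between the rows of x1 and x2.
    sq_dist = torch.cdist(x1, x2).pow(2)
    return torch.exp(-0.5 * sq_dist / lengthscale**2)

def multi_device_kernel(x1, x2, device_ids, output_device):
    # Split x1's rows into one chunk per GPU.
    chunks = torch.chunk(x1, len(device_ids), dim=0)
    blocks = []
    for chunk, dev in zip(chunks, device_ids):
        # Each device computes its block of kernel rows against all of x2.
        blocks.append(rbf_kernel(chunk.to(dev), x2.to(dev)))
    # Gather the row blocks on the output device.
    return torch.cat([b.to(output_device) for b in blocks], dim=0)
```

In a DistributedDataParallel setting each chunk would live in its own process, so the final torch.cat would have to become an explicit torch.distributed gather, which is where the lazy-tensor plumbing gets tricky.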

gpleiss commented 2 years ago

@s769 we'd be open to a PR, if you'd be willing to implement this!

XiankangTang commented 1 year ago

I tried to run the code from the tutorial but got an error: the output covariance matrix ends up spread across multiple GPUs instead of on output_device.

nikitrian commented 1 year ago

> I tried to run the code from the tutorial but got an error: the output covariance matrix ends up spread across multiple GPUs instead of on output_device.

I get the same error. Did you manage to fix it?

XiankangTang commented 12 months ago

> > I tried to run the code from the tutorial but got an error: the output covariance matrix ends up spread across multiple GPUs instead of on output_device.
>
> I get the same error. Did you manage to fix it?

No, I gave up. The error is that a lazy tensor spread across multiple GPUs cannot be regrouped onto the output device. My workaround was to disassemble the image into pieces, run Gaussian process regression on each piece, and finally assemble the results on the output device.
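For anyone wanting to replicate that workaround, a hypothetical sketch (make_gp is a placeholder for your own model constructor; independent per-piece GPs ignore cross-piece correlations, so this approximates rather than reproduces one large GP):

```python
import torch
import gpytorch

def piecewise_gp_predict(pieces, devices, output_device):
    # `pieces` is a list of (train_x, train_y, test_x) triples, one per chunk
    # of the disassembled image; each chunk gets its own GP on its own GPU.
    outputs = []
    for (train_x, train_y, test_x), dev in zip(pieces, devices):
        model, likelihood = make_gp(train_x.to(dev), train_y.to(dev))  # hypothetical helper
        model.eval()
        likelihood.eval()
        with torch.no_grad(), gpytorch.settings.fast_pred_var():
            pred = likelihood(model(test_x.to(dev))).mean
        # Move each piece's predictions to the output device.
        outputs.append(pred.to(output_device))
    # Reassemble the pieces on the output device.
    return torch.cat(outputs, dim=0)
```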

JoachimSchaeffer commented 9 months ago

Are there any updates on why the tutorial notebook fails?

XiankangTang commented 9 months ago

No, I have not made any progress.

semantic0 commented 1 month ago

I just tried to run the CIFAR DKL example on multiple GPUs (AMD MI300X) and I am getting the same issue as above, though it is pretty fast to train on a single GPU.