coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

Multi-GPU Support #595

Open jafioti opened 1 year ago

jafioti commented 1 year ago

Scaling models requires that they be trained in data-parallel, pipeline-parallel, or tensor-parallel regimes. The last two, both forms of "model parallel", require a single model to be split across GPUs. That seems more challenging right now, whereas data parallelism looks like a much more approachable challenge.

To enable data parallelism, tensors need to be able to move to individual Cuda devices, so the Cuda device type will likely need to change to Cuda<const N: usize>. Gradients from the other devices also need to be moved back to one device to be averaged, and then either sent back to the other devices so each replica can apply them, or applied on one device with the updated model copied back to the other devices.
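For concreteness, a minimal host-side sketch of the averaging step, with plain Vec<f32>s standing in for gradient buffers that have already been copied back to one primary device (none of these names are existing dfdx API):

/// Toy gradient buffers: each Vec<f32> stands in for one replica's gradients
/// after they have been copied back to a single "primary" device.
fn average_gradients(per_device_grads: &[Vec<f32>]) -> Vec<f32> {
    let n_devices = per_device_grads.len() as f32;
    let len = per_device_grads[0].len();
    let mut averaged = vec![0.0f32; len];
    for grads in per_device_grads {
        assert_eq!(grads.len(), len, "replicas must have identical parameter counts");
        for (acc, g) in averaged.iter_mut().zip(grads) {
            *acc += g / n_devices;
        }
    }
    averaged
}

fn main() {
    // Gradients produced by two replicas on two devices.
    let grads_gpu0 = vec![0.25f32, 1.0];
    let grads_gpu1 = vec![0.75f32, 3.0];
    // Average on the primary device; the result would then be broadcast back,
    // or applied once and the updated weights copied out.
    let averaged = average_gradients(&[grads_gpu0, grads_gpu1]);
    assert_eq!(averaged, vec![0.5, 2.0]);
}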

coreylowman commented 1 year ago

What was the reasoning for the const generic ordinal? I think we can probably get away with it being a runtime value, and then the tensor ops can just check whether the operands are on different devices.

Another question is how do we decide which device a tensor should go on?

  1. Do we have to track free memory per device?
  2. For binary ops, if both input tensors are on different devices, do we pick one of the two, or maybe even a third one?

Also, thoughts on the frontend interface? Maybe something like:

struct Distributed<D> {
    options: Vec<D>,
}

// default would use all available
let dev: Distributed<Cuda> = Default::default(); 

We could then handle device copying inside the kernel implementations of Distributed<D>.
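As a rough illustration of the "copy inside the op" idea, a toy sketch of a binary op that reconciles operands living on different devices (DeviceId, TensorOnDevice, and copy_to are made-up placeholders, not dfdx types):

/// Hypothetical runtime device ordinal (the alternative to Cuda<const N: usize>).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct DeviceId(usize);

/// Hypothetical tensor that remembers which device its data lives on.
#[derive(Clone, Debug)]
struct TensorOnDevice {
    device: DeviceId,
    data: Vec<f32>,
}

impl TensorOnDevice {
    /// Stand-in for a device-to-device copy.
    fn copy_to(&self, device: DeviceId) -> TensorOnDevice {
        TensorOnDevice { device, data: self.data.clone() }
    }
}

/// Binary op: if the operands disagree on device, move `rhs` to `lhs`'s device.
fn add(lhs: &TensorOnDevice, rhs: &TensorOnDevice) -> TensorOnDevice {
    let rhs = if rhs.device == lhs.device { rhs.clone() } else { rhs.copy_to(lhs.device) };
    let data = lhs.data.iter().zip(&rhs.data).map(|(a, b)| a + b).collect();
    TensorOnDevice { device: lhs.device, data }
}

fn main() {
    let a = TensorOnDevice { device: DeviceId(0), data: vec![1.0, 2.0] };
    let b = TensorOnDevice { device: DeviceId(1), data: vec![3.0, 4.0] };
    let c = add(&a, &b); // b is implicitly copied to device 0
    assert_eq!(c.device, DeviceId(0));
    assert_eq!(c.data, vec![4.0, 6.0]);
}

Defaulting to the first operand's device is just one possible policy; tracking free memory per device (question 1 above) could drive the choice instead.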

jafioti commented 1 year ago

As for the const generic, there was no particular reason other than that Cuda is a type, and so Cuda:0 could also be a type. But I see no issue with runtime values.

I was thinking the simplest way (far from ergonomic) would be to make multiple models, one on each device, and then manually send the input tensors to each device. For data parallelism, this means you would be responsible for dividing up the inputs, distributing them, and running them through. That way, any operation on tensors that live on two different devices would fail to compile (this is where const device numbers help).
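A toy sketch of what that could look like with const device ordinals (Cuda<N> and Tensor here are placeholder types, not dfdx's real ones); mixing devices in an op simply does not type-check:

use std::marker::PhantomData;

/// Toy device type with a const ordinal, standing in for the proposed Cuda<const N: usize>.
struct Cuda<const N: usize>;

/// Toy tensor tied to the device it was allocated on.
struct Tensor<const N: usize> {
    data: Vec<f32>,
    device: PhantomData<Cuda<N>>,
}

/// Ops only accept tensors from the same device, so cross-device use fails to compile.
fn add<const N: usize>(a: &Tensor<N>, b: &Tensor<N>) -> Tensor<N> {
    Tensor {
        data: a.data.iter().zip(&b.data).map(|(x, y)| x + y).collect(),
        device: PhantomData,
    }
}

fn main() {
    let on_gpu0 = Tensor::<0> { data: vec![1.0], device: PhantomData };
    let on_gpu1 = Tensor::<1> { data: vec![2.0], device: PhantomData };
    let _same_device = add(&on_gpu0, &on_gpu0); // compiles
    // add(&on_gpu0, &on_gpu1); // does not compile: mismatched const generic `N`
    let _ = on_gpu1;
}

The commented line is the whole point: the device mismatch is caught at compile time rather than at runtime.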

However, this is a really inflexible way to go, so it might make sense to build abstractions on top of it.

coreylowman commented 1 year ago

Ah yeah, that would be torch.nn.DataParallel, right?

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension (other objects will be copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.

vs torch.nn.parallel.DistributedDataParallel

jafioti commented 1 year ago

Both of them do data parallelism in the way I was talking about: they distribute the data, replicate the model on each device, and then aggregate gradients. I think the only difference between DataParallel and DistributedDataParallel is that DataParallel uses threads whereas DistributedDataParallel uses multiprocessing. It is stated that:

The difference between DistributedDataParallel and DataParallel is: DistributedDataParallel uses multiprocessing where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, this avoids the performance overhead caused by GIL of Python interpreter.

Obviously in Rust there is no GIL, so there is no reason to use multiprocessing.
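For example, a plain std::thread sketch of one replica per thread, each handling its own chunk of the batch (the "replica" here is just a function, not a real model or device):

use std::thread;

/// Toy "replica": in a real setup this would be a model copy pinned to one device.
fn replica_forward(chunk: &[f32]) -> Vec<f32> {
    chunk.iter().map(|x| x + 1.0).collect()
}

fn main() {
    let batch = vec![1.0f32, 2.0, 3.0, 4.0];

    // One scoped thread per replica; plain threads are all that's needed since
    // there is no interpreter lock to work around in Rust.
    let outputs: Vec<Vec<f32>> = thread::scope(|s| {
        let handles: Vec<_> = batch
            .chunks(2)
            .map(|chunk| s.spawn(move || replica_forward(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    assert_eq!(outputs, vec![vec![2.0, 3.0], vec![4.0, 5.0]]);
}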

coreylowman commented 1 year ago

Oh, for some reason I thought they were more different than that; good to know. I hope we never have to launch with multiple processes 🤞

Just adding more resources to discussion on model parallel: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

It seems like data parallelism more easily utilizes all resources (as mentioned in the model parallel tutorial, pipelining can help, but it is hard to tune correctly).

DistributedDataParallel distributes batches across different devices, right? A Distributed wrapper around modules could actually be useful here:

struct Distributed<M, D> {
    pub models: Vec<M>,
    pub devices: Vec<D>
}

and then maybe with some clever Module impls we can just split the batch across the devices?
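Something like this toy sketch, where the trait and types are placeholders rather than dfdx's real Module/device traits:

/// Toy module trait; dfdx's real Module trait is shaped differently.
trait Module {
    fn forward(&self, batch: Vec<f32>) -> Vec<f32>;
}

/// Wrapper holding one replica of the model per device (devices elided here).
struct Distributed<M> {
    replicas: Vec<M>,
}

impl<M: Module> Distributed<M> {
    /// Split the batch into roughly equal chunks, run each chunk through its
    /// replica, and concatenate the outputs in the original order.
    fn forward(&self, batch: Vec<f32>) -> Vec<f32> {
        let n = self.replicas.len();
        let chunk = ((batch.len() + n - 1) / n).max(1);
        batch
            .chunks(chunk)
            .zip(&self.replicas)
            .flat_map(|(part, replica)| replica.forward(part.to_vec()))
            .collect()
    }
}

/// Trivial "model" that scales its inputs.
struct Scale(f32);
impl Module for Scale {
    fn forward(&self, batch: Vec<f32>) -> Vec<f32> {
        batch.into_iter().map(|x| x * self.0).collect()
    }
}

fn main() {
    // Pretend each replica lives on its own device.
    let dp = Distributed { replicas: vec![Scale(2.0), Scale(2.0)] };
    assert_eq!(dp.forward(vec![1.0, 2.0, 3.0, 4.0]), vec![2.0, 4.0, 6.0, 8.0]);
}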

jafioti commented 1 year ago

A really good article by Microsoft demonstrates the complexity of pipeline parallelism: https://www.microsoft.com/en-us/research/blog/pipedream-a-more-effective-way-to-train-deep-neural-networks-using-pipeline-parallelism/

This is why I think in the near term it makes sense to target data parallelism, which will sidestep most of that complexity.

coreylowman commented 1 year ago

Another thing I'd like to get with this is a multi-threaded CPU device. Related to #186. This is how we'll make DataParallel<D: Device> make sense across both Cuda and the CPU.

coreylowman commented 1 year ago

Other things:

coreylowman commented 1 year ago

A bit of important info: apparently the CUDA driver has a slow locking mechanism when used from multiple threads. See https://github.com/coreylowman/cudarc/issues/169 for more info. This may force us to do multi-processing.

We should benchmark this before making any decisions.
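If we do, something like the harness below would be a starting point; it is pure std, and the closure is where a real cudarc kernel launch would go (nothing here calls cudarc itself):

use std::thread;
use std::time::{Duration, Instant};

/// Time how long `n_threads` threads take to each run `work` `iters` times.
/// In a real benchmark `work` would be the kernel-launch call; it is left as
/// a generic closure here so this harness compiles on its own.
fn bench<F>(n_threads: usize, iters: usize, work: F) -> Duration
where
    F: Fn() + Sync,
{
    let start = Instant::now();
    thread::scope(|s| {
        for _ in 0..n_threads {
            s.spawn(|| {
                for _ in 0..iters {
                    work();
                }
            });
        }
    });
    start.elapsed()
}

fn main() {
    // Same total amount of (dummy) work, single-threaded vs. spread over 4 threads.
    let single = bench(1, 4_000, || { std::hint::black_box(0u64); });
    let multi = bench(4, 1_000, || { std::hint::black_box(0u64); });
    println!("1 thread: {single:?}, 4 threads: {multi:?}");
}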