facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

How does PyTorch handle the backward pass in a multi-GPU setting? #353

Closed: christindbose closed this issue 11 months ago

christindbose commented 1 year ago

Past work has often proposed remote GPU caching as a performance optimization. For example, if data x originally stored on GPU0 is requested by GPU1, then x is cached in GPU1's L1 or L2 cache (there are pros and cons to caching in L1 versus L2, depending on the workload).
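For context, PyTorch does not expose the GPU's hardware L1/L2 caches to user code; the closest software-level analogue of remote caching is an explicit cross-device copy. A minimal sketch of that analogue and the coherence problem it creates (assuming a machine with at least two CUDA devices; shapes and device indices are just for illustration):

```python
import torch

# Software-level "remote cache": an explicit copy of a tensor from one GPU
# to another. Hardware L1/L2 caching is managed by the GPU itself and is
# not visible or controllable at this level.
x = torch.randn(1024, device="cuda:0")   # data originally on GPU0
x_cached = x.to("cuda:1")                # cached copy on GPU1

# The copy is an independent tensor: mutating x does not update x_cached,
# so coherence between the copies must be maintained manually.
x.add_(1.0)
assert not torch.allclose(x.cpu(), x_cached.cpu())
```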

While applying remote caching is straightforward for inference, getting it to work for training seems more challenging, primarily because we need to maintain coherence across all cached copies after the backward pass. I was curious how PyTorch actually handles this under the hood. I've read that the gradients are calculated based on a computational graph.
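To make the computational-graph point concrete: autograd records cross-device copies as nodes in the graph, so backward() routes each gradient back to the device that owns the corresponding leaf tensor. A minimal sketch (assuming two CUDA devices; the shapes are arbitrary):

```python
import torch

# Leaf parameter lives on GPU0.
w = torch.randn(4, 4, device="cuda:0", requires_grad=True)

y = w.to("cuda:1")      # the copy op is recorded in the autograd graph
loss = (y * y).sum()    # compute happens on GPU1
loss.backward()         # autograd copies the gradient back to GPU0

print(w.grad.device)    # cuda:0 -- the gradient lands with the leaf tensor
```

For data-parallel training specifically, torch.nn.parallel.DistributedDataParallel goes further: it keeps a full replica of the model on each GPU and all-reduces gradients across replicas during backward via autograd hooks, so every replica sees the same synchronized gradient.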

Some questions I have:

Any pointers to find the answers to these questions would be great. Thanks!

mnaumovfb commented 1 year ago

I think this is a more general question that is not specific to DLRM. It will probably be answered more holistically on the PyTorch forums, so I would suggest reposting it there.

christindbose commented 1 year ago

ok, will do. Thanks @mnaumovfb