Rigid constraints on model structures or training paradigms
Remote Execution: rpc.remote()
Remote Reference: RRef, to_here()
Distributed Autograd: with autograd.context() as ctx
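A minimal sketch of how these APIs compose (assuming two RPC workers named `worker0` and `worker1` have already been set up via `rpc.init_rpc`; the toy loss is illustrative):

```python
import torch
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

def run_on_worker0():
    with dist_autograd.context() as ctx:
        t = torch.ones(2, 2, requires_grad=True)
        # Remote execution: immediately returns an RRef to the result on worker1.
        rref = rpc.remote("worker1", torch.add, args=(t, t))
        # to_here() blocks and copies the remote value to the calling process.
        loss = rref.to_here().sum()
        # Backward pass and gradients are scoped to this distributed autograd context.
        dist_autograd.backward(ctx, [loss])
        return dist_autograd.get_gradients(ctx)
```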
Non-Tensor $\rightarrow$ packed into a binary payload and sent over the most performant CPU channel (e.g., TCP, SHM, etc.).
Tensor $\rightarrow$ out-of-band fabrics (NVLink, IB, TCP, etc.)
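A conceptual sketch of that split (purely illustrative Python, not TensorPipe's actual wire format or API): the non-Tensor parts are pickled into one payload for the CPU channel, while the Tensors are handed off as-is for transfer over a device-aware channel.

```python
import pickle
import torch

def split_message(func_name, args):
    tensors, meta_args = [], []
    for a in args:
        if isinstance(a, torch.Tensor):
            # Replace the tensor with a placeholder index; ship the data out of band.
            meta_args.append(("__tensor__", len(tensors)))
            tensors.append(a)
        else:
            meta_args.append(("__value__", a))
    payload = pickle.dumps((func_name, meta_args))  # goes over the CPU channel
    return payload, tensors                         # go over NVLink/IB/TCP/...

def join_message(payload, tensors):
    func_name, meta_args = pickle.loads(payload)
    args = [tensors[v] if kind == "__tensor__" else v for kind, v in meta_args]
    return func_name, args

payload, tensors = split_message("torch.add", (torch.ones(3), 2.0))
print(join_message(payload, tensors))
```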
Decoupling the payload from the Tensor objects lets the framework overlap CPU computations, CUDA computations, CUDA memory allocations, and host-device communications, leading to highly efficient CUDA RPC.
Use persistent pinned staging buffers to enable non-blocking device-to-host (D2H) and host-to-device (H2D) communications.
HS: handshake
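A small sketch of the staging-buffer idea (hypothetical helper names, not TensorPipe's internal code): a persistent pinned host buffer is what makes `non_blocking=True` copies truly asynchronous, so the D2H transfer can overlap with other work.

```python
import torch

def make_staging_buffer(nbytes):
    # Persistent page-locked (pinned) host buffer, allocated once and reused.
    return torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)

def stage_d2h(cuda_tensor, staging, copy_stream):
    flat = cuda_tensor.contiguous().view(torch.uint8).view(-1)
    host_view = staging[: flat.numel()]
    # Make sure the producer stream has finished writing the tensor.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        # non_blocking=True only returns immediately because `staging` is pinned.
        host_view.copy_(flat, non_blocking=True)
    return host_view

if torch.cuda.is_available():
    buf = make_staging_buffer(1 << 20)
    stream = torch.cuda.Stream()
    x = torch.randn(1024, device="cuda")
    view = stage_d2h(x, buf, stream)
    stream.synchronize()  # wait for the D2H copy before reading `view` on the CPU
```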
Background: we need to track the lifetime of activations. In local training, reference counting and garbage collection suffice.
Remote Reference: distributed reference counting and garbage collection
Each RRef has a single owner and an arbitrary number of users. The owner lives on the process that holds the RRef's data and performs bookkeeping for all users. When an RRef is shared across processes, the RPC framework automatically generates control messages to update the reference count on the owner. The protocol distinguishes the cases in {owner, user} $\times$ {caller, callee}.
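A deliberately simplified model of the owner-side bookkeeping (illustrative class and message names only; the real protocol covers more cases, e.g. forks racing with deletes):

```python
class OwnerRRef:
    """Owner-side record: lives on the process that holds the data."""
    def __init__(self, rref_id, value):
        self.rref_id = rref_id
        self.value = value
        self.user_count = 0

class Owner:
    def __init__(self):
        self.rrefs = {}

    def create(self, rref_id, value):
        self.rrefs[rref_id] = OwnerRRef(rref_id, value)

    # Control message received when a user process obtains (forks) the RRef.
    def on_user_fork(self, rref_id):
        self.rrefs[rref_id].user_count += 1

    # Control message received when a user's RRef goes out of scope.
    def on_user_delete(self, rref_id):
        rref = self.rrefs[rref_id]
        rref.user_count -= 1
        if rref.user_count == 0:
            # No remaining users: the owner can garbage-collect the value.
            del self.rrefs[rref_id]

owner = Owner()
owner.create("rref#1", value=[1.0, 2.0, 3.0])
owner.on_user_fork("rref#1")    # a caller shares the RRef with another process
owner.on_user_fork("rref#1")
owner.on_user_delete("rref#1")
owner.on_user_delete("rref#1")  # last user gone -> value is collected
assert "rref#1" not in owner.rrefs
```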
https://proceedings.mlsys.org/paper_files/paper/2023/hash/b95b58ff6d46d4b7ef2b3e2fd0ddb24c-Abstract-mlsys2023.html
https://github.com/pytorch/tensorpipe