chufanchen / read-paper-and-code


MLSys 2023 | PyTorch RPC: Distributed Deep Learning Built on Tensor-Optimized Remote Procedure Calls #120

Closed chufanchen closed 6 months ago

chufanchen commented 6 months ago

https://proceedings.mlsys.org/paper_files/paper/2023/hash/b95b58ff6d46d4b7ef2b3e2fd0ddb24c-Abstract-mlsys2023.html

https://github.com/pytorch/tensorpipe

chufanchen commented 6 months ago

Background

Existing distributed training solutions impose rigid constraints on model structures or training paradigms, motivating a general-purpose, tensor-aware RPC framework.

chufanchen commented 6 months ago

Programming Interface

Remote Execution: rpc.remote()

Remote Reference: RRef, to_here()

Distributed Autograd: with dist_autograd.context() as ctx
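
A minimal usage sketch of this interface; the worker names, tensor shapes, and the init_rpc setup are illustrative assumptions, and error handling/shutdown are omitted:

```python
import torch
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

def make_weight():
    # Executed on the owner ("worker1"); the returned tensor lives there.
    return torch.randn(4, 4, requires_grad=True)

def remote_forward(w_rref, x):
    # Executed on the owner of w_rref; reads the value without copying.
    return x @ w_rref.local_value()

# On worker0, after rpc.init_rpc("worker0", rank=0, world_size=2):
# rpc.remote() returns immediately with an RRef to the result on worker1.
w_rref = rpc.remote("worker1", make_weight)

# to_here() copies the remote value to the calling process when needed.
w_local = w_rref.to_here()

# Distributed autograd tracks gradients across RPC boundaries inside an
# explicit context; gradients are accumulated per context, not in .grad.
with dist_autograd.context() as ctx_id:
    y = rpc.rpc_sync("worker1", remote_forward,
                     args=(w_rref, torch.randn(2, 4)))
    dist_autograd.backward(ctx_id, [y.sum()])
    # Each participating worker can read its gradients via
    # dist_autograd.get_gradients(ctx_id).
```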

Tensor-Aware Communication

Non-Tensor $\rightarrow$ packed into a binary payload and sent over the most performant CPU channel (e.g., TCP, SHM, etc.).

Tensor $\rightarrow$ out-of-band fabrics (NVLink, InfiniBand, TCP, etc.)

  1. the sender passes the binary payload and a list of Tensor objects to the communication layer
  2. the communication layer detects the optimal channel based on the device type of each Tensor
  3. the receiver learns the size and device type of each Tensor and allocates Tensor storage accordingly
  4. the Tensor values are transmitted through the chosen channel
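
A conceptual sketch of this payload/Tensor split in plain Python (not the actual TensorPipe implementation; the descriptor layout and channel names are made up for illustration):

```python
import pickle
import torch

def pack_message(args):
    """Split an argument tuple into (payload_bytes, tensor_list)."""
    tensors = []

    def extract(obj):
        if isinstance(obj, torch.Tensor):
            tensors.append(obj)
            # Replace the tensor with a lightweight descriptor; the receiver
            # uses it to pre-allocate storage of the right size and device.
            return ("__tensor__", len(tensors) - 1, obj.dtype,
                    tuple(obj.shape), obj.device.type)
        if isinstance(obj, (list, tuple)):
            return type(obj)(extract(x) for x in obj)
        return obj

    payload = pickle.dumps(extract(args))
    return payload, tensors

def pick_channel(tensor):
    """Choose a transport per tensor device (illustrative mapping only)."""
    if tensor.device.type == "cuda":
        return "nvlink_or_ib"   # GPU-to-GPU fabric when available
    return "shm_or_tcp"         # CPU tensors go over SHM/TCP

payload, tensors = pack_message((1, "hello", torch.randn(8)))
channels = [pick_channel(t) for t in tensors]
```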

Decoupling the payload from the Tensor objects allows overlapping CPU computations, CUDA computations, CUDA memory allocations, and host-device communications, leading to a highly efficient CUDA RPC.

Use persistent pinned staging buffers to enable non-blocking device-to-host (D2H) and host-to-device (H2D) communications.
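
A minimal sketch of the staging-buffer idea using the public PyTorch API; the buffer size and stream handling are assumptions, and the real logic lives inside the framework's CUDA channels:

```python
import torch

# Allocated once and reused across calls (persistent pinned staging buffer).
STAGING_BYTES = 1 << 20
staging = torch.empty(STAGING_BYTES, dtype=torch.uint8, pin_memory=True)

def stage_d2h(gpu_tensor, stream):
    """Copy a CUDA tensor into the pinned staging buffer without blocking."""
    flat = gpu_tensor.reshape(-1).view(torch.uint8)  # reinterpret as bytes
    dst = staging[: flat.numel()]
    with torch.cuda.stream(stream):
        # Pinned destination + non_blocking=True makes this an async D2H copy,
        # so it can overlap with compute on other streams.
        dst.copy_(flat, non_blocking=True)
    return dst

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    t = torch.randn(256, device="cuda")
    host_view = stage_d2h(t, copy_stream)
    copy_stream.synchronize()  # wait before handing bytes to the network layer
```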

[Figure omitted] HS: handshake

Memory Management

Background: we need to track the lifetime of activations. In local training, reference counting and garbage collection suffice.

Remote Reference: distributed reference counting and garbage collection

Each RRef has a single owner and an arbitrary number of users. The owner lives in the process that holds the RRef's data and performs bookkeeping for all users. When an RRef is shared across processes, the RPC framework automatically generates control messages to update the reference count on the owner. The sharing scenarios fall into the four combinations of {owner, user} $\times$ {caller, callee}.
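
A small sketch of the owner/user roles using the public RPC API (worker names are assumptions); the reference-counting control messages are emitted by the framework, not by user code:

```python
import torch
import torch.distributed.rpc as rpc

def make_state():
    return torch.zeros(10)

def read_state(state_rref):
    # Runs on whichever worker receives the RRef; user code only calls
    # to_here() (or local_value() when running on the owner).
    return state_rref.to_here().sum()

# On worker0: worker1 becomes the *owner* of the RRef's data.
state_rref = rpc.remote("worker1", make_state)

# Forwarding the RRef to worker2 makes worker2 a *user*; the RPC layer
# updates the reference count on the owner behind the scenes, and the data
# is freed once the creator and all users have released their references.
result = rpc.rpc_sync("worker2", read_state, args=(state_rref,))
```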

Distributed Autograd