chufanchen opened this issue 8 months ago
LLM: memory-bound
Tensor parallelism: the gain in aggregate memory bandwidth outweighs the communication overhead
Data parallelism, pipeline parallelism
Diffusion: compute-bound
Only data parallelism has been used for diffusion model serving
Our method introduces a new parallelization strategy called displaced patch parallelism, tailored to the sequential characteristics of diffusion models.
This computational demand escalates more than quadratically with increasing resolution.
$x_t$ depends on $x_{t-1}$, so parallel computation of $\epsilon_t$ and $\epsilon_{t-1}$ is challenging.
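The sequential dependency can be seen in a toy sampling loop. This is only a sketch: `predict_eps` and the update rule are placeholders, not the actual sampler used in the paper.

```python
def denoise_sequentially(predict_eps, x_T, T):
    # Each denoising step needs the latent produced by the previous step,
    # so the chain x_T -> ... -> x_0 cannot be parallelized across t.
    x = x_T
    for t in range(T, 0, -1):
        eps = predict_eps(x, t)  # hypothetical noise predictor (stand-in for the UNet)
        x = x - eps / T          # toy update rule standing in for the real sampler step
    return x
```

Because `eps` at step `t` is a function of the freshly updated `x`, the loop body forms a strict dependency chain.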
ParaDiGMS employs Picard iterations to parallelize the denoising steps in a data-parallel manner.
Tensor parallelism suffers from intolerable communication costs.
Displaced patch parallelism
Activation displacement
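A minimal sketch of the displacement idea: each device computes its own patch using fresh local activations plus stale peer activations cached from the previous denoising step, so the activation exchange can overlap with computation instead of blocking it. `model_step`, the cache layout, and the return convention are assumptions for illustration, not DistriFuser's API.

```python
def displaced_patch_step(patches, model_step, cached_peer_acts):
    # One denoising step under displaced patch parallelism (single-process sketch):
    # every "device" i processes its patch against activations cached from step t+1.
    outputs, new_acts = [], []
    for i, patch in enumerate(patches):          # in practice: one patch per GPU
        out, act = model_step(patch, cached_peer_acts)  # stale peer context, fresh patch
        outputs.append(out)
        new_acts.append(act)
    # new_acts would be exchanged asynchronously while the next step computes.
    return outputs, new_acts
```

Because adjacent denoising steps produce similar activations, reusing the one-step-stale peer context costs little accuracy while hiding the communication latency.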
https://arxiv.org/abs/2402.19481
https://github.com/mit-han-lab/distrifuser