SparkJiao / llama-pipeline-parallel

A prototype repo for hybrid training with pipeline parallelism and distributed data parallelism, with comments on the core code snippets. Feel free to copy code and open discussions about any problems you have encountered.

Pipeline with ZeRO-1 does not optimize GPU memory #4

Closed: xinpeng-zhang closed this issue 1 year ago

xinpeng-zhang commented 1 year ago

When I set ZeRO to stage 0, the GPU memory usage is similar to when I set ZeRO to stage 1.
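For reference, this comparison amounts to flipping the `zero_optimization.stage` field in the DeepSpeed config. A minimal sketch of the two configurations, assuming everything else (batch size, optimizer, etc.) is held fixed; the other field values here are placeholders:

```python
# Hypothetical minimal DeepSpeed config dicts for the comparison above.
# Only `zero_optimization.stage` differs between the two runs.
ds_config_stage0 = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 0},  # plain optimizer, no state partitioning
}

ds_config_stage1 = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},  # optimizer states sharded across the DP group
}
```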

SparkJiao commented 1 year ago

To my understanding, pure pipeline parallelism already includes a partitioning mechanism similar to ZeRO-2, where the optimizer on each stage only controls the updates of its own partition.

If you want further optimization, the savings would only be observed when you enable hybrid parallelism, e.g., 8 GPUs with 4-way pipeline parallelism and 2-way data parallelism.
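A sketch of that hybrid setup, assuming the `ds_config_stage1` dict from the previous comment. With 8 GPUs and `num_stages=4`, DeepSpeed derives a data-parallel degree of 8 / 4 = 2 per stage, and ZeRO-1 then shards optimizer states across that DP group (the toy layer list is a stand-in for the repo's real model):

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # required before constructing a PipelineModule

# Toy stand-in for the model's layer list (the repo builds this from LLaMA blocks).
layers = [nn.Linear(16, 16) for _ in range(8)]

# 4-way pipeline parallelism: launched on 8 GPUs (deepspeed --num_gpus 8 train.py),
# the remaining 8 / 4 = 2 ranks per stage form the data-parallel group.
model = PipelineModule(layers=layers, num_stages=4)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config_stage1,  # the stage-1 config sketched earlier
)
```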

xinpeng-zhang commented 1 year ago

Yes, I have done that, and I find that 4 PP x 2 DP uses the same memory as 8 PP x 1 DP. Is there any chance I could add you on WeChat: bestzxp

SparkJiao commented 1 year ago

Maybe you can check whether you have enabled CPU offload.
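If offloading is the culprit, the field to look for is `offload_optimizer` inside `zero_optimization`. A hedged sketch of what an offload-enabled config looks like; when this block is present, optimizer states live in host memory, which can mask any GPU-memory difference between ZeRO stages:

```python
ds_config_offload = {
    "zero_optimization": {
        "stage": 1,
        # Optimizer states are kept in CPU memory rather than on the GPU.
        # Remove this block (or set device to "none") to disable offloading.
        "offload_optimizer": {"device": "cpu"},
    },
}
```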