Open azton opened 1 year ago
Have functional implementation of deepspeed using zero3. Now need to analyze the performance and scalability--do we need pipeline/model parallelism to really run this?
First trial; 32 Nodes increasing batch size until cuda OOM failures. Requires roughly 1-2 hours to run 3-5 steps for profiling.
Have functional implementation of deepspeed using zero3. Now need to analyze the performance and scalability--do we need pipeline/model parallelism to really run this?