coreweave / kubernetes-cloud

Getting Started with the CoreWeave Kubernetes GPU Cloud
http://www.coreweave.com

Investigate Deepspeed/HuggingFace slowness in finetuner #171

Open wbrown opened 1 year ago

wbrown commented 1 year ago

DeepSpeed and HuggingFace appear to be slowing training down significantly in the finetuner. We should investigate why -- it may be the optimizer states.
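
One quick way to check whether the optimizer states are the culprit would be to time the forward/backward passes separately from the optimizer step. A plain-PyTorch sketch, not the finetuner's actual training loop (which goes through the DeepSpeed engine); `model`, `optimizer`, and `micro_batches` are placeholder names:

```python
import time
import torch

def time_one_step(model, optimizer, micro_batches):
    """Split one training step into its accumulation and optimizer phases."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for batch in micro_batches:              # one entry per gradient accumulation step
        loss = model(**batch).loss           # HuggingFace model outputs expose .loss
        (loss / len(micro_batches)).backward()
    torch.cuda.synchronize()
    fwd_bwd_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    optimizer.step()                         # dominated by CPU AdamW if the states are offloaded
    optimizer.zero_grad()
    torch.cuda.synchronize()
    opt_time = time.perf_counter() - t1
    return fwd_bwd_time, opt_time
```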

wbrown commented 1 year ago

@harubaru Where are we with this? Did the performance reporting you did turn up any insight?

harubaru commented 1 year ago

The investigation I have done has mainly revolved around using different ZeRO stages and trying out different hyperparameters. Different optimizers could not be tested because the base Torch image for the trainer lacks the required NCCL dependency.

That said, these are the adjustable factors (meaning: variables that can be changed through the workflow) that affect training speed the most; the relevant DeepSpeed config knobs are sketched after the results table below:

  1. Training performance is affected less by the ZeRO stage used than by the number of gradient accumulation steps (GAS).
  2. The number of gradient accumulation steps: higher GAS means higher throughput.
  3. Batch size: a larger batch size also increases overall throughput.

| Run Name | GAS Time (s) | OPT Time (s) | World Samples per Second | Rank Samples per Second | Total Time per Step (s) |
| --- | --- | --- | --- | --- | --- |
| zero_stage_3 | 58.096 | 25.079 | 0.4688 | 0.4688 | 83.892 |
| zero_stage_2 | 51.642 | 25.455 | 0.5082 | 0.5082 | 78.514 |
| zero_stage_1 | 51.863 | 27.691 | 0.5018 | 0.5018 | 79.112 |
| gas-5* | 68.412 | 23.677 | 0.4384 | 0.2209 | 91.512 |
| gas-2* | 17.238 | 23.498 | 0.3899 | 0.1976 | 40.524 |

* The runs testing the GAS timings use two GPUs instead of one. They also use a different version of the finetuner that is currently being tested in https://github.com/coreweave/kubernetes-cloud/pull/128, so those tests have to be rerun, but the above recommendations for improving training speed should hold regardless.
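
For reference, here is a minimal sketch of where those knobs live in a DeepSpeed config; the values are illustrative placeholders, not the finetuner's actual settings:

```python
# Illustrative values only -- not the finetuner's actual configuration.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # larger micro-batches raise throughput until memory runs out
    "gradient_accumulation_steps": 8,     # higher GAS amortizes the optimizer step over more samples
    "zero_optimization": {"stage": 1},    # stages 1-3 made little difference in the runs above
    "fp16": {"enabled": True},
}

# Handed to the engine roughly as:
#   engine, *_ = deepspeed.initialize(model=model,
#                                     model_parameters=model.parameters(),
#                                     config=ds_config)
```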

For future work, we should definitely look into using a different optimizer, as CPU AdamW has a very high performance overhead. There are also other methods we could try, such as incorporating flash-attention and using fused kernels for the optimizers, both of which would decrease memory usage further; however, the former requires a lot of monkey patching, and the latter needs more investigation, although DeepSpeed does support fused Adam out of the box.
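
To make that comparison concrete, here is a rough sketch of the two optimizer setups in DeepSpeed config terms (hyperparameter values are placeholders): keeping `offload_optimizer` pushes every update through DeepSpeed's CPU AdamW, while dropping it lets DeepSpeed use its fused GPU Adam/AdamW kernel at the cost of holding the optimizer states in GPU memory.

```python
# Placeholder hyperparameters; only the structure of the two configs matters here.
adamw_params = {"lr": 1e-5, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.01}

# Current style of setup: optimizer states offloaded to CPU (ZeRO offload),
# so every step pays the CPU AdamW overhead measured above.
cpu_offload_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "AdamW", "params": adamw_params},
}

# Candidate to benchmark: keep the optimizer states on GPU so DeepSpeed can use
# its fused Adam/AdamW implementation, trading memory for a much cheaper step.
fused_gpu_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": adamw_params},
}
```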

wbrown commented 1 year ago

Marking this as done, as the investigation is complete. Should we write a separate issue for using a different optimizer?