coreweave / kubernetes-cloud

Getting Started with the CoreWeave Kubernetes GPU Cloud
http://www.coreweave.com

Investigate Deepspeed/HuggingFace slowness in finetuner #171

Open wbrown opened 1 year ago

wbrown commented 1 year ago

DeepSpeed and HuggingFace appear to be slowing training down significantly in the finetuner. We should investigate why -- it may be the optimizer states.
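
One quick way to check whether the optimizer states are the culprit would be to time the forward/backward passes separately from the optimizer step. A plain-PyTorch sketch, not the finetuner's actual training loop (which goes through the DeepSpeed engine); `model`, `optimizer`, and `micro_batches` are placeholder names:

```python
import time
import torch

def time_one_step(model, optimizer, micro_batches):
    """Split one training step into its accumulation and optimizer phases."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for batch in micro_batches:              # one entry per gradient accumulation step
        loss = model(**batch).loss           # HuggingFace model outputs expose .loss
        (loss / len(micro_batches)).backward()
    torch.cuda.synchronize()
    fwd_bwd_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    optimizer.step()                         # dominated by CPU AdamW if the states are offloaded
    optimizer.zero_grad()
    torch.cuda.synchronize()
    opt_time = time.perf_counter() - t1
    return fwd_bwd_time, opt_time
```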

wbrown commented 1 year ago

@harubaru Where are we with this? Did the performance reporting you did turn up any insight?

harubaru commented 1 year ago

The investigation I have done has mainly revolved around using different ZeRO stages and trying out different hyperparameters. Different optimizers could not be tested because the base Torch image for the trainer lacks the required NCCL dependency.

That said, these are the adjustable factors (meaning: variables that can be changed through the workflow) that affect training speed the most; the relevant DeepSpeed config knobs are sketched after the results table below:

  1. Training performance is affected less by the ZeRO stage used than by the number of gradient accumulation steps (GAS).
  2. The number of gradient accumulation steps: higher GAS means higher throughput.
  3. Batch size: a larger batch size also increases overall throughput.

| Run Name | GAS Time (s) | OPT Time (s) | World Samples per Second | Rank Samples per Second | Total Time per Step (s) |
| --- | --- | --- | --- | --- | --- |
| zero_stage_3 | 58.096 | 25.079 | 0.4688 | 0.4688 | 83.892 |
| zero_stage_2 | 51.642 | 25.455 | 0.5082 | 0.5082 | 78.514 |
| zero_stage_1 | 51.863 | 27.691 | 0.5018 | 0.5018 | 79.112 |
| gas-5* | 68.412 | 23.677 | 0.4384 | 0.2209 | 91.512 |
| gas-2* | 17.238 | 23.498 | 0.3899 | 0.1976 | 40.524 |

* The runs testing the GAS timings use two GPUs instead of one. They also use a different version of the finetuner that is currently being tested in https://github.com/coreweave/kubernetes-cloud/pull/128, so those tests have to be rerun, but the above recommendations for improving training speed should hold regardless.
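
For reference, here is a minimal sketch of where those knobs live in a DeepSpeed config; the values are illustrative placeholders, not the finetuner's actual settings:

```python
# Illustrative values only -- not the finetuner's actual configuration.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # larger micro-batches raise throughput until memory runs out
    "gradient_accumulation_steps": 8,     # higher GAS amortizes the optimizer step over more samples
    "zero_optimization": {"stage": 1},    # stages 1-3 made little difference in the runs above
    "fp16": {"enabled": True},
}

# Handed to the engine roughly as:
#   engine, *_ = deepspeed.initialize(model=model,
#                                     model_parameters=model.parameters(),
#                                     config=ds_config)
```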

For future work, we should definitely look into using a different optimizer, as CPU AdamW has a very high performance overhead. There are also other methods we could try, such as incorporating flash-attention and using fused kernels for the optimizers, both of which would decrease memory usage further; however, the former requires a lot of monkey patching, and the latter needs more investigation, although DeepSpeed does support fused Adam out of the box.
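
To make that comparison concrete, here is a rough sketch of the two optimizer setups in DeepSpeed config terms (hyperparameter values are placeholders): keeping `offload_optimizer` pushes every update through DeepSpeed's CPU AdamW, while dropping it lets DeepSpeed use its fused GPU Adam/AdamW kernel at the cost of holding the optimizer states in GPU memory.

```python
# Placeholder hyperparameters; only the structure of the two configs matters here.
adamw_params = {"lr": 1e-5, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.01}

# Current style of setup: optimizer states offloaded to CPU (ZeRO offload),
# so every step pays the CPU AdamW overhead measured above.
cpu_offload_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "AdamW", "params": adamw_params},
}

# Candidate to benchmark: keep the optimizer states on GPU so DeepSpeed can use
# its fused Adam/AdamW implementation, trading memory for a much cheaper step.
fused_gpu_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": adamw_params},
}
```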

wbrown commented 1 year ago

Marking this as done, as the investigation is complete. Should we write a separate issue for using a different optimizer?