Open wbrown opened 1 year ago
@harubaru Where are we with this? Did the performance reporting you did yield any insight?
The investigation I have done has mainly revolved around using different ZeRO stages and trying out different hyperparameters. Different optimizers could not be used because the base Torch image for the trainer lacks a proper NCCL dependency, but besides that, here are some of the things that can most definitely improve training speed:
These are also the changeable factors (i.e., variables that can be adjusted through the workflow) that affect training speed; a minimal config sketch showing where these knobs live follows the table:
Run Name | GAS Time (s) | OPT Time (s) | World Samples per Second | Rank Samples per Second | Total Time per Step (s) |
---|---|---|---|---|---|
zero_stage_3 | 58.096 | 25.079 | 0.4688 | 0.4688 | 83.892 |
zero_stage_2 | 51.642 | 25.455 | 0.5082 | 0.5082 | 78.514 |
zero_stage_1 | 51.863 | 27.691 | 0.5018 | 0.5018 | 79.112 |
gas-5* | 68.412 | 23.677 | 0.4384 | 0.2209 | 91.512 |
gas-2* | 17.238 | 23.498 | 0.3899 | 0.1976 | 40.524 |
* The runs testing the GAS timings use two GPUs instead of one. They also use a different version of the finetuner that is currently being tested in https://github.com/coreweave/kubernetes-cloud/pull/128, so those tests have to be rerun, but the above recommendations for improving training speed should remain the same regardless.
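
For concreteness, here is a minimal sketch of where these knobs sit in a DeepSpeed config. The specific values (micro-batch size, ZeRO stage 2 with CPU optimizer offload, GAS of 2) are illustrative assumptions, not the finetuner's actual defaults:

```python
# Minimal DeepSpeed config sketch showing the adjustable factors compared above:
# ZeRO stage, gradient accumulation steps, and whether the optimizer is offloaded
# to the CPU. Values are illustrative, not the finetuner's real defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 2,            # corresponds to the "gas-2" run
    "zero_optimization": {
        "stage": 2,                              # fastest of the three ZeRO stages tested above
        "offload_optimizer": {"device": "cpu"},  # CPU AdamW offload; source of the OPT-time overhead
    },
    "fp16": {"enabled": True},
}
```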
For future work, we should definitely look into using a different optimizer, as CPU AdamW has a ridiculously high performance overhead. Other approaches worth trying include incorporating flash-attention and using fused kernels for the optimizers, which would decrease memory usage further; however, the former requires a lot of monkey patching, and the latter needs more investigation, although DeepSpeed does support fused Adam out of the box.
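
A hedged sketch of what dropping CPU AdamW in favour of DeepSpeed's built-in fused Adam could look like; the hyperparameter values are placeholders, and whether the optimizer states still fit in GPU memory for the models being finetuned is untested here:

```python
# Sketch only: keep optimizer states on the GPU (no "offload_optimizer" block),
# so DeepSpeed builds its fused Adam/AdamW implementation instead of CPU AdamW.
# Hyperparameter values below are placeholders.
ds_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.01},
    },
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}
```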
Marking this as done, as the investigation is complete. Should we write an issue for using a different optimizer?
DeepSpeed and Hugging Face appear to be slowing training down significantly. We should investigate why -- it may be the optimizer states.
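
One way to check, sketched under the assumption that the finetuner goes through the Hugging Face Trainer: enable DeepSpeed's `wall_clock_breakdown` so the per-step forward/backward/optimizer timings are logged, which should show whether the optimizer states are where the time goes.

```python
from transformers import TrainingArguments

# Sketch: a DeepSpeed config with wall_clock_breakdown enabled, passed to the
# Hugging Face Trainer via TrainingArguments. "auto" lets the HF integration
# fill values from the training arguments; output_dir is a placeholder.
ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": "auto"},
    "wall_clock_breakdown": True,              # log forward/backward/optimizer-step times
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",                          # placeholder
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,                       # accepts a dict or a path to a JSON file
)
```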