Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Refactor the DeepSpeed strategy config management #17472

Open carmocca opened 1 year ago

carmocca commented 1 year ago

Outline & Motivation

DeepSpeed works by using a configuration file (dictionary) that allows customizing all of its aspects: https://www.deepspeed.ai/docs/config-json/

The DeepSpeedStrategy supports two ways of defining this:

  1. Passing a config file, in which case every other argument is ignored: https://github.com/Lightning-AI/lightning/blob/b792c90ea7148d61af192fde6c338ebbd355702f/src/lightning/fabric/strategies/deepspeed.py#L191
  2. Exposing many of these options as `__init__` arguments, which are used to build a base config: https://github.com/Lightning-AI/lightning/blob/b792c90ea7148d61af192fde6c338ebbd355702f/src/lightning/fabric/strategies/deepspeed.py#L242-L271
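The two paths above can be sketched as follows. This is a simplified, hypothetical illustration of the dispatch logic, not Lightning's actual implementation; the parameter names (`config_path`, `zero_stage`, `offload_optimizer`) are stand-ins for the real strategy signature:

```python
import json

def build_config(config_path=None, zero_stage=2, offload_optimizer=False):
    """Simplified sketch of how the strategy resolves its config today."""
    if config_path is not None:
        # Way 1: a user-provided config file wins; every other argument is ignored.
        with open(config_path) as f:
            return json.load(f)
    # Way 2: a base config is assembled from the exposed __init__ arguments.
    cfg = {"zero_optimization": {"stage": zero_stage}}
    if offload_optimizer:
        cfg["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}
    return cfg
```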

Option 2 is not scalable: every new DeepSpeed option would require exposing yet another `__init__` argument.

Pitch

Remove all these exposed arguments and keep only a `config` argument that is overloaded to accept a path to a config file, a config dictionary, or `None`.

The default config is created by calling: https://github.com/microsoft/DeepSpeed/blob/085981bf1caf5d7d0b26d05f7c7e9487e1b35190/deepspeed/runtime/config.py#L674
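A minimal sketch of how such an overloaded `config` argument could be resolved. All names here are hypothetical (including the placeholder default dict, which stands in for DeepSpeed's own default-config constructor), not a proposed final API:

```python
import json
from pathlib import Path

def resolve_config(config=None):
    """Resolve a single overloaded `config` argument: path, dict, or None."""
    if config is None:
        # Placeholder for DeepSpeed's default config constructor.
        return {"train_batch_size": "auto"}
    if isinstance(config, (str, Path)):
        # A path to a JSON config file on disk.
        return json.loads(Path(config).read_text())
    if isinstance(config, dict):
        # An in-memory config dict; copy so the caller's dict is not mutated.
        return dict(config)
    raise TypeError(f"Unsupported config type: {type(config).__name__}")
```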

Additional context

DeepSpeed is considered experimental so we could do this breaking change: https://github.com/Lightning-AI/lightning/blob/b792c90ea7148d61af192fde6c338ebbd355702f/src/lightning/fabric/strategies/deepspeed.py#L99

cc @justusschock @awaelchli @carmocca

keunwoochoi commented 5 months ago

+1 for this!