Closed: lzl-mt closed this issue 1 month ago
Have you solved it?
I also ran into the same issue. I tried changing sharding_strategy: str = "NO_SHARD" to sharding_strategy: str = "FULL_SHARD", which raised the error message: "KeyError: 'ShardingStrategy.FULL_SHARD'".
Any idea how to solve it?
Maybe you can try commenting out this line in finetune.py: sharding_strategy=fsdp_config.sharding_strategy. The cause might be an extra layer of wrapping or some conversion applied when the parameter is passed through, and the default value for sharding_strategy is FULL_SHARD anyway. I am not sure whether training stays healthy with this change, though, since my loss just went to NaN.
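For context, the KeyError suggests that the value reaching FSDP is not the same ShardingStrategy enum member that torch's internal SHARDING_STRATEGY_MAP is keyed on (for example, a plain string from the config, or an enum re-created by the config system). Here is a minimal sketch of a workaround, assuming a hypothetical resolve_sharding_strategy helper added to finetune.py:

```python
# Hypothetical helper for src/slam_llm/pipeline/finetune.py: turn whatever the
# config carries ("FULL_SHARD", "ShardingStrategy.FULL_SHARD", or an enum copy)
# into the torch.distributed.fsdp.ShardingStrategy member that FSDP expects.
from torch.distributed.fsdp import ShardingStrategy

def resolve_sharding_strategy(value) -> ShardingStrategy:
    if isinstance(value, ShardingStrategy):
        return value
    # Accept both "FULL_SHARD" and "ShardingStrategy.FULL_SHARD".
    name = str(value).split(".")[-1].strip()
    return ShardingStrategy[name]

# Then pass the resolved enum when wrapping the model, e.g.:
# model = FSDP(model, sharding_strategy=resolve_sharding_strategy(fsdp_config.sharding_strategy), ...)
```

Commenting out the argument also works because FSDP then falls back to its default, which is FULL_SHARD.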
Any progress on that?
System Info
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
Information
🐛 Describe the bug
Hi, I can train the asr_librispeech finetuning code with DDP; however, when I switch to FSDP, the exception below is raised.
Error logs
Traceback (most recent call last):
  File "examples/asr_librispeech/finetune_asr.py", line 41, in main_hydra
    train(kwargs)
  File "/SLAM-LLM/src/slam_llm/pipeline/finetune.py", line 167, in main
    model = FSDP(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
    _auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
    _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 529, in _init_param_handle_from_params
    SHARDING_STRATEGY_MAP[state.sharding_strategy],
KeyError: 'ShardingStrategy.NO_SHARD'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2031656) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
Expected behavior
How should I modify the code to use FSDP for speedup? Thanks a lot! :D
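For anyone looking for a concrete starting point, here is a minimal sketch of what the FSDP wrapping could look like with the strategy passed as the enum member directly. The auto-wrap policy, the stand-in model, and the other keyword arguments are placeholders, not SLAM-LLM's actual settings, and the snippet assumes torch.distributed is already initialized as it is in the training pipeline:

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Placeholder wrap policy; the repo's real policy wraps specific transformer blocks.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

# Stand-in for the model built earlier in finetune.py.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 16)).cuda()

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # pass the enum member, not a string
    device_id=torch.cuda.current_device(),
)
```

Note that NO_SHARD makes FSDP behave essentially like DDP, so FULL_SHARD (or SHARD_GRAD_OP) is what actually reduces per-GPU memory.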