facebookresearch / fairscale

PyTorch extensions for high performance and large scale training.

[FSDP] Cannot resume the optimizer with params in multiple groups #601

Open chunfuchen opened 3 years ago

chunfuchen commented 3 years ago

❓ Questions and Help

I tried to follow the tutorial to change my code to use FSDP; however, I do not know how to resume training properly. Every time I resume, it seems to restart from scratch.

In order to resume, there are three state_dicts:

  1. model weights
  2. optimizer
  3. learning rate scheduler

I checked that 1 and 3 are correct after I load them at the beginning of resuming; however, something seems to be wrong with load_state_dict for the optimizer.

I simply use optimizer.state_dict() to get the state dict and save it to disk, and optimizer.load_state_dict(state_dict) to restore it (roughly as in the sketch below). Is there anything wrong with resuming the optimizer's state dict this way?
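For concreteness, this is the pattern I am following (a minimal sketch; `model`, `optimizer`, `scheduler`, and the checkpoint path are placeholders for my actual setup):

```python
import torch

# `model`, `optimizer`, and `scheduler` are assumed to already exist
# from the training setup; "ckpt.pt" is an illustrative path.

# Saving
torch.save(
    {
        "model": model.state_dict(),          # 1. model weights
        "optimizer": optimizer.state_dict(),  # 2. optimizer state
        "scheduler": scheduler.state_dict(),  # 3. LR scheduler state
    },
    "ckpt.pt",
)

# Resuming
ckpt = torch.load("ckpt.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
```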

I am not sure whether it is related to #538 and #539.

Thanks for your excellent work and help.

Note: I am using AdamW from PyTorch with multiple parameter groups, since each group requires a different weight decay (set up roughly as in the sketch below).
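For context, the per-group weight decay is set up along these lines (a simplified sketch with a toy model rather than my actual code):

```python
import torch
import torch.nn as nn

# Toy model standing in for the real one.
model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))

# No weight decay for biases and 1-D (norm) parameters; weight decay for the rest.
decay, no_decay = [], []
for name, param in model.named_parameters():
    if param.ndim == 1 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-4,
)
```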

sshleifer commented 3 years ago

Definitely related!

Could I see how you are initializing FSDP and the optimizer? Bonus points if you have a small reproducible snippet!
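For reference, a typical initialization order looks roughly like this (a sketch only, with a tiny placeholder model; the setup in question may well differ):

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = FSDP(nn.Linear(16, 16).cuda())  # wrap *before* building the optimizer

# The optimizer (and its parameter groups) should be built from the wrapped
# module's parameters, which FSDP has flattened and sharded across ranks.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```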

sshleifer commented 3 years ago

I'm pretty much out of cycles until June 1, sadly, so feel free to grab this!

min-xu-ai commented 3 years ago

@chunfuchen, sorry that many of us are busy with other issues. I am happy to help diagnose the issue here if I can. Do you know for sure that the problem is with the optimizer state? What do you mean by the training starting from scratch? Does the loss value go back up, or is it something else?

If possible, can you share a minimal reproducible case so that we can see exactly what the problem is?

BTW, what's your reason for using FSDP?

anj-s commented 3 years ago

@chunfuchen Did you get a chance to figure out the issue? Would like to understand more and help if possible. Thanks!