Open chunfuchen opened 3 years ago
Definitely related!
Could I see how you are initializing FSDP and the optimizer? Bonus points if you have a small reproducible snippet!
I'm pretty much out of cycles until June 1, sadly, so feel free to grab this!
@chunfuchen, sorry that many of us are busy with other issues. I am happy to help diagnose the issue here if I can. Do you know for sure it is a problem with the optimizer state? What do you mean by the training starting from scratch? Does the loss value go back up, or is it something else?
If possible, can you share a minimal reproducible case so that we can see exactly what the problem is?
BTW, what's your reason for using FSDP?
@chunfuchen Did you get a chance to figure out the issue? Would like to understand more and help if possible. Thanks!
❓ Questions and Help
I tried to follow the tutorial to change my code to use FSDP; however, I do not know how to resume training properly. Every time I resume, it seems to restart from scratch.
In order to resume, there are three state_dicts.
I checked that 1 and 3 are correct after loading them at the beginning of the resumed run; however, it seems that something is wrong with `load_state_dict` for the optimizer.
I simply use `optimizer.state_dict()` to get the state_dict and save it to disk, and `optimizer.load_state_dict(state_dict)` to restore it. Is there anything wrong with resuming the optimizer's state dict this way? I am not sure whether it is related to #538 and #539.
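For reference, here is a minimal sketch of the save/resume pattern described above, using a plain (non-FSDP) model as a stand-in; the model, filename, and training step are hypothetical. Note that under FSDP, parameters and optimizer state are sharded per rank, so the plain `state_dict()` round trip shown here may not be sufficient on its own:

```python
import torch

# Hypothetical tiny model standing in for the real FSDP-wrapped model.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Take one step so the optimizer actually has state (exp_avg buffers, step count).
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# Save: put both state_dicts into one checkpoint file.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "ckpt.pt",
)

# Resume: rebuild model/optimizer the same way, then load both state_dicts.
ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])

# After loading, the optimizer should carry the AdamW moment buffers.
restored_state = optimizer.state_dict()["state"]
```

If `restored_state` comes back empty (no `exp_avg`/`exp_avg_sq` entries), the resumed run effectively starts Adam's statistics from scratch, which would match the symptom described.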
Thanks for your excellent work and help.
Note: I am using `AdamW` from PyTorch with multiple parameter groups, since each group requires a different weight decay.
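A minimal sketch of such a multi-group `AdamW` setup, assuming the common split of decayed weight matrices vs. undecayed biases/norm parameters (the model and decay values here are illustrative, not taken from the issue):

```python
import torch

# Hypothetical model with both weight matrices and 1-D params (biases, norms).
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.LayerNorm(4))

# Split parameters: apply weight decay only to tensors of rank >= 2.
decay, no_decay = [], []
for name, p in model.named_parameters():
    (decay if p.ndim >= 2 else no_decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)

# The per-group hyperparameters are part of the optimizer's state_dict,
# so load_state_dict() must be called on an optimizer built with the
# same group structure for resuming to work.
group_decays = [g["weight_decay"] for g in optimizer.state_dict()["param_groups"]]
```

One thing worth double-checking when resuming: `load_state_dict` maps saved state onto the groups by position, so if the parameter-group construction order differs between the saving and resuming runs, state can silently end up attached to the wrong parameters.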