Closed nprisbrey closed 6 months ago
@KimballNJardine, we need to keep the permissions at least with executable privileges for the user for both of the job scripts. The changes introduced in ef1f54a prevent the job scripts from being executable, which defeats the point of a script.
What kind of issues are you getting when these scripts are executable? Can they be resolved locally?
Summary: I feel that these changes in ef1f54a should be rolled back.
We need to make sure that we save the command line arguments and the finished trained weights even if checkpoint is turned off, after all the training is over.
Well said, nice catch! Thanks for fixing the issue in ed02183. I simply touched up a comment that was missed and some trailing space in bf74682.
The code changes changes the
--checkpoint-activations
command line argument into the--checkpoint
argument. The meaning of these two are different as well.--checkpoint-activations
originally told the torchscale framework to use FairScale to do checkpoint activation saving, which didn't clearly add a visible benefit to our project and wasn't compatible withtorch.compile()
functionality. For this reason, we mainly avoided using this flag.--checkpoints
now refers to if the model should use our custom checkpointing save code for the model weights while training or not. This retains the capability of saving different states of the model throughout training while increasing interpretability and allowing us to continue reaping the benefits of thetorch.compile()
speedup.What are everyone's thoughts about this change of functionality? Is this desirable to others or only to me? (Specifically looking to @KimballNJardine for his input, seeing as we discussed this a bit earlier in relation to #8)