Checkpoint Command Line Argument Refactor

nprisbrey commented 6 months ago

The code changes changes the --checkpoint-activations command line argument into the --checkpoint argument. The meaning of these two are different as well. --checkpoint-activations originally told the torchscale framework to use FairScale to do checkpoint activation saving, which didn't clearly add a visible benefit to our project and wasn't compatible with torch.compile() functionality. For this reason, we mainly avoided using this flag.

--checkpoints now refers to if the model should use our custom checkpointing save code for the model weights while training or not. This retains the capability of saving different states of the model throughout training while increasing interpretability and allowing us to continue reaping the benefits of the torch.compile() speedup.

What are everyone's thoughts about this change of functionality? Is this desirable to others or only to me? (Specifically looking to @KimballNJardine for his input, seeing as we discussed this a bit earlier in relation to #8)

nprisbrey commented 6 months ago

@KimballNJardine, we need to keep the permissions at least with executable privileges for the user for both of the job scripts. The changes introduced in ef1f54a prevent the job scripts from being executable, which defeats the point of a script.

What kind of issues are you getting when these scripts are executable? Can they be resolved locally?

Summary: I feel that these changes in ef1f54a should be rolled back.

nprisbrey commented 6 months ago

We need to make sure that we save the command line arguments and the finished trained weights even if checkpoint is turned off, after all the training is over.

Well said, nice catch! Thanks for fixing the issue in ed02183. I simply touched up a comment that was missed and some trailing space in bf74682.

DRAGNLabs / 301r_retnet

Checkpoint Command Line Argument Refactor #10