Previously, we used Torchinfo to obtain the model's total parameter count before training began. Since flash attention was added, this appears to cause out-of-memory (OOM) errors in VRAM for very large models, likely because of a problem in Torchinfo. This change replaces the Torchinfo code with simple PyTorch code that sums all model parameters. Functionality remains the same.
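For reviewers, here is a minimal sketch of what such a parameter count can look like. It is an illustration and not necessarily the exact code in this change: the function name count_parameters and the trainable_only flag are hypothetical. It relies only on the standard parameters() and numel() calls, so it works on any torch.nn.Module and reads tensor sizes without running the model.

```python
def count_parameters(model, trainable_only=False):
    """Return the number of parameters in a torch.nn.Module.

    Hypothetical helper for illustration. It only reads each
    parameter's element count, so the model is never executed.
    """
    return sum(
        p.numel()                      # number of elements in this tensor
        for p in model.parameters()    # all registered parameters
        if p.requires_grad or not trainable_only
    )
```

For example, calling it on torch.nn.Linear(10, 5) gives 55 (50 weights plus 5 biases).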
Previously, we created a unique model_label for each run, which included a timestamp. This is nice, but when a job is requeued through Slurm, it creates a new folder and doesn't find the checkpoint, so training restarts from the beginning. Instead, the code should reference the same model folder. To facilitate this, I have added a new configuration parameter, model_label, which the user can set to the name of the folder where run info (checkpoints, logs) is saved. In my experience, this approach is much easier to use and gives the user more control. If it isn't supplied, a label will be generated. The biggest downside is that a supplied label may point to an already existing folder, in which case the run will start using it, so the user needs to be careful.
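The folder lookup described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code in this change: resolve_run_dir, base_dir, and the "run_" timestamp format are hypothetical names chosen for the example. The second return value flags a non-empty existing folder, which is one way to warn the user about the downside mentioned above.

```python
from datetime import datetime
from pathlib import Path


def resolve_run_dir(base_dir, model_label=None):
    """Return (run_dir, reused) for a training run.

    Hypothetical helper for illustration. If model_label is given,
    the same folder is returned on every (re)start, so a requeued
    job can find its checkpoints. Otherwise a timestamped label is
    generated.
    """
    if model_label is None:
        # Assumed fallback format; the real generated label may differ.
        model_label = datetime.now().strftime("run_%Y%m%d_%H%M%S")
    run_dir = Path(base_dir) / model_label
    # True when the folder already holds files, e.g. from an earlier run.
    reused = run_dir.is_dir() and any(run_dir.iterdir())
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir, reused
```

A caller could log a warning when reused is true and no checkpoint is found, since that suggests the label collided with an unrelated run.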