DrewGalbraith opened this issue 7 months ago
This is solved in https://github.com/DRAGNLabs/301r_retnet/pull/60
This was only half-solved in #60, leading to checkpoints like the following:
I propose adding the following line to `CustomModelCheckpoint`'s `on_checkpoint()` function to actually update the object's filename each time:
```python
self.file_name = self.file_name.replace(f"{self.num_ckpts-1}", f"{self.num_ckpts}", 1)
```
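A quick sketch of what that replacement does, assuming the stored file name begins with the running checkpoint count (the exact format here is an assumption for illustration):

```python
# Hypothetical file name embedding the checkpoint count (format assumed for illustration)
file_name = "1_epoch_3_validation_0.52"
num_ckpts = 2  # counter value after the new checkpoint is saved

# Replace only the first occurrence of the old count with the new one
file_name = file_name.replace(f"{num_ckpts - 1}", f"{num_ckpts}", 1)
print(file_name)  # → 2_epoch_3_validation_0.52
```

The `count=1` argument matters: without it, a `1` appearing later in the name (e.g. in the loss value) would also be rewritten.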
And it would be nice to have the naming convention use the format `000_epoch...`, `001_epoch...`, `002_epoch...`, ... so as not to group ckpts 1 and 11 together, for example.
To do this, replace the above proposed line with the following:
```python
len_num = 3 if self.num_ckpts < 1000 else 4
self.file_name = self.file_name.replace(f"{self.num_ckpts-1:0>{len_num}}", f"{self.num_ckpts:0>{len_num}}", 1)
```
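To see why the zero padding helps, here is a small sketch (the names are made up): with padded counts, lexicographic order in a directory listing matches numeric order, so ckpt 11 no longer sorts next to ckpt 1.

```python
len_num = 3  # matches the proposal for fewer than 1000 checkpoints

# Without padding, ckpt 11 groups with ckpt 1 in a sorted listing
print(sorted(f"{n}_epoch" for n in (1, 2, 11)))
# → ['11_epoch', '1_epoch', '2_epoch']

# With zero padding, lexicographic order matches numeric order
print(sorted(f"{n:0>{len_num}}_epoch" for n in (1, 2, 11)))
# → ['001_epoch', '002_epoch', '011_epoch']
```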
In `train_model.py` we have an issue where checkpoints saved from the same epoch will likely overwrite one another. It looks like line 137 can change this behavior: `filename="epoch_{epoch}_validation_{val_loss:.2f}"`
Maybe we can include the step count in the filename to differentiate meaningfully, or the time since the start of the training run.
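As a sketch of the step-count idea (the template below is an assumption that extends the existing pattern with a `{step}` field; I'm illustrating with plain `str.format` rather than the checkpoint callback itself), two checkpoints saved in the same epoch now get distinct names:

```python
# Hypothetical template extending the existing filename pattern with the global step
template = "epoch_{epoch}_step_{step}_validation_{val_loss:.2f}"

# Two checkpoints from the same epoch, same validation loss, different steps
first = template.format(epoch=3, step=1500, val_loss=0.52)
second = template.format(epoch=3, step=3000, val_loss=0.52)
print(first)   # → epoch_3_step_1500_validation_0.52
print(second)  # → epoch_3_step_3000_validation_0.52
```

With only `{epoch}` and `{val_loss:.2f}` in the template, those two saves would collide and the second would overwrite the first.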