DRAGNLabs / 301r_retnet


Checkpoints from the same epoch will overwrite one another #50

Open DrewGalbraith opened 3 months ago

DrewGalbraith commented 3 months ago

In train_model.py we have an issue where checkpoints saved during the same epoch will likely overwrite one another. It looks like line 137 is where this behavior can be changed:

filename="epoch_{epoch}_validation_{val_loss:.2f}"

Maybe we can include the step count in the filename to differentiate checkpoints meaningfully, or the time elapsed since the start of the training run.
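For reference, here is a minimal sketch of the step-count option, assuming line 137 configures PyTorch Lightning's ModelCheckpoint (which makes both epoch and step available to the filename template); the exact arguments in train_model.py may differ:

# Sketch only: the import path depends on the Lightning version the repo pins
# (it may be `lightning.pytorch.callbacks` instead).
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    # Including {step} makes two saves from the same epoch get distinct names.
    filename="epoch_{epoch}_step_{step}_validation_{val_loss:.2f}",
    monitor="val_loss",  # assumption: the run monitors validation loss
    save_top_k=-1,       # assumption: keep every checkpoint rather than only the best k
)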

DrewGalbraith commented 3 months ago

This is solved in https://github.com/DRAGNLabs/301r_retnet/pull/60

DrewGalbraith commented 2 months ago

This was only half-solved in #60, leading to checkpoints like the following:

[screenshot of the resulting checkpoint filenames]

I propose adding the following line to CustomModelCheckpoint's on_checkpoint() function to actually update the object's filename each time:

self.file_name = self.file_name.replace(f"{self.num_ckpts-1}", f"{self.num_ckpts}", 1)
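For context, a rough sketch of where that line would live, assuming CustomModelCheckpoint subclasses Lightning's ModelCheckpoint, keeps its own num_ckpts counter in self.num_ckpts, stores the template in self.file_name, and has on_checkpoint() invoked once per save by the surrounding training code (the real class in this repo may differ in the details):

class CustomModelCheckpoint(ModelCheckpoint):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_ckpts = 0
        self.file_name = "0_epoch_{epoch}_validation_{val_loss:.2f}"

    def on_checkpoint(self):
        # Bump the counter and rewrite it inside the filename so consecutive
        # saves stop reusing the previous name.
        self.num_ckpts += 1
        self.file_name = self.file_name.replace(
            f"{self.num_ckpts - 1}", f"{self.num_ckpts}", 1
        )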

DrewGalbraith commented 2 months ago

And it would be nice to have the naming convention use the format "000_epoch..., 001_epoch..., 002_epoch..., ..." so that, for example, checkpoints 1 and 11 are not grouped together when the files are sorted lexicographically.
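A quick illustration of the sorting problem in plain Python (independent of the repo's code):

unpadded = ["epoch_1", "epoch_11", "epoch_2"]
padded = ["001_epoch", "011_epoch", "002_epoch"]

print(sorted(unpadded))  # ['epoch_1', 'epoch_11', 'epoch_2'] -- 1 and 11 end up adjacent
print(sorted(padded))    # ['001_epoch', '002_epoch', '011_epoch'] -- numeric order preserved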

DrewGalbraith commented 2 months ago

To do this, replace the line proposed above with the following (padding the old and new counters separately so the rename still works when the count crosses 999 -> 1000):

old_width = 3 if self.num_ckpts - 1 < 1000 else 4  # width the previous counter was written with
new_width = 3 if self.num_ckpts < 1000 else 4      # width to use for the new counter
self.file_name = self.file_name.replace(f"{self.num_ckpts-1:0>{old_width}}", f"{self.num_ckpts:0>{new_width}}", 1)
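
An alternative that avoids the string replace entirely would be to rebuild the counter prefix on every save, e.g. (a sketch, assuming the rest of the template stays fixed):

# Rebuild the whole template instead of patching the old counter in place.
self.file_name = f"{self.num_ckpts:04d}_epoch_{{epoch}}_validation_{{val_loss:.2f}}"

Using a fixed four-digit width also sidesteps the 999 -> 1000 edge case, at the cost of slightly longer names for early checkpoints.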