LinasVidziunas / Unsupervised-lesion-detection-with-multi-view-MRI-and-autoencoders


Checkpoints #16

Open LinasVidziunas opened 2 years ago

LinasVidziunas commented 2 years ago

Checkpoints

It would probably be smart to incorporate checkpoints in our training. I've currently been running a process for 13 hours (probably overkill). As we're going to use multiple views later in the project, we can expect training times to increase considerably. Any process running for over 24 hours will be terminated automatically by SLURM; if no checkpoints have been saved, all of the work from the previous 24 hours is lost. The GPU clusters at UiS also seem to have power failures every few months, so it's nice to be prepared.

TASKS:

I would especially recommend looking at resource 2, as it's simple and covers checkpointing together with distributed training (as do some of the other resources). Whoever implements this should expect distributed training (multiple GPUs) to be added in the near future. It might also be smart to keep this code out of the cAE.py file, leaving cAE.py to (more or less) only contain the cAE model. Later, cAE.py might be renamed to models.py and include other models such as a VAE. A rough sketch of such a separate module is shown below.
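A minimal sketch of what such a module could look like, based on the `ModelCheckpoint` callback from resource 1 below. The module name, checkpoint path, and monitored metric are illustrative assumptions, not code that exists in the repo:

```python
# callbacks.py -- hypothetical module name, kept separate from cAE.py.
import os

from tensorflow import keras


def best_model_checkpoint(filepath="checkpoints/best_cae.h5"):
    """Save the full model whenever the validation loss improves."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    return keras.callbacks.ModelCheckpoint(
        filepath=filepath,
        monitor="val_loss",       # assumes a validation split is passed to fit()
        save_best_only=True,      # only overwrite when val_loss improves
        save_weights_only=False,  # store architecture + weights so training can resume
        verbose=1,
    )
```

It would then be passed to training as, e.g., `model.fit(x_train, x_train, validation_data=(x_val, x_val), callbacks=[best_model_checkpoint()])`.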

Resources

  1. https://keras.io/api/callbacks/model_checkpoint/
  2. https://keras.io/guides/distributed_training/#using-callbacks-to-ensure-fault-tolerance
  3. https://www.tensorflow.org/tutorials/distribute/keras#define_the_callbacks
  4. https://www.tensorflow.org/tutorials/distribute/custom_training#training_loop
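For reference, a rough sketch of the pattern resources 2 and 3 describe: build and compile the model inside a `MirroredStrategy` scope and add `BackupAndRestore` alongside `ModelCheckpoint`, so an interrupted SLURM job can resume mid-training. The tiny stand-in autoencoder and dummy data are placeholders for the real cAE and data pipeline:

```python
import os

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs; falls back to one device otherwise
with strategy.scope():
    # Stand-in autoencoder; the real cAE from cAE.py would be built here instead.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(64, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")

os.makedirs("checkpoints", exist_ok=True)
callbacks = [
    # Saves training state each epoch and restores it automatically when the job restarts
    # (on TF < 2.8 this lives under tf.keras.callbacks.experimental.BackupAndRestore).
    tf.keras.callbacks.BackupAndRestore(backup_dir="backup"),
    tf.keras.callbacks.ModelCheckpoint("checkpoints/cae_{epoch:02d}.h5", save_freq="epoch"),
]

# Dummy data standing in for the MRI slices.
x = np.random.rand(256, 64).astype("float32")
model.fit(x, x, epochs=5, batch_size=32, callbacks=callbacks)
```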
LinasVidziunas commented 2 years ago

Status update as of 07/02/22

Changed the priority of this issue from Priority 3 to Priority 4, as it is currently not of significant importance.