LinasVidziunas / Unsupervised-lesion-detection-with-multi-view-MRI-and-autoencoders


Checkpoints #16

Open LinasVidziunas opened 2 years ago

LinasVidziunas commented 2 years ago

Checkpoints

It would probably be smart to incorporate checkpoints in our training. I've currently been running a process for 13 hours (probably overkill). As we're going to use multiple views later in the project, we can expect training times to increase considerably. Any process running for over 24 hours will be terminated automatically by SLURM; if no checkpoints have been saved, all of the work from the previous 24 hours is lost. The GPU clusters at UiS also seem to have power failures every few months, so it's nice to be prepared.

TASKS:

I would especially recommend looking at resource 2, as it's simple and covers checkpointing together with distributed training (as do some of the other resources). Whoever implements this should expect distributed training (multiple GPUs) to be added in the near future. It might also be smart to keep this code out of the cAE.py file, leaving cAE.py to (more or less) only contain the cAE model. Later, cAE.py might be renamed to models.py and include other models such as a VAE. A rough sketch of such a separate module is shown below.
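A minimal sketch of what such a module could look like, based on the `ModelCheckpoint` callback from resource 1 below. The module name, checkpoint path, and monitored metric are illustrative assumptions, not code that exists in the repo:

```python
# callbacks.py -- hypothetical module name, kept separate from cAE.py.
import os

from tensorflow import keras


def best_model_checkpoint(filepath="checkpoints/best_cae.h5"):
    """Save the full model whenever the validation loss improves."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    return keras.callbacks.ModelCheckpoint(
        filepath=filepath,
        monitor="val_loss",       # assumes a validation split is passed to fit()
        save_best_only=True,      # only overwrite when val_loss improves
        save_weights_only=False,  # store architecture + weights so training can resume
        verbose=1,
    )
```

It would then be passed to training as, e.g., `model.fit(x_train, x_train, validation_data=(x_val, x_val), callbacks=[best_model_checkpoint()])`.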

Resources

  1. https://keras.io/api/callbacks/model_checkpoint/
  2. https://keras.io/guides/distributed_training/#using-callbacks-to-ensure-fault-tolerance
  3. https://www.tensorflow.org/tutorials/distribute/keras#define_the_callbacks
  4. https://www.tensorflow.org/tutorials/distribute/custom_training#training_loop
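For reference, a rough sketch of the pattern resources 2 and 3 describe: build and compile the model inside a `MirroredStrategy` scope and add `BackupAndRestore` alongside `ModelCheckpoint`, so an interrupted SLURM job can resume mid-training. The tiny stand-in autoencoder and dummy data are placeholders for the real cAE and data pipeline:

```python
import os

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs; falls back to one device otherwise
with strategy.scope():
    # Stand-in autoencoder; the real cAE from cAE.py would be built here instead.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(64, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")

os.makedirs("checkpoints", exist_ok=True)
callbacks = [
    # Saves training state each epoch and restores it automatically when the job restarts
    # (on TF < 2.8 this lives under tf.keras.callbacks.experimental.BackupAndRestore).
    tf.keras.callbacks.BackupAndRestore(backup_dir="backup"),
    tf.keras.callbacks.ModelCheckpoint("checkpoints/cae_{epoch:02d}.h5", save_freq="epoch"),
]

# Dummy data standing in for the MRI slices.
x = np.random.rand(256, 64).astype("float32")
model.fit(x, x, epochs=5, batch_size=32, callbacks=callbacks)
```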
LinasVidziunas commented 2 years ago

Status update as of 07/02/22

Changed the priority of this issue from Priority 3 to Priority 4, as it is currently not of significant importance.