ORNL / HydraGNN

Distributed PyTorch implementation of multi-headed graph convolutional neural networks
BSD 3-Clause "New" or "Revised" License

Checkpoint Training Model #187

Closed JustinBakerMath closed 1 year ago

JustinBakerMath commented 1 year ago

Basic checkpointing for training based on the synchronized validation_loss performance metric.

Can be turned on by adding

"Training" : {
    "Checkpoint" : true,
}

with an additional `checkpoint_freq` argument controlling how often checkpoints are written (default: 10)

"Training" : {
    "Checkpoint" : true,
    "checkpoint_freq" : 10,
}
JustinBakerMath commented 1 year ago

After revisiting this: `checkpoint_freq` and `min_perf_metric` are not really compatible.

If the goal is to checkpoint based on `min_perf_metric`, then there should be a warmup period rather than a fixed frequency.

I've refactored this accordingly, replacing `checkpoint_freq` with `checkpoint_warmup`, which delays checkpointing until after a warmup period.

Two print statements are also included.

allaffa commented 1 year ago

@jychoi-hpc

Hi Jong. Justin implemented a checkpoint-restart mechanism that saves a copy of the model whenever the validation loss decreases at a new iteration. A warmup period is applied at the start of training to avoid saving too many models in the early iterations. I was wondering whether this PR might introduce bottlenecks that significantly slow down training. Any thoughts on this?