Avoid loading a model twice, potentially from two different checkpoints.
Motivation and Context
This might be a corner case, but an unwanted model load happens when all three of these conditions are met:
running under SLURM (with the --preemptable flag set)
loading a model from a checkpoint (with the --load_from_checkpoint flag set)
a --checkpoint_dir is set AND it already contains some checkpoints. If the latest checkpoint in that folder differs from the one passed through --load_from_checkpoint, the former will be overwritten. I expect this to be unwanted behaviour.
Description
If a training run is launched NOT under SLURM but with the --preemptable flag set, the overwriting will not happen, since EGG assigns a unique name to the checkpoint folder, which therefore contains no existing checkpoints (https://github.com/facebookresearch/EGG/blob/5a68e295c31342385d024ecd5cbff0ff69ee69b0/egg/core/distributed.py#L107-L112), and loading from the latest checkpoint does nothing (https://github.com/facebookresearch/EGG/blob/5a68e295c31342385d024ecd5cbff0ff69ee69b0/egg/core/trainers.py#L307-L316). However, launching EGG outside of SLURM with --preemptable set is pointless anyway.
How Has This Been Tested?
Unit tests.