Issue: Our training runs never finish in a single invocation of `train.py`. Currently we don't have a convenient way to continue training with the same config; we have to manually set `load_path` to the latest checkpoint.
Fix: Add an option that tries loading the latest checkpoint from the local and remote save folders (assuming `load_path` is not set). If there are no checkpoints in either folder, then the model initializes from scratch as usual. If this option (`--try_load_latest_save`) is set to `True` for both the initial and subsequent runs, then the first run will initialize and save an initial checkpoint while subsequent runs will resume from the latest checkpoint.
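A minimal sketch of the lookup this option performs, assuming checkpoints are saved as `step<N>` subdirectories of the save folder; the helper name `find_latest_checkpoint` and the directory layout are illustrative rather than the actual implementation, and a real remote save folder would need a storage-specific listing instead of `pathlib`:

```python
import re
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(save_folder: str) -> Optional[str]:
    """Return the newest ``step<N>`` checkpoint directory in ``save_folder``, or None."""
    folder = Path(save_folder)
    if not folder.is_dir():
        return None
    steps = {}
    for path in folder.iterdir():
        match = re.fullmatch(r"step(\d+)", path.name)
        if path.is_dir() and match:
            steps[int(match.group(1))] = str(path)
    # Highest step number wins; an empty folder means "initialize from scratch".
    return steps[max(steps)] if steps else None
```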
UPDATE: Changed `try_load_latest_save` to override `load_path`. This makes it possible to use the same config for the first and subsequent runs even when the first run starts from an existing checkpoint and saves to a different location.
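With this update, the resolution order could look roughly like the following sketch (reusing the hypothetical `find_latest_checkpoint` helper above; the `cfg` field names are assumptions): the save folders are checked first, and `load_path` is only consulted when no saved checkpoint exists.

```python
from typing import Optional


def resolve_initial_checkpoint(cfg) -> Optional[str]:
    """Decide which checkpoint (if any) to restore before training starts."""
    if cfg.try_load_latest_save:
        # Check the local save folder first, then the remote one.
        for folder in (cfg.save_folder, cfg.remote_save_folder):
            if folder is not None:
                latest = find_latest_checkpoint(folder)
                if latest is not None:
                    return latest
    # No saved checkpoint found: fall back to load_path, or start from scratch if unset.
    return cfg.load_path
```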