allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Added ability to try loading the latest checkpoint from save folders #717

Closed: 2015aroras closed this 2 months ago

2015aroras commented 3 months ago

Issue: Our training runs never finish within a single run of train.py. Currently there is no convenient way to continue training with the same config; we have to manually set load_path to the latest checkpoint before each restart.

Fix: Add an option that tries loading the latest checkpoint from the local and remote save folders (assuming load_path is not set). If there are no checkpoints in either folder, then the model initializes from scratch as usual. If this option (--try_load_latest_save) is set to True for both the initial and subsequent runs, then the first run will initialize and save an initial checkpoint while subsequent runs will resume from the latest checkpoint.

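UPDATE: Changed try_load_latest_save to override load_path. This allows the same config to be used for the first and subsequent runs when a run starts from an existing checkpoint (via load_path) but saves to a different location: the first run loads from load_path, and later runs resume from the latest checkpoint in the save folders.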
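
For readers skimming the thread, the resolution order described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function name `resolve_load_path` and its `local_latest` / `remote_latest` parameters are hypothetical, and preferring the local folder over the remote one when both contain a checkpoint is an assumption.

```python
from typing import Optional


def resolve_load_path(
    load_path: Optional[str],
    try_load_latest_save: bool,
    local_latest: Optional[str],
    remote_latest: Optional[str],
) -> Optional[str]:
    """Return the checkpoint path to load, or None to initialize from scratch.

    Hypothetical helper: `local_latest` and `remote_latest` stand for the
    latest checkpoint found in the local and remote save folders, or None
    if that folder has no checkpoint.
    """
    if try_load_latest_save:
        # Assumption: a local checkpoint is preferred over a remote one.
        if local_latest is not None:
            return local_latest
        if remote_latest is not None:
            return remote_latest
    # Option off, or no saved checkpoint found: fall back to load_path,
    # which may itself be None (initialize from scratch).
    return load_path


if __name__ == "__main__":
    # First run: nothing saved yet, so the configured load_path is used.
    print(resolve_load_path("ckpt/base", True, None, None))
    # Later run: a saved checkpoint overrides load_path.
    print(resolve_load_path("ckpt/base", True, None, "remote/run/latest"))
```

With the option enabled, a saved checkpoint takes precedence over load_path, which is why one config can serve both the first run and every restart.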

2015aroras commented 3 months ago

Tested the 3 main scenarios: no existing checkpoint, only remote checkpoints, local + remote checkpoints.