Checkpointing and resuming from checkpoints. As we start to train lenses for extremely large models like LLaMA 30B and LLaMA 65B, we should support saving our work out to disk as we go, so that a crash does not lose all of our training progress. In addition, this may allow us to run larger jobs that take advantage of downtime on existing clusters by running at low priority with preemption. In theory, it would also allow us to handle scale-up and scale-down events.
This would work by periodically saving the lens, optimizer, and dataset states to disk, and automatically loading the latest checkpoint when resuming.
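A minimal sketch of what that save/resume loop could look like, assuming the lens is a PyTorch module; the file name, `save_every` interval, and placeholder loss are illustrative, not part of any existing API:

```python
import os
import torch

def save_checkpoint(ckpt_path, lens, optimizer, step):
    """Persist lens weights, optimizer state, and the dataset position (step)."""
    torch.save(
        {
            "lens": lens.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,  # used to fast-forward the dataloader on resume
        },
        ckpt_path,
    )

def load_checkpoint(ckpt_path, lens, optimizer):
    """Restore state if a checkpoint exists; return the step to resume from."""
    if not os.path.exists(ckpt_path):
        return 0
    state = torch.load(ckpt_path, map_location="cpu")
    lens.load_state_dict(state["lens"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]

def train(lens, optimizer, dataloader, ckpt_path="lens_ckpt.pt", save_every=500):
    start_step = load_checkpoint(ckpt_path, lens, optimizer)
    for step, batch in enumerate(dataloader):
        if step < start_step:
            continue  # skip batches already consumed before the crash/preemption
        loss = lens(batch).mean()  # placeholder loss; the real training objective differs
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % save_every == 0:
            save_checkpoint(ckpt_path, lens, optimizer, step + 1)
```

Skipping already-consumed batches is the simplest way to restore the dataset position; a resumable dataloader state would avoid the replay cost but is more involved.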
This issue will probably only become relevant once we start training lenses on the very largest models or if we need to train on a cluster with time slicing.
Yep, I used to do this a while ago, but I stopped once I realized it was taking up a lot of disk space. If it's customizable, then it should be fine.