Modalities / modalities

Modalities, a PyTorch-native framework for distributed and reproducible foundation model training.
MIT License

fix: clone TrainingProgress when saving list of saved checkpoints #268

Closed sthoduka closed 3 days ago

sthoduka commented 3 days ago

What does this PR do?

The saved_step_checkpoints list does not hold independent snapshots: each entry is a reference to the same TrainingProgress object, which is updated during training. As a result, every element in the list is identical and always reflects the current progress. Thus, in cases where k > 0, only the first k checkpoints are kept, and every later checkpoint is created and then immediately deleted (since the entry in checkpoints_to_delete is the same as the most recently saved one).

Solution: clone the TrainingProgress object before adding it to the list of saved checkpoints.
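
The aliasing problem and the fix can be reproduced in isolation. The following is a minimal sketch, not the Modalities implementation: the `TrainingProgress` dataclass, its `num_seen_steps` field, the `run` helper, and the use of `copy.deepcopy` as the cloning mechanism are all assumptions made for illustration. It simulates a "keep the k most recent checkpoints" policy with and without cloning.

```python
import copy
from dataclasses import dataclass


@dataclass
class TrainingProgress:
    # Hypothetical stand-in for the real class; the field name is an assumption.
    num_seen_steps: int = 0


def run(num_steps: int, k: int, clone: bool) -> list[int]:
    """Simulate keeping the k most recent checkpoints; return steps left on disk."""
    progress = TrainingProgress()
    saved_step_checkpoints: list[TrainingProgress] = []  # most recent first
    on_disk: set[int] = set()

    for _ in range(num_steps):
        progress.num_seen_steps += 1  # the single object is mutated every step

        # "Save" a checkpoint for the current step.
        on_disk.add(progress.num_seen_steps)
        entry = copy.deepcopy(progress) if clone else progress
        saved_step_checkpoints.insert(0, entry)

        # Delete everything beyond the k most recent entries.
        checkpoints_to_delete = saved_step_checkpoints[k:]
        saved_step_checkpoints = saved_step_checkpoints[:k]
        for old in checkpoints_to_delete:
            # Without cloning, `old` is the live object, so this removes
            # the checkpoint that was just written.
            on_disk.discard(old.num_seen_steps)

    return sorted(on_disk)


buggy = run(num_steps=5, k=2, clone=False)  # only the first k steps survive
fixed = run(num_steps=5, k=2, clone=True)   # the k most recent steps survive
print(buggy, fixed)
```

Without cloning, every list entry reports the current step, so the deletion pass always targets the checkpoint that was just saved. With cloning, each entry keeps the step at which it was recorded, and the oldest checkpoint is removed as intended.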

General Changes

Breaking Changes

none

Checklist before submitting final PR