What does this PR do?

The list saved_step_checkpoints stores references to the TrainingProgress object itself, and that object is updated in place during training. All elements in the list are therefore the same object and always reflect the current progress. As a result, when k > 0, only the first k checkpoints are saved: every later checkpoint is created and immediately deleted, because the entry selected for checkpoints_to_delete is the same object as the most recently saved one.
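The aliasing can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming a hypothetical TrainingProgress dataclass with a single num_seen_steps counter; the real class has more fields. Appending the same mutable object on every step leaves the list holding one object several times.

```python
from dataclasses import dataclass


@dataclass
class TrainingProgress:
    # Hypothetical stand-in for the real class; only one counter is modeled.
    num_seen_steps: int = 0


progress = TrainingProgress()
saved_step_checkpoints = []

for step in range(1, 4):
    progress.num_seen_steps = step           # the object is mutated in place
    saved_step_checkpoints.append(progress)  # appends a reference, not a snapshot

print([p.num_seen_steps for p in saved_step_checkpoints])  # [3, 3, 3]
print(all(p is progress for p in saved_step_checkpoints))  # True
```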

Solution: clone the TrainingProgress object before adding it to the list of saved checkpoints.
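A simplified model of a keep-k-most-recent strategy shows both the failure mode and the effect of cloning. The function simulate_save_k_most_recent, the single-counter TrainingProgress dataclass, and the oldest-first eviction order are hypothetical and do not mirror the repository's actual code; copy.deepcopy stands in for whatever cloning mechanism the fix uses.

```python
import copy
from dataclasses import dataclass


@dataclass
class TrainingProgress:
    # Hypothetical stand-in for the real class; only one counter is modeled.
    num_seen_steps: int = 0


def simulate_save_k_most_recent(k: int, num_steps: int, clone: bool) -> set:
    """Hypothetical keep-k-most-recent strategy.

    Returns the step numbers whose checkpoint files remain after training.
    """
    progress = TrainingProgress()
    saved = []       # plays the role of saved_step_checkpoints
    on_disk = set()  # step numbers of checkpoint files that currently exist
    for step in range(1, num_steps + 1):
        progress.num_seen_steps = step
        on_disk.add(step)  # a checkpoint for the current step is written
        # Without cloning, every entry is the same mutable object.
        saved.append(copy.deepcopy(progress) if clone else progress)
        if len(saved) > k:
            to_delete = saved.pop(0)  # evict the oldest entry
            on_disk.discard(to_delete.num_seen_steps)
    return on_disk


print(simulate_save_k_most_recent(k=2, num_steps=5, clone=False))  # {1, 2}
print(simulate_save_k_most_recent(k=2, num_steps=5, clone=True))   # {4, 5}
```

Without cloning, the "oldest" entry always reports the current step, so the checkpoint that was just written is the one deleted and only the first k survive. With cloning, each entry is a snapshot and the k most recent checkpoints are kept, as intended.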

General Changes

fixed the bug described above by cloning the TrainingProgress object
added an assert to the test that checks for this case

Breaking Changes

none

Checklist before submitting final PR

[x] My PR is minimal and addresses one issue in isolation
[x] I have merged the latest version of the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[ ] I have run a sample config for model training
[x] I have checked that all tests run through (python tests/tests.py)
[ ] I have updated the internal changelog (CHANGELOG_DEV.md)

Modalities / modalities

fix: clone TrainingProgress when saving list of saved checkpoints #268