The saved_step_checkpoints list stores references to the same TrainingProgress object, which is updated in place during training. As a result, all elements in the list are identical and always reflect the current progress. Thus, when k > 0, only the first k checkpoints are kept, and every later checkpoint is created and then immediately deleted (since the entry selected as checkpoints_to_delete is the same as the most recently saved one).
Solution: clone the TrainingProgress object before adding it to the list of saved checkpoints.
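The aliasing problem can be reproduced in isolation. The sketch below is a simplified model, not the actual checkpointing code: the TrainingProgress dataclass, its num_seen_steps field, and the run helper are hypothetical stand-ins, and the set on_disk merely simulates saved checkpoint files.

```python
import copy
from dataclasses import dataclass


@dataclass
class TrainingProgress:
    # Hypothetical stand-in for the real class; the field name is an assumption.
    num_seen_steps: int = 0


def run(num_checkpoints: int, k: int, clone: bool) -> list[int]:
    """Simulate a keep-last-k strategy and return the steps that survive."""
    progress = TrainingProgress()
    saved_step_checkpoints: list[TrainingProgress] = []
    on_disk: set[int] = set()  # simulated checkpoint files, keyed by step

    for _ in range(num_checkpoints):
        progress.num_seen_steps += 1  # mutated in place, as during training

        # The fix: store a snapshot instead of a reference to the live object.
        entry = copy.deepcopy(progress) if clone else progress
        saved_step_checkpoints.insert(0, entry)
        on_disk.add(entry.num_seen_steps)  # "save" the checkpoint

        if len(saved_step_checkpoints) > k:
            checkpoint_to_delete = saved_step_checkpoints.pop()
            # Without cloning, this entry aliases the checkpoint just saved.
            on_disk.discard(checkpoint_to_delete.num_seen_steps)

    return sorted(on_disk)


buggy = run(num_checkpoints=5, k=2, clone=False)
fixed = run(num_checkpoints=5, k=2, clone=True)
print(buggy, fixed)
```

Without cloning, only the first k simulated checkpoints remain; with cloning, the most recent k remain. copy.deepcopy is used here for generality; a shallow copy may suffice if the real object only holds immutable fields.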
General Changes
Fixed the issue as described above.
Added an assert to the test that checks for this case.
Breaking Changes
none
Checklist before submitting final PR
[x] My PR is minimal and addresses one issue in isolation
[x] I have merged the latest version of the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[ ] I have run a sample config for model training
[x] I have checked that all tests run through (python tests/tests.py)
[ ] I have updated the internal changelog (CHANGELOG_DEV.md)