axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

An Arbitrary Number of Checkpoints are Saved #1520

Open Peter-Devine opened 5 months ago

Peter-Devine commented 5 months ago

Expected Behavior

When I start a training run without setting save_total_limit, I expect Axolotl to save all checkpoints.

Current behaviour

However, according to this code:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/132eb740f036eff0fa8b239ddaf0b7a359ed1732/src/axolotl/core/trainer_builder.py#L1168C22-L1168C38

save_total_limit defaults to 4 when it is not set. This seems arbitrary to me.
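
For readers who don't want to open the link, the fallback can be sketched roughly as follows. This is a paraphrase of the pattern as I understand it, not a copy of trainer_builder.py: the helper name `resolve_save_total_limit` and its `fallback` parameter are hypothetical, and the truthiness check is an assumption about how the unset case is detected.

```python
from typing import Optional


def resolve_save_total_limit(
    configured: Optional[int], fallback: Optional[int] = 4
) -> Optional[int]:
    """Return the checkpoint limit a run would end up using.

    Hypothetical helper: it mirrors the behaviour described in this issue,
    where an unset save_total_limit silently becomes 4.
    """
    # Assumption: a truthiness check, so None (and 0) count as "unset"
    # and are replaced by the fallback.
    return configured if configured else fallback


print(resolve_save_total_limit(None))  # unset in the YAML -> 4
print(resolve_save_total_limit(10))    # explicit value is passed through -> 10
```

With `fallback=None` the same helper would return None for an unset value, which is the "save everything" default asked for here.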

Steps to reproduce

Run the training code without setting save_total_limit explicitly.

Config yaml

No response

Possible solution

In my opinion this is not well documented, so I'd like either more documentation or a default that saves all checkpoints.

Python Version

3.10

axolotl branch-commit

main

NanoCode012 commented 5 months ago

Hm, this is a good point. Maybe we should default to None instead, as HF does: https://huggingface.co/docs/transformers/v4.40.0/en/main_classes/trainer#transformers.TrainingArguments.save_total_limit
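
To illustrate what a None default would mean in practice, here is a simplified simulation of checkpoint rotation. It assumes the documented HF semantics (None keeps every checkpoint, a number deletes the older ones) and ignores special cases such as keeping the best model. `surviving_checkpoints` is a hypothetical helper for illustration, not a transformers API.

```python
from typing import List, Optional


def surviving_checkpoints(
    saved_steps: List[int], save_total_limit: Optional[int]
) -> List[int]:
    """Return the checkpoint steps left on disk after rotation (simplified)."""
    ordered = sorted(saved_steps)
    # None (or a non-positive value) means no rotation: everything is kept.
    if save_total_limit is None or save_total_limit <= 0:
        return ordered
    # Otherwise only the most recent `save_total_limit` checkpoints survive.
    return ordered[-save_total_limit:]


steps = [100, 200, 300, 400, 500, 600]
print(surviving_checkpoints(steps, 4))     # [300, 400, 500, 600]
print(surviving_checkpoints(steps, None))  # [100, 200, 300, 400, 500, 600]
```

So with the current fallback of 4, a run that saves six checkpoints would lose the first two unless save_total_limit is set explicitly in the config.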