axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Auto resume from checkpoint looks for "trainer_state.json", but no file is generated #1631

Open l3utterfly opened 5 months ago

l3utterfly commented 5 months ago


Expected Behavior

Auto-resume from the latest checkpoint should work.

Current behaviour

Resuming fails because the trainer looks for a file that does not exist:

FileNotFoundError: [Errno 2] No such file or directory: 'outputs/out/checkpoint-9/trainer_state.json'

However, this file is never generated in the checkpoint directory.

Steps to reproduce

  1. Start training a Llama 3 model
  2. Wait for one checkpoint to be saved
  3. Stop the run, then resume it
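
Before resuming, it can help to check which checkpoint directories actually contain the state file. The following is a minimal diagnostic sketch, not part of axolotl: the function name `latest_resumable_checkpoint` is hypothetical, and it assumes checkpoints are saved as `checkpoint-<step>` folders under the output directory, as the path in the error message suggests.

```python
import re
from pathlib import Path
from typing import Optional

# File the resume logic expects, taken from the error message above.
STATE_FILE = "trainer_state.json"
# Assumed naming scheme for checkpoint folders, e.g. "checkpoint-9".
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_resumable_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the highest-step checkpoint dir that contains STATE_FILE, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    candidates = []
    for child in root.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        # Only keep checkpoint folders where the state file was actually written.
        if match and child.is_dir() and (child / STATE_FILE).is_file():
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    # Compare by step number so that checkpoint-10 ranks above checkpoint-9.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    # "outputs/out" is the output directory that appears in the error message.
    print(latest_resumable_checkpoint("outputs/out"))
```

If this prints `None` while `checkpoint-9` exists on disk, the checkpoint was saved without its state file, which matches the reported error.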

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main-latest

winglian commented 5 months ago

@l3utterfly Can you paste the ls output for the contents of that directory?
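
Where a shell is not handy, a rough Python equivalent of the requested `ls` can be sketched as below. The helper name `describe_checkpoint` is hypothetical and the output format is only an illustration; the directory path is the one from the error message.

```python
from pathlib import Path


def describe_checkpoint(checkpoint_dir: str) -> list[str]:
    """List the entries of a checkpoint directory, one line per entry."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return [f"{checkpoint_dir}: not a directory"]

    lines = []
    # Sort by name so the listing is stable and easy to paste into the issue.
    for entry in sorted(root.iterdir(), key=lambda p: p.name):
        if entry.is_dir():
            lines.append(f"{entry.name}/")
        else:
            lines.append(f"{entry.name}\t{entry.stat().st_size} bytes")
    return lines


if __name__ == "__main__":
    for line in describe_checkpoint("outputs/out/checkpoint-9"):
        print(line)
```

The file sizes are included because an empty or truncated file can point to an interrupted save rather than a missing one.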