✨ Description

Rework the state_dict checkpoint format. Old checkpoints should still be loadable for now.

Rename the format to fast_llm, to stress that it is the standard Fast-LLM checkpointing format.
Replace state_dict.safetensors.index.json (copied from the Hugging Face format) with a cleaner metadata.yaml that matches the distributed format.
Rename the safetensors files from state_dict to model. They were named state_dict to make the difference from the HF format more obvious, but metadata.yaml now makes that clear enough.
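
For external tools that need to tell old and new checkpoints apart, a minimal sketch is to check which of the two index files named above is present. The helper `detect_checkpoint_layout` and its return values are hypothetical and not part of Fast-LLM. Since the distributed format also uses metadata.yaml, this check alone does not distinguish fast_llm from distributed checkpoints.

```python
from pathlib import Path

# File names taken from this description; everything else is illustrative.
NEW_METADATA = "metadata.yaml"
OLD_INDEX = "state_dict.safetensors.index.json"


def detect_checkpoint_layout(directory):
    """Return "metadata_yaml", "legacy_index", or None for a checkpoint directory.

    Hypothetical helper: it only looks at which index file exists and does
    not parse or validate the file contents.
    """
    directory = Path(directory)
    # Prefer the new metadata file if both happen to be present.
    if (directory / NEW_METADATA).is_file():
        return "metadata_yaml"
    if (directory / OLD_INDEX).is_file():
        return "legacy_index"
    return None
```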

Change the checkpoint directory structure. This is backward compatible within Fast-LLM, but not for external tools that rely on the old paths:

export/ -> export/format/
checkpoints -> checkpoint
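
Since external tools are not covered by the backward compatibility, they may need to handle both layouts during the transition. The sketch below tries the new paths first and falls back to the old ones. The function names, the `run_dir` argument, and the assumption that both directories sit directly under the run directory are illustrative, not taken from Fast-LLM.

```python
from pathlib import Path


def find_checkpoint_root(run_dir):
    """Return the checkpoint directory, preferring the new name.

    Assumes (hypothetically) that the directory lives directly under run_dir.
    """
    run_dir = Path(run_dir)
    for name in ("checkpoint", "checkpoints"):  # new name first, then old
        candidate = run_dir / name
        if candidate.is_dir():
            return candidate
    return None


def find_export_root(run_dir, format_name):
    """Return export/<format_name> if it exists, else the old flat export/."""
    export = Path(run_dir) / "export"
    new_style = export / format_name
    if new_style.is_dir():
        return new_style
    if export.is_dir():
        return export
    return None
```

With this, a script written against the old layout keeps working on old runs, and picks up the new directories once they exist.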

This should be it for checkpoints for now. There is more to do (#26, async checkpoints, polishing interfaces, tests, etc.), but I have spent enough time on checkpoints.

🔍 Type of change

Select all that apply:

[ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
[x] 🚀 New feature (non-breaking change that adds functionality)
[x] ⚠️ Breaking change (a change that could affect existing functionality)
[ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
[x] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
[ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
[ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
[ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

ServiceNow / Fast-LLM

New long-term checkpoint format #33
