ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

Fix fast distributed checkpoint loading #30

Closed jlamypoirier closed 3 weeks ago

jlamypoirier commented 3 weeks ago

✨ Description

Distributed checkpoint loading failed because workers disagreed on which loading method to use. This PR fixes that, makes checkpoint loading less verbose, and adds some safety barriers.
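To illustrate the failure mode described above, here is a minimal sketch of how workers can be made to agree on a loading method before any of them starts loading. It is not Fast-LLM's actual code: the function name `agree_on_loading_method`, the `"fast"`/`"slow"` labels, and the "use the fast path only if every worker can" rule are all assumptions made for illustration, and the ranks are simulated with threads from the standard library rather than real processes.

```python
import threading


def agree_on_loading_method(local_can_use_fast):
    """Simulate one worker per entry and return the method each worker picks.

    Hypothetical rule: the fast path is used only if every worker reports
    that it can use it; otherwise all workers fall back to the slow path.
    """
    world_size = len(local_can_use_fast)
    barrier = threading.Barrier(world_size)
    votes = [None] * world_size   # one slot per simulated rank
    chosen = [None] * world_size

    def worker(rank):
        # Each worker publishes its purely local view.
        votes[rank] = local_can_use_fast[rank]
        # Wait until every vote has been written before reading any of them.
        barrier.wait()
        # Every worker applies the same rule to the same data, so the
        # decision is identical on all ranks.
        chosen[rank] = "fast" if all(votes) else "slow"
        # Safety barrier: nobody proceeds to loading until all have decided.
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return chosen


if __name__ == "__main__":
    print(agree_on_loading_method([True, True, False, True]))
```

In a real multi-process job, the shared list and `threading.Barrier` would typically be replaced by collectives such as `torch.distributed.all_reduce` and `torch.distributed.barrier`. The point of the sketch is only that the decision is derived from globally shared information rather than from each worker's local state.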

🔍 Type of change

Select all that apply: