ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

Fix fast distributed checkpoint loading #30

jlamypoirier closed this 1 day ago

jlamypoirier commented 1 day ago

✨ Description

Distributed checkpoint loading failed because the workers disagreed on which loading method to use. This PR fixes that, makes checkpoint loading a bit less verbose, and adds some safety barriers.
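
To illustrate the failure mode, here is a minimal sketch of one way to keep workers in agreement: a single rank picks the loading method, every other rank adopts that choice, and barriers keep ranks from running ahead. This is not Fast-LLM's actual implementation. The names `choose_loading_method` and `run_ranks`, the `"fast"`/`"safe"` labels, and the use of threads to stand in for distributed workers are all assumptions made for the example.

```python
import threading


def choose_loading_method(local_check_passed: bool) -> str:
    """Hypothetical per-rank decision: use the fast path only if a local check passes."""
    return "fast" if local_check_passed else "safe"


def run_ranks(local_checks: list[bool]) -> list[str]:
    """Simulate one thread per rank and return the method each rank ends up using.

    Each rank may see a different local check result. If every rank decided
    on its own, the ranks could disagree. Here only rank 0 decides.
    """
    world_size = len(local_checks)
    barrier = threading.Barrier(world_size)
    shared: dict[str, str] = {}  # stands in for a broadcast from rank 0
    results: list[str] = [""] * world_size

    def worker(rank: int) -> None:
        if rank == 0:
            # Only rank 0 decides; its local view becomes the global choice.
            shared["method"] = choose_loading_method(local_checks[0])
        # Wait until rank 0 has published its decision.
        barrier.wait()
        results[rank] = shared["method"]
        # Safety barrier: no rank starts loading until all ranks have the method.
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Ranks 1 and 2 would have chosen differently on their own.
    print(run_ranks([True, False, False]))
```

In a real multi-process job, the shared dictionary would presumably be replaced by a collective such as `torch.distributed.broadcast_object_list`, and `threading.Barrier` by `torch.distributed.barrier`.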

🔍 Type of change

Select all that apply: