Open 13613979212 opened 2 months ago
I set max_steps=500 and save_steps=100. When training reaches step 100, the checkpoint is saved successfully, but then nccl_timeout is displayed.
I think that is the first issue mentioned here: https://github.com/ContextualAI/gritlm?tab=readme-ov-file#known-issues. I don't have a better solution at the moment than what is described there.