ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License
562 stars, 40 forks

save checkpoint with error #50

Open 13613979212 opened 2 months ago

13613979212 commented 2 months ago

I set max_steps=500 and save_steps=100. When training reaches step 100, the checkpoint is saved successfully, but then nccl_timeout is displayed.
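For reference, a minimal sketch of which steps would trigger a save with these settings. It assumes a checkpoint is written whenever the step count is a multiple of save_steps; the helper `checkpoint_steps` is hypothetical and not part of gritlm.

```python
from typing import List


def checkpoint_steps(max_steps: int, save_steps: int) -> List[int]:
    """Steps at which a checkpoint would be written.

    Assumption: a save happens whenever the step count is a
    multiple of save_steps, up to and including max_steps.
    """
    if max_steps < 0 or save_steps <= 0:
        raise ValueError("max_steps must be >= 0 and save_steps must be > 0")
    return list(range(save_steps, max_steps + 1, save_steps))


# Settings from the report above.
schedule = checkpoint_steps(max_steps=500, save_steps=100)
print(schedule)  # [100, 200, 300, 400, 500]
```

Under that assumption, the timeout described above appears right after the first of five scheduled saves.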

Muennighoff commented 2 months ago

I think that is the first issue mentioned here: https://github.com/ContextualAI/gritlm?tab=readme-ov-file#known-issues. I don't have a better solution at the moment than what is described there.