allenai / open-instruct


OLMo 1.7 fine-tuning throws an error at the end of training, though checkpoints are saved for all epochs #167

Closed: nouhadziri closed this 3 months ago

nouhadziri commented 4 months ago

Here's the error stack trace:

```

2024-05-26T03:20:45.253031734Z   File "/gantry-runtime/open_instruct/finetune.py", line 891, in <module>
2024-05-26T03:20:45.253034168Z     main()
2024-05-26T03:20:45.253044166Z   File "/gantry-runtime/open_instruct/finetune.py", line 887, in main
2024-05-26T03:20:45.253045579Z     save_with_accelerate(accelerator, model, tokenizer, args.output_dir, args)
2024-05-26T03:20:45.253046847Z   File "/gantry-runtime/open_instruct/finetune.py", line 377, in save_with_accelerate
2024-05-26T03:20:45.253048170Z     state_dict = accelerator.get_state_dict(model)
2024-05-26T03:20:45.253049472Z   File "/opt/miniconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2922, in get_state_dict
2024-05-26T03:20:45.253050901Z     state_dict = model._zero3_consolidated_16bit_state_dict()
2024-05-26T03:20:45.253052150Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3541, in _zero3_consolidated_16bit_state_dict
2024-05-26T03:20:45.253053560Z     get_layer_state_dict(self.module, prefix="")
2024-05-26T03:20:45.253055058Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3534, in get_layer_state_dict
2024-05-26T03:20:45.253056535Z     get_layer_state_dict(child, prefix + name + ".")
2024-05-26T03:20:45.253057786Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3534, in get_layer_state_dict
2024-05-26T03:20:45.253059122Z     get_layer_state_dict(child, prefix + name + ".")
2024-05-26T03:20:45.253060363Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3534, in get_layer_state_dict
2024-05-26T03:20:45.253063056Z     get_layer_state_dict(child, prefix + name + ".")
2024-05-26T03:20:45.253064254Z   [Previous line repeated 2 more times]
2024-05-26T03:20:45.253065637Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3507, in get_layer_state_dict
2024-05-26T03:20:45.253066889Z     with deepspeed.zero.GatheredParameters(list(module.parameters(recurse=False)), modifier_rank=0):
2024-05-26T03:20:45.253068402Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2181, in __exit__
2024-05-26T03:20:45.253069709Z     handles = [dist.broadcast(p.data, self.src_rank, group=p.ds_process_group, async_op=True) for p in self.params]
2024-05-26T03:20:45.253071120Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2181, in <listcomp>
2024-05-26T03:20:45.253072554Z     handles = [dist.broadcast(p.data, self.src_rank, group=p.ds_process_group, async_op=True) for p in self.params]
2024-05-26T03:20:45.253074734Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
2024-05-26T03:20:45.253076054Z     return func(*args, **kwargs)
2024-05-26T03:20:45.253077339Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
2024-05-26T03:20:45.253078635Z     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
2024-05-26T03:20:45.253079989Z   File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 199, in broadcast
2024-05-26T03:20:45.253083653Z     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
2024-05-26T03:20:45.253085029Z   File "/opt/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
2024-05-26T03:20:45.253086356Z     return func(*args, **kwargs)
2024-05-26T03:20:45.253087630Z   File "/opt/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1566, in broadcast
2024-05-26T03:20:45.253088940Z     work = default_pg.broadcast([tensor], opts)
2024-05-26T03:20:45.253090670Z RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106171200, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.
2024-05-26T03:20:47.549423060Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3016) of binary: /opt/miniconda3/bin/python
```
natolambert commented 4 months ago

Interesting. @nouhadziri, I'm guessing this is one node with 8 GPUs? Are they A100s/H100s/other? Are you using my new open-instruct-olmo image?

nouhadziri commented 4 months ago

Yes, this is on one node with 8 GPUs (H100s), and I used your OLMo image nathanl/open_instruct_olmo.

natolambert commented 4 months ago

Interesting. I didn't hit this. We should try increasing the NCCL timeout via the accelerate arg (I'll look for it shortly).

natolambert commented 4 months ago

Try setting the timeout arg to 18000. It's in both the DPO and fine-tune scripts; a sketch of how that value reaches NCCL is below.
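
For reference, this is roughly how a timeout value like that is typically plumbed into the NCCL process group through accelerate. The exact flag name and wiring in open-instruct's scripts may differ, so treat this as a minimal sketch rather than the actual implementation:

```python
# Minimal sketch, assuming the script exposes a timeout value in seconds;
# not the exact open-instruct code.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

timeout_seconds = 18000  # value suggested above; the NCCL default is 1800 s (30 min)

# The timeout is forwarded to torch.distributed.init_process_group, so slow
# collectives (e.g. the ZeRO-3 state-dict gather at the end of training) get
# more headroom before the NCCL watchdog aborts the communicator.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=timeout_seconds))]
)
```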

nouhadziri commented 4 months ago

Will do in my next job, thanks for investigating this!

hamishivi commented 4 months ago

Might be worth upping the timeout default. Currently it's set to the DeepSpeed default, IIRC.
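
If the default were raised in the scripts themselves, it could look something like this (hypothetical argparse snippet; the actual argument definition in the fine-tune/DPO scripts may be named or structured differently):

```python
# Hypothetical sketch of raising the default timeout; not the actual open-instruct code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--timeout",
    type=int,
    default=18000,  # up from the usual 1800-second (30-minute) NCCL default
    help="Process-group timeout in seconds; long ZeRO-3 checkpoint gathers "
         "can exceed the 30-minute default.",
)
args = parser.parse_args()
```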

hamishivi commented 3 months ago

As part of #151 I tested fine-tuning OLMo and didn't hit this issue. Please re-open or open a new issue if you see this pop up again!