KookHoiKim opened 1 month ago
I am working with the nvcr.io/nvidia/pytorch:24.07-py3 image, which installs torch==2.4.0. When I comment out lines 256-257 in initialize.py (the two commented lines below), initialization no longer gets stuck.
```python
# Call the init process
init_process_group_kwargs = {
    'backend': args.distributed_backend,
    'world_size': args.world_size,
    'rank': args.rank,
    'timeout': timedelta(minutes=args.distributed_timeout_minutes),
}
# if packaging.version.Version(torch.__version__) >= packaging.version.Version("2.3.0"):
#     init_process_group_kwargs['device_id'] = device_id
```
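For context, here is a minimal, self-contained sketch of what that version guard does. The helper name `build_init_kwargs` and the string-valued `torch_version` parameter are my own illustration, not the actual Megatron code; in initialize.py the check is done against `torch.__version__` and the kwargs are passed straight to `torch.distributed.init_process_group`:

```python
from datetime import timedelta

def build_init_kwargs(backend, world_size, rank, timeout_minutes,
                      torch_version, device_id=None):
    """Sketch of the kwargs construction above (hypothetical helper)."""
    kwargs = {
        'backend': backend,
        'world_size': world_size,
        'rank': rank,
        'timeout': timedelta(minutes=timeout_minutes),
    }
    # torch >= 2.3.0 accepts a device_id kwarg in init_process_group,
    # which binds the process group to a specific device. Commenting out
    # this branch (the workaround reported above) means device_id is
    # simply never passed.
    major, minor = (int(x) for x in torch_version.split('.')[:2])
    if (major, minor) >= (2, 3) and device_id is not None:
        kwargs['device_id'] = device_id
    return kwargs
```

So the workaround effectively reverts to the pre-2.3 behavior where no device binding is requested at init time.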
**Describe the bug**
I am currently working with the llava model in Megatron. I tested tensor parallelism and it works well. However, when I enable pipeline parallelism, it gets stuck during initialization. I found that in `initialize_model_parallel`, the call
```python
group_gloo = torch.distributed.new_group(ranks, backend="gloo")
```
does not return on the rank-1 GPU. I am using 2 A100 GPUs, so I set TP=1. If anyone has any idea, please help. Thanks. FYI, I ran with NCCL_DEBUG=INFO to collect logs.
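For reference, the debug logs were collected by setting the NCCL environment variable on the launch command. The script name and argument values below are placeholders (the actual launch command is not in the post); only `NCCL_DEBUG=INFO` and the standard Megatron parallelism flags are real:

```shell
# Hypothetical 2-GPU launch; substitute your actual training script/args.
NCCL_DEBUG=INFO torchrun --nproc_per_node=2 pretrain_vlm.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 2
```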