tjulyz opened this issue 4 years ago (status: Open)
@tjulyz @ChenRocks Hi, I've encountered the same problem recently.
Env: the Docker image provided by this repo, using 2 V100s on the same node.
Command: `horovodrun -np 2 python train_nlvr2.py --config config/train-nlvr2-base-1gpu.json`
The output is as below:
W horovod/common/operations.cc:779] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
@tjulyz have you solved this problem? @ChenRocks @linjieli222 do you have any hints?
After changing the code as shown below, it still gets stuck:
---original code---

```python
model.init_type_embedding()
model.to(device)
broadcast_tensors([p.data for p in model.parameters()], 0)
```

---modified code---

```python
model.init_type_embedding()
model.to(device)
if rank == 0:
    broadcast_tensors([p.data for p in model.parameters()], 0)
```
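For context, guarding the broadcast with `if rank == 0:` cannot work: `broadcast_tensors` is a collective operation, so every rank must call it (rank 0 sends, the other ranks receive). Calling it on only one rank is exactly the "only subset of ranks is submitting tensors" situation the warning describes. Here is a minimal sketch of the intended pattern using Horovod's standard `hvd.broadcast_parameters` helper; `build_model` is a placeholder, not a function from this repo:

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = build_model()  # placeholder for however the model is constructed
model.cuda()

# Every rank must reach this call: rank 0 is the source, the others receive.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```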
I've solved the problem. The stall happens in the data-loading step: a file lock around the file reads is needed for Horovod multiprocessing to work properly. My modified `main` looks like this:
```python
# Requires `from filelock import FileLock` (pip install filelock) in addition
# to the imports already in the training script (hvd, torch, os, LOGGER, ...).
def main(opts):
    hvd.init()
    n_gpu = hvd.size()
    device = torch.device("cuda", hvd.local_rank())
    torch.cuda.set_device(hvd.local_rank())
    rank = hvd.rank()
    opts.rank = rank
    LOGGER.info("device: {} n_gpu: {}, rank: {}, "
                "16-bits training: {}".format(
                    device, n_gpu, hvd.rank(), opts.fp16))

    if opts.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, "
                         "should be >= 1".format(
                             opts.gradient_accumulation_steps))

    set_random_seed(opts.seed)

    # train_examples = None
    LOGGER.info(f"Rank {rank} Loading Train Dataset {opts.train_txt_db}, "
                f"{opts.train_img_db}")
    if 'mami' in opts.model:
        DatasetCls = MamiDataset
        EvalDatasetCls = MamiEvalDataset
        collate_fn = mami_collate
        eval_collate_fn = mami_eval_collate
        if opts.model == 'mami':
            ModelCls = UniterForMami
    else:
        raise ValueError('unrecognized model type')

    # data loaders: serialize the file reads through a file lock so that
    # ranks do not stall each other while opening the db files
    with FileLock(os.path.expanduser("~/.horovod_lock_train")):
        train_dataloader = create_dataloader(opts.train_img_db, opts.train_txt_db,
                                             opts.train_batch_size, True,
                                             DatasetCls, collate_fn, opts)
    with FileLock(os.path.expanduser("~/.horovod_lock_val")):
        val_dataloader = create_dataloader(opts.val_img_db, opts.val_txt_db,
                                           opts.val_batch_size, False,
                                           EvalDatasetCls, eval_collate_fn, opts)
    with FileLock(os.path.expanduser("~/.horovod_lock_test")):
        test_dataloader = create_dataloader(opts.test_img_db, opts.test_txt_db,
                                            opts.val_batch_size, False,
                                            EvalDatasetCls, eval_collate_fn, opts)
```
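To isolate the pattern, here is a minimal, self-contained sketch of the same idea; the lock path and `prepare_data` are placeholders. Each process serializes the file-reading section through an on-disk lock, so no rank falls so far behind the others that Horovod's stall inspector fires:

```python
import os
import horovod.torch as hvd
from filelock import FileLock  # pip install filelock

hvd.init()

# Only one process at a time enters this section; the other ranks wait on
# the lock instead of racing on the underlying db files.
with FileLock(os.path.expanduser("~/.data_prep.lock")):
    dataset = prepare_data()  # placeholder for whatever reads the db files
```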
Hi, I ran into the following problem recently:
W /tmp/pip-install-fzrlm1c4/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock
It happened when broadcasting the model using train_vpa. I ran the code on two GPUs. Do you know how to solve it? Thanks.
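If it helps to narrow things down, one simple check is to log on every rank right before the broadcast: if one rank never prints the line, that rank is stuck earlier (for example in data loading, as above), and the other rank will sit inside the broadcast and emit the stall warning. A hedged sketch, assuming the usual Horovod setup; the `torch.nn.Linear` model is just a stand-in:

```python
import logging
import torch
import horovod.torch as hvd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(4, 2).cuda()  # stand-in for the real model

# If one rank never prints "reached broadcast", it is stuck earlier and the
# other rank will wait here and trigger the stall-inspector warning.
log.info("rank %d reached broadcast", hvd.rank())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
log.info("rank %d finished broadcast", hvd.rank())
```

Saved as e.g. `check_broadcast.py`, it can be run with `horovodrun -np 2 python check_broadcast.py` to confirm both ranks get past the collective.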