tjulyz opened this issue 4 years ago (status: Open)
@tjulyz @ChenRocks Hi, I've encountered the same problem recently.
Env: the Docker image provided by this repo, using 2 V100s on the same node.
Command: `horovodrun -np 2 python train_nlvr2.py --config config/train-nlvr2-base-1gpu.json`
The output is as below:
W horovod/common/operations.cc:779] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
@tjulyz have you solved this problem? @ChenRocks @linjieli222 do you have any hints?
After changing the code as shown below, it still gets stuck:
---original code---

```python
model.init_type_embedding()
model.to(device)
broadcast_tensors([p.data for p in model.parameters()], 0)
```

---modified code---

```python
model.init_type_embedding()
model.to(device)
if rank == 0:
    broadcast_tensors([p.data for p in model.parameters()], 0)
```
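For context, guarding the broadcast with `if rank == 0:` cannot work: `broadcast_tensors` is a collective operation, so every rank must call it (rank 0 sends, the other ranks receive). Calling it on only one rank is exactly the "only subset of ranks is submitting tensors" situation the warning describes. Here is a minimal sketch of the intended pattern using Horovod's standard `hvd.broadcast_parameters` helper; `build_model` is a placeholder, not a function from this repo:

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = build_model()  # placeholder for however the model is constructed
model.cuda()

# Every rank must reach this call: rank 0 is the source, the others receive.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```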
I've solved the problem. The stall happens in the data-loading step: a file lock around the file reads is needed for Horovod multiprocessing to work properly. My modified `main` looks like this:
```python
# Requires `from filelock import FileLock` (pip install filelock) in addition
# to the imports already in the training script (hvd, torch, os, LOGGER, ...).
def main(opts):
    hvd.init()
    n_gpu = hvd.size()
    device = torch.device("cuda", hvd.local_rank())
    torch.cuda.set_device(hvd.local_rank())
    rank = hvd.rank()
    opts.rank = rank
    LOGGER.info("device: {} n_gpu: {}, rank: {}, "
                "16-bits training: {}".format(
                    device, n_gpu, hvd.rank(), opts.fp16))

    if opts.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, "
                         "should be >= 1".format(
                             opts.gradient_accumulation_steps))

    set_random_seed(opts.seed)

    # train_examples = None
    LOGGER.info(f"Rank {rank} Loading Train Dataset {opts.train_txt_db}, "
                f"{opts.train_img_db}")
    if 'mami' in opts.model:
        DatasetCls = MamiDataset
        EvalDatasetCls = MamiEvalDataset
        collate_fn = mami_collate
        eval_collate_fn = mami_eval_collate
        if opts.model == 'mami':
            ModelCls = UniterForMami
    else:
        raise ValueError('unrecognized model type')

    # data loaders: serialize the file reads through a file lock so that
    # ranks do not stall each other while opening the db files
    with FileLock(os.path.expanduser("~/.horovod_lock_train")):
        train_dataloader = create_dataloader(opts.train_img_db, opts.train_txt_db,
                                             opts.train_batch_size, True,
                                             DatasetCls, collate_fn, opts)
    with FileLock(os.path.expanduser("~/.horovod_lock_val")):
        val_dataloader = create_dataloader(opts.val_img_db, opts.val_txt_db,
                                           opts.val_batch_size, False,
                                           EvalDatasetCls, eval_collate_fn, opts)
    with FileLock(os.path.expanduser("~/.horovod_lock_test")):
        test_dataloader = create_dataloader(opts.test_img_db, opts.test_txt_db,
                                            opts.val_batch_size, False,
                                            EvalDatasetCls, eval_collate_fn, opts)
```
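To isolate the pattern, here is a minimal, self-contained sketch of the same idea; the lock path and `prepare_data` are placeholders. Each process serializes the file-reading section through an on-disk lock, so no rank falls so far behind the others that Horovod's stall inspector fires:

```python
import os
import horovod.torch as hvd
from filelock import FileLock  # pip install filelock

hvd.init()

# Only one process at a time enters this section; the other ranks wait on
# the lock instead of racing on the underlying db files.
with FileLock(os.path.expanduser("~/.data_prep.lock")):
    dataset = prepare_data()  # placeholder for whatever reads the db files
```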
Hi, I ran into the following problem recently:
W /tmp/pip-install-fzrlm1c4/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock
It happened when broadcasting the model using train_vpa. I ran the code on two GPUs. Do you know how to solve it? Thanks.
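If it helps to narrow things down, one simple check is to log on every rank right before the broadcast: if one rank never prints the line, that rank is stuck earlier (for example in data loading, as above), and the other rank will sit inside the broadcast and emit the stall warning. A hedged sketch, assuming the usual Horovod setup; the `torch.nn.Linear` model is just a stand-in:

```python
import logging
import torch
import horovod.torch as hvd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(4, 2).cuda()  # stand-in for the real model

# If one rank never prints "reached broadcast", it is stuck earlier and the
# other rank will wait here and trigger the stall-inspector warning.
log.info("rank %d reached broadcast", hvd.rank())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
log.info("rank %d finished broadcast", hvd.rank())
```

Saved as e.g. `check_broadcast.py`, it can be run with `horovodrun -np 2 python check_broadcast.py` to confirm both ranks get past the collective.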