facebookresearch / dpr-scale

Scalable training for dense retrieval models.

Error during embeddings generation #4

Open roddar92 opened 2 years ago

roddar92 commented 2 years ago

Dear colleagues, when I try to generate embeddings, I get the following error:

Testing: 100%|████████████████████████████████████████████████████████████████████████████| 1851/1851 [02:41<00:00, 13.31it/s]
Writing tensor of size torch.Size([29606, 768]) to /root/dpr/ctx_embeddings/reps_0000.pkl
Error executing job with overrides: ['trainer.gpus=1', 'datamodule=generate', 'datamodule.test_path=/root/dpr/python_docs_w100.tsv', 'datamodule.test_batch_size=16', '+task.ctx_embeddings_dir=/root/dpr/ctx_embeddings', '+task.checkpoint_path=/root/dpr/trained_only_by_answers.ckpt', '+task.pretrained_checkpoint_path=/root/dpr/trained_only_by_answers.ckpt']
Traceback (most recent call last):
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 30, in <module>
    main()
  File "/root/.local/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 385, in _run_hydra
    run_and_report(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 161, in sweep
    _ = r.return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 26, in main
    trainer.test(task, datamodule=datamodule)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 914, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 972, in __test_given_model
    results = self.fit(model)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit
    self.dispatch()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in dispatch
    self.accelerator.start_testing(self)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in start_testing
    self.training_type_plugin.start_testing(trainer)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 118, in start_testing
    self._results = trainer.run_test()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 785, in run_test
    eval_loop_results, _ = self.run_evaluation()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in run_evaluation
    deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 187, in evaluation_epoch_end
    deprecated_results = self.__run_eval_epoch_end(self.num_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 219, in __run_eval_epoch_end
    eval_results = model.test_epoch_end(eval_results)
  File "/root/dpr-scale/dpr_scale/task/dpr_eval_task.py", line 49, in test_epoch_end
    torch.distributed.barrier()  # make sure rank 0 waits for all to complete
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2708, in barrier
    default_pg = _get_default_group()
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Testing: 100%|██████████| 1851/1851 [02:42<00:00, 11.40it/s]

Do you know how to fix it?

ccsasuke commented 2 years ago

Hi @roddar92, this looks like a bug in the code. You're running on a single GPU, so distributed training is never initialized, which in turn leads to this error.
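
For context, a quick way to confirm the diagnosis is to check the process-group state in a plain single-process Python session (a minimal standalone sketch, not code from dpr-scale):

import torch.distributed as dist

# Without a prior init_process_group() call there is no default process group,
# which is why the barrier() in test_epoch_end() raises the RuntimeError above.
print(dist.is_available())    # True if this PyTorch build has distributed support
print(dist.is_initialized())  # False in a single-process, single-GPU run like this one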

Could you try adding an if torch.distributed.is_initialized(): check before this line? https://github.com/facebookresearch/dpr-scale/blob/e8eb457edb0f0781f4bb5ebf5a157b7df23d952a/dpr_scale/task/dpr_eval_task.py#L49
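
For illustration, a minimal sketch of what the guarded call in test_epoch_end (dpr_scale/task/dpr_eval_task.py) might look like; the rest of the method is elided and assumed unchanged:

import torch

def test_epoch_end(self, outputs):
    ...  # existing code before the barrier (elided), e.g. writing this rank's embedding shard
    if torch.distributed.is_initialized():
        torch.distributed.barrier()  # make sure rank 0 waits for all to complete
    ...  # existing code after the barrier (elided)

With that guard, multi-GPU runs still synchronize at the barrier, while a single-GPU run like yours simply skips it.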