facebookresearch / dpr-scale

Scalable training for dense retrieval models.

Question about deepspeed option #3

Closed roddar92 closed 1 year ago

roddar92 commented 2 years ago

Dear colleagues,

During training on several GPUs, I get an exception like this:

  File "/root/dpr-scale/dpr_scale/task/dpr_task.py", line 272, in validation_epoch_end
    self._eval_epoch_end(valid_outputs)
  File "/root/dpr-scale/dpr_scale/task/dpr_task.py", line 266, in _eval_epoch_end
    self.log_dict(metrics, on_epoch=True, sync_dist=True)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 343, in log_dict
    self.log(
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 286, in log
    self._results.log(
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/core/step_result.py", line 149, in log
    value = sync_fn(value, group=sync_dist_group, reduce_op=sync_dist_op)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 290,
in reduce
    output = sync_ddp_if_available(output, group, reduce_op)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py", line 129, in s
ync_ddp_if_available
    return sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py", line 162, in s
ync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_r
educe
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

How can I fix this error at the validation step?
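
One workaround I'm considering is casting the metric values to dense CUDA tensors before the sync_dist all-reduce. A minimal sketch against `_eval_epoch_end` (the metric computation and helper name here are assumed, not the actual dpr-scale code):

```python
import torch

def _eval_epoch_end(self, outputs):
    # Hypothetical placeholder for the existing metric computation.
    metrics = self._compute_metrics(outputs)

    # Cast every logged value to a dense CUDA tensor on this rank's device,
    # since torch.distributed.all_reduce (triggered by sync_dist=True)
    # requires CUDA, dense tensors.
    metrics = {
        k: torch.as_tensor(v, dtype=torch.float32, device=self.device)
        for k, v in metrics.items()
    }
    self.log_dict(metrics, on_epoch=True, sync_dist=True)
```

Would something along these lines be reasonable, or is the problem specific to how DeepSpeed handles the tensors?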

My current hyperparameters for multi-GPU training are:

accumulate_grad_batches: 1
plugins: deepspeed
accelerator: ddp
precision: 16

Thanks in advance.

ccsasuke commented 2 years ago

Unfortunately we don't have experience with deepspeed. Are you experiencing the same error without it?

roddar92 commented 2 years ago

Unfortunately, I get an OOM exception with the other sharded option (ddp_sharded).

ccsasuke commented 2 years ago

@roddar92 Could you provide a little more detail? Assuming it's a CUDA OOM, can it be solved by using smaller batch sizes? Do you see the same error without ddp_sharded (using the default ddp)?

roddar92 commented 2 years ago

Well, if I decrease the batch size, I don't see this error; I use the same workaround on a single GPU. On the other hand, the quality of the trained models is poor. In particular, top-5 accuracy is no higher than 10%.
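
As a next step, I may try activation checkpointing on the encoders so that a larger per-GPU batch (and therefore more in-batch negatives) fits in memory. A minimal sketch, assuming a HuggingFace transformers backbone and a recent transformers release (the model name is just an example):

```python
from transformers import AutoModel

# Hypothetical mitigation: gradient (activation) checkpointing trades extra
# compute for memory, which may allow a larger per-GPU batch without OOM.
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
query_encoder.gradient_checkpointing_enable()  # available in transformers >= 4.11
```

This wouldn't change the loss itself, only how large a batch fits on each GPU.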