Result of training is either no output or Broken pipe error

Hi. I need help understanding why some runs of the training script produce no output at all, while others fail with a broken pipe error. Both kinds of result are shown below. If more information about how I set up the training pipeline is needed, I am happy to provide it. Thanks :)

The two scenarios:

1. The script appears to run without error (the .err file is empty), but no training output is produced.

2. The .err file contains a Broken pipe error:

/home/dinov2/dinov2/layers/swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
  warnings.warn("xFormers is available (SwiGLU)")
/home/dinov2/dinov2/layers/attention.py:27: UserWarning: xFormers is available (Attention)
  warnings.warn("xFormers is available (Attention)")
/home/dinov2/dinov2/layers/block.py:33: UserWarning: xFormers is available (Block)
  warnings.warn("xFormers is available (Block)")
submitit ERROR (2024-04-03 06:30:10,782) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/opt/conda/lib/python3.10/site-packages/submitit/core/submission.py", line 76, in submitit_main
    process_job(args.folder)
  File "/opt/conda/lib/python3.10/site-packages/submitit/core/submission.py", line 69, in process_job
    raise error
  File "/opt/conda/lib/python3.10/site-packages/submitit/core/submission.py", line 55, in process_job
    result = delayed.result()
  File "/opt/conda/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/home/dinov2/./dinov2/run/train/train.py", line 26, in __call__
    train_main(self.args)
  File "/home/dinov2/dinov2/train/train.py", line 298, in main
    cfg = setup(args)
  File "/home/dinov2/dinov2/utils/config.py", line 69, in setup
    default_setup(args)
  File "/home/dinov2/dinov2/utils/config.py", line 50, in default_setup
    distributed.enable(overwrite=True)
  File "/home/dinov2/dinov2/distributed/__init__.py", line 264, in enable
    dist.init_process_group(backend="nccl")
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 446, in _store_based_barrier
    worker_count = store.add(store_key, 0)
RuntimeError: Broken pipe
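
The traceback ends inside `dist.init_process_group(backend="nccl")`, at the store barrier, so the failure happens while the processes are still trying to rendezvous rather than during training itself. In case it helps narrow things down, here is a minimal sketch of a check that could be run inside the same job allocation. It assumes the default `env://` rendezvous, where `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE` are read from the environment. The helper names `rendezvous_env` and `can_reach` are hypothetical and not part of dinov2, submitit, or PyTorch, and the script uses only the standard library.

```python
import os
import socket

# Environment variables read by the env:// rendezvous (assumed setup).
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")


def rendezvous_env(environ=None):
    """Return the rendezvous-related variables, with None for any that are unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in RENDEZVOUS_VARS}


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: refused, unreachable, timed out; ValueError: port is not a number.
        return False


if __name__ == "__main__":
    env = rendezvous_env()
    for name, value in env.items():
        print(f"{name}={value!r}")
    if env["MASTER_ADDR"] and env["MASTER_PORT"]:
        # Only meaningful from a non-zero rank while rank 0 is already listening.
        print("master reachable:", can_reach(env["MASTER_ADDR"], env["MASTER_PORT"]))
```

If the variables are missing or differ between ranks, or the master port is not reachable from a worker node, that would point at the job or network setup rather than the training code. This check cannot confirm the cause of the broken pipe on its own.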
