Hi. I need help understanding why some runs of the training script produce no output at all, while others fail with a broken pipe error. Both types of result are shown below. If more information about how I set up the training pipeline is needed, I am happy to provide it. Thanks :)
The 2 scenarios:
1. The script appears to run without error (the .err file is empty), but no training output is produced.
2. The .err file contains a broken pipe error:
/home/dinov2/dinov2/layers/swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
warnings.warn("xFormers is available (SwiGLU)")
/home/dinov2/dinov2/layers/attention.py:27: UserWarning: xFormers is available (Attention)
warnings.warn("xFormers is available (Attention)")
/home/dinov2/dinov2/layers/block.py:33: UserWarning: xFormers is available (Block)
warnings.warn("xFormers is available (Block)")
submitit ERROR (2024-04-03 06:30:10,782) - Submitted job triggered an exception
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/opt/conda/lib/python3.10/site-packages/submitit/core/submission.py", line 76, in submitit_main
process_job(args.folder)
File "/opt/conda/lib/python3.10/site-packages/submitit/core/submission.py", line 69, in process_job
raise error
File "/opt/conda/lib/python3.10/site-packages/submitit/core/submission.py", line 55, in process_job
result = delayed.result()
File "/opt/conda/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/home/dinov2/./dinov2/run/train/train.py", line 26, in __call__
train_main(self.args)
File "/home/dinov2/dinov2/train/train.py", line 298, in main
cfg = setup(args)
File "/home/dinov2/dinov2/utils/config.py", line 69, in setup
default_setup(args)
File "/home/dinov2/dinov2/utils/config.py", line 50, in default_setup
distributed.enable(overwrite=True)
File "/home/dinov2/dinov2/distributed/__init__.py", line 264, in enable
dist.init_process_group(backend="nccl")
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 446, in _store_based_barrier
worker_count = store.add(store_key, 0)
RuntimeError: Broken pipe
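Since the traceback ends inside the store barrier of `init_process_group`, one thing worth ruling out is whether each node can reach the rendezvous address at all. Below is a minimal sketch of such a check. It assumes the job uses the standard `MASTER_ADDR` / `MASTER_PORT` / `RANK` / `WORLD_SIZE` environment variables (not confirmed for this dinov2 + submitit setup), and `check_rendezvous` is a hypothetical helper name, not part of dinov2, submitit, or PyTorch.

```python
import os
import socket


def check_rendezvous(env=None, timeout=5.0):
    """Hypothetical helper: try a plain TCP connection to MASTER_ADDR:MASTER_PORT."""
    # Assumption: the rendezvous is configured through these environment variables.
    env = os.environ if env is None else env
    addr = env.get("MASTER_ADDR")
    port = env.get("MASTER_PORT")
    report = {
        "host": socket.gethostname(),
        "master_addr": addr,
        "master_port": port,
        "rank": env.get("RANK"),
        "world_size": env.get("WORLD_SIZE"),
        "reachable": None,  # None means the check could not be attempted
        "error": None,
    }
    if not addr or not port:
        report["error"] = "MASTER_ADDR or MASTER_PORT is not set"
        return report
    try:
        # Open and immediately close a TCP connection; no data is sent.
        with socket.create_connection((addr, int(port)), timeout=timeout):
            report["reachable"] = True
    except (OSError, ValueError) as exc:
        report["reachable"] = False
        report["error"] = repr(exc)
    return report


if __name__ == "__main__":
    print(check_rendezvous())
```

The result is only meaningful while something is listening on the master port (normally the rank 0 process), so a refused connection before the job has started is expected. If the variables differ between nodes, or a worker cannot connect while rank 0 is up, that would point to a networking or configuration problem rather than to the training code.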