First of all, thank you for your great work! However, when I tried to reproduce it, I got the following error:
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
thread panicked while processing panic. aborting.
thread panicked while processing panic. aborting.
thread panicked while processing panic. aborting.
thread panicked while processing panic. aborting.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 16031 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 16032) of binary: /HOME/scw7294/.conda/envs/au/bin/python
Traceback (most recent call last):
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_retrieval.py FAILED
Failures:
[1]:
time : 2023-11-20_21:03:20
host : g0210.para.ai
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 16033)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 16033
[2]:
time : 2023-11-20_21:03:20
host : g0210.para.ai
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 16034)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 16034
Root Cause (first observed failure):
[0]:
time : 2023-11-20_21:03:20
host : g0210.para.ai
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 16032)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 16032
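For reference, the `exitcode : -6` entries above follow the Python `multiprocessing` convention, where a negative value -N means the child was terminated by signal N. A minimal sketch of decoding such a value with the standard `signal` module is below; the helper name `describe_exit_code` is hypothetical and is not part of PyTorch or of this repository.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Return a readable description of a multiprocessing-style exit code.

    Hypothetical helper for illustration only.
    """
    if exitcode < 0:
        # A negative exit code -N means the process was killed by signal N.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # Decode the exit code reported for every failed rank in the log above.
    print(describe_exit_code(-6))
```

This only confirms that the workers aborted themselves (signal 6) rather than being killed by the launcher; it does not say which library triggered the abort.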
My environment is as follows (everything meets the requirements in readme.txt):
(au) [scw7294@ln01 Aurora]$ pip list
Package             Version
adapter             0.1
adapters            0.0.0.dev20231116
av                  11.0.0
certifi             2023.11.17
charset-normalizer  3.3.2
einops              0.7.0
fairscale           0.4.13
filelock            3.13.1
fsspec              2023.10.0
huggingface-hub     0.19.4
idna                3.4
numpy               1.24.4
packaging           23.2
Pillow              10.0.1
pip                 23.3
PyYAML              6.0.1
regex               2023.10.3
requests            2.31.0
ruamel.yaml         0.18.5
ruamel.yaml.clib    0.2.8
safetensors         0.4.0
setuptools          68.0.0
timm                0.9.10
tokenizers          0.13.3
torch               1.13.0+cu117
torchvision         0.14.0+cu117
tqdm                4.66.1
transformers        4.33.3
typing_extensions   4.8.0
urllib3             2.1.0
wheel               0.41.2

Since there is no explicit error message, I cannot tell which part of the environment is at fault. I am running on a single machine with 8 GPUs (the log above is from a 4-GPU run, but the 8-GPU run fails with the same error). Could you tell me what configuration you used, and whether there are more specific requirements for the environment?
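To make the environments easier to compare, here is a minimal sketch that prints the Python version and the versions of a few packages using only the standard library. The `PACKAGES` selection and the function names are my own assumptions for illustration; they are not taken from the repository's readme.txt.

```python
import platform
from importlib import metadata

# Hypothetical selection of packages to report; adjust as needed.
PACKAGES = [
    "torch", "torchvision", "transformers", "tokenizers",
    "safetensors", "timm", "fairscale", "numpy",
]


def collect_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render the collected versions as plain 'name version' lines."""
    lines = [f"python {platform.python_version()}"]
    for name, version in versions.items():
        lines.append(f"{name} {version or 'not installed'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(collect_versions(PACKAGES)))
```

Running the same script in a known-good environment and in mine would show any version differences side by side.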