WillDreamer / Aurora

[NeurIPS2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
https://arxiv.org/abs/2305.08381

About the runtime environment #8

Closed: maoshanwen closed this issue 9 months ago

maoshanwen commented 10 months ago

First of all, thank you for your great work! But when I tried to reproduce it, I got the following WARNING:__main__: message:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
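This warning by itself seems harmless: torch.distributed.run pins OMP_NUM_THREADS to 1 in each worker unless the variable is already set, so it can be tuned at launch time, for example like this (4 is an illustrative value, and the launch flags are assumed, not taken from the repo):

    OMP_NUM_THREADS=4 python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py

The actual failure is the SIGABRT below: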


thread panicked while processing panic. aborting.
thread panicked while processing panic. aborting.
thread panicked while processing panic. aborting.
thread panicked while processing panic. aborting.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 16031 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 16032) of binary: /HOME/scw7294/.conda/envs/au/bin/python
Traceback (most recent call last):
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/HOME/scw7294/.conda/envs/au/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_retrieval.py FAILED

Failures:
[1]:
  time      : 2023-11-20_21:03:20
  host      : g0210.para.ai
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 16033)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 16033
[2]:
  time      : 2023-11-20_21:03:20
  host      : g0210.para.ai
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 16034)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 16034

Root Cause (first observed failure):
[0]:
  time      : 2023-11-20_21:03:20
  host      : g0210.para.ai
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 16032)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 16032

My environment is as follows (all of it meets the requirements in readme.txt):

(au) [scw7294@ln01 Aurora]$ pip list
Package            Version
------------------ -----------------
adapter            0.1
adapters           0.0.0.dev20231116
av                 11.0.0
certifi            2023.11.17
charset-normalizer 3.3.2
einops             0.7.0
fairscale          0.4.13
filelock           3.13.1
fsspec             2023.10.0
huggingface-hub    0.19.4
idna               3.4
numpy              1.24.4
packaging          23.2
Pillow             10.0.1
pip                23.3
PyYAML             6.0.1
regex              2023.10.3
requests           2.31.0
ruamel.yaml        0.18.5
ruamel.yaml.clib   0.2.8
safetensors        0.4.0
setuptools         68.0.0
timm               0.9.10
tokenizers         0.13.3
torch              1.13.0+cu117
torchvision        0.14.0+cu117
tqdm               4.66.1
transformers       4.33.3
typing_extensions  4.8.0
urllib3            2.1.0
wheel              0.41.2

Since there is no explicit error message, I can't tell which part of the environment is the problem. I'm running on a single machine with 8 GPUs (the log above is from a 4-GPU run, but the 8-GPU error is identical). May I ask what configuration you used, and whether the environment has any more specific requirements?
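Since exit code -6 means the workers died from a SIGABRT in native code rather than a Python exception, one generic way to surface more detail (standard PyTorch/NCCL debug switches, nothing Aurora-specific; the launch flags are assumed) is to rerun with verbose logging:

    # generic debug switches; append your usual training arguments
    NCCL_DEBUG=INFO \
    TORCH_DISTRIBUTED_DEBUG=DETAIL \
    TORCH_SHOW_CPP_STACKTRACES=1 \
    python -m torch.distributed.run --nproc_per_node=4 train_retrieval.py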

xinlong-yang commented 9 months ago

I've run into a similar bug. Could it be that you're launching with nohup .... & ? Using nohup together with PyTorch DDP can produce bugs like this, where the processes terminate on their own, so I'd suggest running inside tmux, or simply launching from a terminal and waiting there.
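For example, a minimal tmux workflow (standard tmux usage, with a placeholder for the actual training command) looks like this:

    tmux new -s aurora          # start a named session
    # inside the session, run your usual launch command, e.g.:
    # python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py
    # detach with Ctrl-b d; the processes keep running
    tmux attach -t aurora       # re-attach later to check progress

Unlike nohup, tmux keeps the job attached to a live pseudo-terminal, which avoids the kind of premature termination described above.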

maoshanwen commented 9 months ago

Thanks, I solved this a while ago. It seems one of the package versions was too high; I switched from the 4090 to a 3090 and downgraded the versions, and that fixed it.
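A plausible explanation for why this helped: the RTX 4090 is an Ada GPU (compute capability sm_89), and CUDA 11.8 was the first toolkit release to support that architecture, so a torch 1.13.0+cu117 wheel ships no sm_89 kernels, while the 3090 (sm_86) is fully covered. A quick way to check a build against a GPU (plain PyTorch introspection, just a sketch):

    python -c "import torch; print(torch.__version__, torch.version.cuda)"   # wheel + CUDA version
    python -c "import torch; print(torch.cuda.get_device_capability(0))"     # (8, 9) on a 4090, (8, 6) on a 3090
    python -c "import torch; print(torch.cuda.get_arch_list())"              # kernel archs compiled into the wheel

If the device's capability is missing from the arch list, you need a PyTorch build for a newer CUDA toolkit (or an older GPU, as here).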