ailab-prompt-transfer / TextBox

Implementation of PTG
https://github.com/RUCAIBox/TextBox
MIT License

torch.distributed.elastic.multiprocessing.api:failed #5

Closed: minji-o-j closed this issue 1 year ago

minji-o-j commented 1 year ago

Command:

accelerate launch run_textbox.py --model=PTG --dataset=pc --model_path=facebook/bart-large --gpu_id=0,1,2 --find_unused_parameters=true --source_task=cross_dataset2

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 616 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 614) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/launch.py", line 950, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/launch.py", line 642, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

minji-o-j commented 1 year ago

This happens when two or more 3-GPU runs are active on the same server at the same time (reportedly because of memory).
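For anyone hitting the same log, a small diagnostic sketch may help narrow this down. It relies on the usual Python convention that a negative exit code -N means the child process was killed by signal N, so `exitcode: -7` maps to signal 7 (SIGBUS on Linux). Checking shared memory is an assumption on my part: the thread only says "memory", and `/dev/shm` is one common suspect when several multi-GPU jobs share a host. The helper names `describe_exit_code` and `free_gib` are hypothetical and not part of TextBox, accelerate, or PyTorch.

```python
import os
import shutil
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a child-process exit code into a readable label.

    By convention, a negative exit code -N means the process was
    terminated by signal N rather than exiting on its own.
    """
    if exitcode >= 0:
        return f"exited with status {exitcode}"
    try:
        name = signal.Signals(-exitcode).name
    except ValueError:
        # Signal number not defined on this platform.
        name = f"unknown signal {-exitcode}"
    return f"killed by {name}"


def free_gib(path: str) -> float:
    """Return the free space under `path` in GiB."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # The exit code reported in the log above.
    print(describe_exit_code(-7))

    # Assumption: shared memory may be the "memory" in question.
    shm = "/dev/shm"
    if os.path.exists(shm):
        print(f"{shm} free: {free_gib(shm):.2f} GiB")
```

Running this before and while a second job is active on the same server would show whether shared memory is actually running low. If it is not, host RAM or GPU memory would be the next things to check.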