SamsungLabs / MLI

Novel View Synthesis with multiplane/multilayer representation: CVPR2022, WACV2023

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7347) of binary #18

Open bearatom opened 1 year ago

bearatom commented 1 year ago

Hello,

When I run this command:

torchrun --standalone --nproc_per_node=1 bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs

The following error appears:

(train) clientadmin@clientadmin-Precision-3660:~/likang/MPI/MLI_new$ torchrun --standalone --nproc_per_node=1 bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs
2023-11-10 16:19:31.813855: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Logging directory: train_outputs/outputs/tblock4_train_1gpu/log
20it [00:00, 2088.12it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7347) of binary: /home/clientadmin/anaconda3/envs/train/bin/python
Traceback (most recent call last):
  File "/home/clientadmin/anaconda3/envs/train/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.0', 'console_scripts', 'torchrun')())
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
bin/train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-10_16:19:35
  host      : clientadmin-Precision-3660
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7347)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

My PyTorch build is pytorch==1.10.0 (py3.8_cuda11.3_cudnn8.2.0_0), and my CUDA version is 11.3.

Could you help me out? Thank you very much.
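The log above only reports that the child process exited with code 1; it does not show the exception raised inside bin/train.py. One way to surface it might be to run the script as a plain single process, with the environment variables that torchrun normally sets filled in by hand. The sketch below is only that, a sketch: the helper names `single_process_env` and `run_directly` are hypothetical, the port 29500 is an arbitrary choice, and it assumes bin/train.py reads the usual `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` variables, which has not been checked against the repository.

```python
import os
import subprocess
import sys


def single_process_env(base=None, port=29500):
    """Copy an environment and add the variables torchrun would set
    for a single local worker (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    env.update({
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),  # arbitrary free port, an assumption
    })
    return env


def run_directly(script, script_args, env=None):
    """Run the script with the current interpreter, bypassing torchrun,
    and return (exit code, captured stderr)."""
    proc = subprocess.run(
        [sys.executable, script, *script_args],
        env=single_process_env(env),
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    # Same script and arguments as the torchrun command in this report.
    code, err = run_directly(
        "bin/train.py",
        ["--config", "configs/tblock4_train.yaml",
         "--output-path", "train_outputs"],
    )
    print("exit code:", code)
    print(err)
```

If the script fails the same way, the captured stderr should contain the real Python traceback rather than only the ChildFailedError summary. The URL printed in the log describes the other route: decorating the training entry function with `record` from `torch.distributed.elastic.multiprocessing.errors`.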