When I run this command:
torchrun --standalone --nproc_per_node=1 \bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs
The following error appears:
(train) clientadmin@clientadmin-Precision-3660:~/likang/MPI/MLI_new$ torchrun --standalone --nproc_per_node=1 \bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs
2023-11-10 16:19:31.813855: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Logging directory: train_outputs/outputs/tblock4_train_1gpu/log
20it [00:00, 2088.12it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7347) of binary: /home/clientadmin/anaconda3/envs/train/bin/python
Traceback (most recent call last):
  File "/home/clientadmin/anaconda3/envs/train/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.0', 'console_scripts', 'torchrun')())
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
bin/train.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-10_16:19:35
host : clientadmin-Precision-3660
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7347)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
My environment: PyTorch 1.10.0 (build py3.8_cuda11.3_cudnn8.2.0_0) with CUDA 11.3.
Could you help me out?
Thank you very much.