Open lijiek opened 6 months ago
sh train.sh
ubun:2375340:2375340 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375340:2375340 [0] NCCL INFO NET/IB : No device found.
ubun:2375340:2375340 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
ubun:2375341:2375341 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375341:2375341 [0] NCCL INFO NET/IB : No device found.
ubun:2375341:2375341 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO Using network Socket

ubun:2375341:2375387 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
ubun:2375341:2375387 [0] NCCL INFO init.cc:840 -> 5
ubun:2375341:2375387 [0] NCCL INFO group.cc:73 -> 5 [Async thread]

ubun:2375340:2375386 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
ubun:2375340:2375386 [0] NCCL INFO init.cc:840 -> 5
ubun:2375340:2375386 [0] NCCL INFO group.cc:73 -> 5 [Async thread]
Traceback (most recent call last):
  File "runner.py", line 70, in <module>
Traceback (most recent call last):
  File "runner.py", line 70, in <module>
    main()
  File "runner.py", line 50, in main
    main()
  File "runner.py", line 50, in main
    init_distributed_mode()
  File "runner.py", line 31, in init_distributed_mode
    init_distributed_mode()
  File "runner.py", line 31, in init_distributed_mode
    torch.distributed.init_process_group(backend='nccl')
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    torch.distributed.init_process_group(backend='nccl')
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    barrier()
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/lrgt/bin/python', '-u', 'runner.py', '--local_rank=1']' returned non-zero exit status 1.
Hi, do you use one GPU or multiple GPUs?
Thanks so much for your quick reply. I reinstalled PyTorch to match the newer CUDA 12.2 version, and it works now. I only use a single GPU. Do you think there will be any problems with a single GPU?
We have run it successfully on a single GPU. You need to change the corresponding parameters in the script (we describe this in README.md). By the way, which version of PyTorch is in your environment?
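The "Duplicate GPU detected" warning in the original log says that rank 0 and rank 1 were both placed on the same CUDA device, which is what happens when more processes are launched than there are GPUs. A minimal sketch of a pre-launch check follows. `check_launch` and `local_device` are hypothetical names, not part of the repository, and in a real script the GPU count would come from `torch.cuda.device_count()`.

```python
def check_launch(world_size: int, gpu_count: int) -> None:
    """Raise if more processes are requested than GPUs are visible.

    Hypothetical helper: NCCL reports "Duplicate GPU detected" when two
    ranks share one device, so world_size must not exceed gpu_count.
    """
    if gpu_count < 1:
        raise RuntimeError("no CUDA device visible")
    if world_size > gpu_count:
        raise RuntimeError(
            f"{world_size} processes requested but only {gpu_count} GPU(s) "
            "visible; two ranks would share one device"
        )


def local_device(local_rank: int, gpu_count: int) -> int:
    """Return the device index a rank should use (one device per rank)."""
    if not 0 <= local_rank < gpu_count:
        raise ValueError(f"local_rank {local_rank} has no GPU of its own")
    return local_rank
```

On a single-GPU machine, `check_launch(2, 1)` raises while `check_launch(1, 1)` passes. If `train.sh` starts the job through `torch.distributed.launch`, as the traceback suggests, the process count is set with its `--nproc_per_node` option.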
Sorry for the late reply; my computer's disk crashed. The PyTorch version in my environment is 2.2.2.
Hi, your PyTorch version seems to be too new. We recommend using PyTorch 1.7.1.
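One way to catch this kind of mismatch before training is to compare the installed version string with the recommended one. The sketch below is only an assumption about how such a check could look; `version_tuple` and `is_recommended` are hypothetical helpers, and in practice the string would come from `torch.__version__`.

```python
import re

RECOMMENDED = "1.7.1"  # version the maintainers recommend in this thread


def version_tuple(version: str) -> tuple:
    """Turn '1.7.1+cu110' into (1, 7, 1), ignoring any local build suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component (e.g. '2.2') is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_recommended(version: str, recommended: str = RECOMMENDED) -> bool:
    """True when major.minor.patch equals the recommended release."""
    return version_tuple(version) == version_tuple(recommended)
```

With the versions mentioned here, `is_recommended("2.2.2")` is false and `is_recommended("1.7.1")` is true.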
OK, I will try PyTorch 1.7.1. Thanks so much for your comments.
What configuration issue causes the error below? The first sample reported as having no volume file is shown below, but it can be found in the ShapeNetVox32 and ShapeNetRendering directories.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/d608bdcd8a87f3af7d2dc2b4ad06dc44 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/aac813074775281a4163d08524f89006 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/6737f75bb87e3cc0847c4e55bb965ab0 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/27e9a27b6e3b49c227ac692756be9e24 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/a5fa8ae8f743d5498052128bafa4f7d8 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/5456c7546e3f3c3d9c5408f4f799fe72 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/7b602de66f5eff247991cd6455da4fb3 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/e3923f2d2fc2d1d39263b5578aef09fa since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/8497e02fa1662113776d8bc79b9caa2c since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/f5a0bce67dca5ccbe3de75b155d3b403 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/39875815bdfdce2193b1b9ed21f1fb92 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/ff2b8253ca3190d5d65fb76f5f0a1db7 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/5e8ce498a93fb7eae1a9c234926c21e2 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/e49935adf322de2f77e672c4996ec4a3 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/5b86640d3bc2e43decac3f40526a2cc2 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/a431fbb7e58ef0c46c03c11657c96c60 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/8b335b0be37debefd85e5191b992b560 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/5bc8a432a5911a4c14621506c22882a0 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/2dbe5ea82a45443b71f3cc81eb6c076e since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/1522b8c3c28a9d57ace571be2585c620 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/14241942d79f89226587cb13c78fb9b since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/80d381a6760185d8c45977b13fbe7645 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/313aaf9d79105fea82fd5ed7e39258c7 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/eadf347da5aa8877da97052ff1f36504 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/7400be7b247ce021be99fd8a5f540d8f since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/954c459bc6762abc24f2ecb72410a6d9 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/ddd02c6fb780d4f6c683d3e7114aaa37 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/358bac5415f0773941d6cd38228b9631 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/5d301491ba435b71257fc1c453f165b6 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/a70c472cef0c354dba2abf2ecc57eeda since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/5e9890e2f4cab96dbc6cd96a5e6546c since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/d74ea3567f8861dc182929c56117755a since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/52fb0261b905147d2fe023c7dc3e5231 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/296f0b6a15012e33d87f29c9afcc633e since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/4d8ae6327ab4ed301e66f1783a4812d7 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/8877086211c9976cd27beaa6c9701d39 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/6cc258399daf767ae1942965937d3cef since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/147fd27a40793d7e9bbe4f1047e9e5fd since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/70dcc0d460cacce0e63ec060b551ac57 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/bb248dedd82b2f28deed0e4a55ad5dd6 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/4e6264af2f2c3e135a15c264bb25007a since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/320683da0d8bd1f394a6427195fa0bd9 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/ad8623ad47d72317eda0f8d4b3ce03d since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/3216e49e5dd304956bed41d0253513f3 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/e6a9f9135e36b6c17c0ab7347b9e831a since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/5d48d75153eb221b476c772fd813166d since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/c189e1768a4f291d4de203ef6842ee61 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/a2f46716962afe72b106d5ef46e12c19 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/9625f56fdbada3377220891f188bc420 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/20d1090d07a49fe927ac692756be9e24 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/6e1781a84b5dbda6fb3e64e796c0391a since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/3b8c16861667747fcfea3d4fc15719ea since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/e88c7403ff401716b7002bddf0942f8e since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/408cdd476e8bb202852ae095a967f0ca since volume file not exists.
[WARNING] 2024-05-17 14:37:53,858 Ignore sample 04530566/cce41dda51ef0335a413908c0e169330 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/2b17c0705ee0426e53b2b4f48361e0b2 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/cdd493581ed137c5a6dae8586082d789 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/29af666e04825f66576378847ca0b69 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/6a0da87e5858b4373e45dd2f173cbf9b since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/bbf1507f7126733665224ccd01ad35d4 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/7a1754186937d247ad78ed9a26ab1091 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/799d446d33e604f990f7927ebadab5fc since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/bf0084fbcc74d5632754043d4b10740c since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/e1e3c053f4b4f1405e45696ec6d1a105 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/8c3148ee031b15446e0dbba30ac27e8 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/7eedbeaa5216ff06ccd600f441988364 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/9119eaa9b5996cd3b1bb46d2556ba67d since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/ee09bd0664e0a02292b9fcc49a614e2b since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/5a0ca7a6df97542e69c3c818538357ad since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/82138919402b3b8f642f9e27aaf0c47a since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/1356fcf0ea4a95bcbe7ca2216dc1576a since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/201fdafa7287d2fe8a55122197709269 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/133c9fa2562498d28ae10bd53dffee76 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/f8d3af1d1a20123c249ba97ee36ba54 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/7b568afff918289614621506c22882a0 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/3899f8b55ec9f1db8a1ec28cb7d97871 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/5eb2d085267142f26192896700aa3bd4 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/65f78142a6c33a89ea7dce1646d86149 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/e64bb04b54022e708d7bd537eb907025 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/2571a0b3d1eb9280f26f17fb5c4740a9 since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/fd52c80ae21d19251e0d0f6bac6856eb since volume file not exists.
[WARNING] 2024-05-17 14:37:53,859 Ignore sample 04530566/a9737969ac039c9323dfd33205b06c1a since volume file not exists.
I think the missing volume files above lead to the following error. Is that right? How can I solve it? Thanks.
Merger: Similar Token Merger (STM)
[INFO] 2024-05-17 14:37:54,622 Parameters in Encoder: 85892352.
[INFO] 2024-05-17 14:37:54,622 Parameters in Decoder: 29798625.
[INFO] 2024-05-17 14:37:54,623 Parameters in Merger: 14172674.
Setting sync_batchnorm ...
/home/ubuntu/anaconda3/envs/lrgt_env/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
[INFO] 2024-05-17 14:37:55,954 [Epoch 1/110] EpochTime = 0.611 (s) Loss = 0.0000
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "runner.py", line 70, in <module>
CPU: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/build/aten/src/ATen/RegisterCPU.cpp:18433 [kernel]
CUDA: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/build/aten/src/ATen/RegisterCUDA.cpp:26493 [kernel]
QuantizedCPU: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/build/aten/src/ATen/RegisterQuantizedCPU.cpp:1068 [kernel]
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ADInplaceOrView: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradCPU: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradCUDA: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradXLA: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradLazy: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradXPU: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradMLC: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradHPU: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradNestedTensor: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradPrivateUse1: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradPrivateUse2: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
AutogradPrivateUse3: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/VariableType_3.cpp:10215 [autograd kernel]
Tracer: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/torch/csrc/autograd/generated/TraceType_3.cpp:11593 [kernel]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/autocast_mode.cpp:466 [backend fallback]
Autocast: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1640811792945/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Failures:
Hi, you may need to check the path of your dataset.
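The log shows `0it` and `Loss = 0.0000` for the first epoch, which suggests that every sample was skipped and the dataset ended up empty, so it helps to confirm that the configured path really resolves to a file. The sketch below assumes, hypothetically, that the volume path is built from a `%s/%s` template ending in `model.binvox`. The template, the root directory, and the function name are illustrative and should be replaced with the values from your own config.

```python
from pathlib import Path


def missing_volumes(path_template: str, samples):
    """Return the (taxonomy_id, sample_id) pairs whose volume file is absent.

    path_template is a hypothetical '%s/%s' pattern such as
    '/data/ShapeNetVox32/%s/%s/model.binvox'; adjust it to your config.
    """
    missing = []
    for taxonomy_id, sample_id in samples:
        volume_path = Path(path_template % (taxonomy_id, sample_id))
        if not volume_path.is_file():
            missing.append((taxonomy_id, sample_id))
    return missing
```

Calling it with one of the reported pairs, for example `("04530566", "d608bdcd8a87f3af7d2dc2b4ad06dc44")`, shows whether the path the config produces matches where the file actually sits on disk. An empty result means the template is correct for those samples.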