I'm training on 3 A100 GPUs with 40 GB of memory each. Training never actually starts: after the dataset loads, local rank 1 dies with a segmentation fault (SIGSEGV) and the elastic agent tears down the other two workers. What could be the issue? I've included the full error report below.
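(As a point of reference, and not part of the log below, a quick check along these lines confirms PyTorch sees all three devices, each reporting compute capability 8.0 and roughly 40 GB:)

    # Illustrative sanity check, not from the failing run: list the GPUs
    # PyTorch can see, with compute capability and memory.
    import torch

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(i, props.name, f"cc {props.major}.{props.minor}",
              f"{props.total_memory / 2**30:.0f} GiB")

Full log: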
DATA_PATH : SHOG2TA2CKZ7ERNU
COLMAP_PATH : /usr/local/bin/colmap
CONFIG_PATH : /nvdiffrecmc/configs/manual/shoe.json
NUMBER OF GPUS: 3
TRAINING STARTED..
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 3
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_fjb7yrke/none_gz0e2q6p
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.8
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module optixutils_plugin...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Config / Flags:
---------
iter 500
batch 4
spp 1
layers 1
train_res [2048, 2048]
display_res [2048, 2048]
texture_res [2048, 2048]
display_interval 0
save_interval 100
learning_rate [0.03, 0.005]
custom_mip False
background white
loss logl1
out_dir out/SHOG2TA2CKZ7ERNU
config /nvdiffrecmc/configs/manual/shoe.json
ref_mesh SHOG2TA2CKZ7ERNU
base_mesh None
validate True
n_samples 12
bsdf pbr
denoiser bilateral
denoiser_demodulate True
save_custom 3D/vertical/footwear
vertical Footwear
mtl_override None
dmtet_grid 128
mesh_scale 2.5
envlight None
env_scale 1.0
probe_res 256
learn_lighting True
display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
transparency False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000.0
pre_load True
no_perturbed_nrm False
decorrelated False
kd_min [0.03, 0.03, 0.03]
kd_max [0.8, 0.8, 0.8]
ks_min [0, 0.08, 0]
ks_max [0, 1, 1]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
clip_max_norm 0.0
cam_near_far [0.1, 1000.0]
lambda_kd 0.1
lambda_ks 0.05
lambda_nrm 0.025
lambda_nrm2 0.25
lambda_chroma 0.025
lambda_diffuse 0.15
lambda_specular 0.0025
local_rank 0
multi_gpu True
random_textures True
---------
DatasetLLFF: 92 images with shape [1080, 1920]
DatasetLLFF: auto-centering at [ 0.24934715 0.38134477 -0.13031025]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5527 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5529 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5527 via 15, forcefully exitting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5529 via 15, forcefully exitting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 5528) of binary: /opt/conda/bin/python3.8
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007698535919189453 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-16_11:59:10
host : fea5089b80d7
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 5528)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 5528
======================================================
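The last INFO line before the traceback says to decorate the entrypoint with @record to get a per-rank error file. I plan to retry with something like the sketch below; I'm assuming train.py exposes a main()-style entry point, but the decorator itself is the documented torch.distributed.elastic API (see the URL in the log). I realize a hard SIGSEGV inside a native extension may kill the process before @record can write anything, so this may still come back empty:

    # Sketch: wrap train.py's entry point with @record so the elastic agent
    # can write the failing rank's Python traceback to an error file instead
    # of just reporting "Signal 11 (SIGSEGV)". main() stands in for whatever
    # train.py's real entry point is called.
    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main():
        ...  # existing nvdiffrecmc training logic

    if __name__ == "__main__":
        main()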
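One more data point from the log: tinycudann warns it was built for compute capability 70 while the A100s report 80. Is that purely a performance warning, or could the mismatch be behind the segfault on rank 1? If it might matter, I can try rebuilding tiny-cuda-nn targeting sm_80 (via the TCNN_CUDA_ARCHITECTURES build variable, if I'm reading the tiny-cuda-nn README correctly). Any pointers on why training dies right after dataset loading on 3 GPUs would be appreciated.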