NVlabs / nvdiffrecmc

Official code for the NeurIPS 2022 paper "Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising".

Training exiting suddenly #21

Open iraj465 opened 1 year ago

iraj465 commented 1 year ago

I'm training on 3 A100 GPUs with 40 GB of memory each. Training never actually starts; what could be the issue? I've included the error report below.

DATA_PATH :  SHOG2TA2CKZ7ERNU                                                                                                                          
COLMAP_PATH :  /usr/local/bin/colmap                                                                                                                   
CONFIG_PATH :  /nvdiffrecmc/configs/manual/shoe.json                                                                                                   
NUMBER OF GPUS:  3                                                                                                                                     
TRAINING STARTED..                                                                                                                                     
WARNING:torch.distributed.run:                                                                                                                         
*****************************************                                                                                                              
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************                                                                                                              
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:                                                                     
  entrypoint       : train.py                                                                                                                          
  min_nodes        : 1                                                                                                                                 
  max_nodes        : 1                                                                                                                                 
  nproc_per_node   : 3                                                                                                                                 
  run_id           : none                                                                                                                              
  rdzv_backend     : static                                                                                                                            
  rdzv_endpoint    : 127.0.0.1:29500                                                                                                                   
  rdzv_configs     : {'rank': 0, 'timeout': 900}                                                                                                       
  max_restarts     : 0                                                                                                                                 
  monitor_interval : 5                                                                                                                                 
  log_dir          : None                                                                                                                              
  metrics_cfg      : {}                                                                                                                                

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_fjb7yrke/none_gz0e2q6p                         
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.8                                                   
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group                                                                  
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:                                                     
  restart_count=0                                                                                                                                      
  master_addr=127.0.0.1                                                                                                                                
  master_port=29500                                                                                                                                    
  group_rank=0                                                                                                                                         
  group_world_size=1                                                                                                                                   
  local_ranks=[0, 1, 2]                                                                                                                                
  role_ranks=[0, 1, 2]                                                                                                                                 
  global_ranks=[0, 1, 2]                                                                                                                               
  role_world_sizes=[3, 3, 3]                                                                                                                           
  global_world_sizes=[3, 3, 3]                                                                                                                         

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group      
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module optixutils_plugin...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Config / Flags:                                                                                                                                        
---------                                                                                                                                              
iter 500                                                                                                                                               
batch 4                                                                                                                                                
spp 1
layers 1
train_res [2048, 2048]
display_res [2048, 2048]
texture_res [2048, 2048]
display_interval 0
save_interval 100
learning_rate [0.03, 0.005]
custom_mip False
background white
loss logl1
out_dir out/SHOG2TA2CKZ7ERNU
config /nvdiffrecmc/configs/manual/shoe.json
ref_mesh SHOG2TA2CKZ7ERNU
base_mesh None
validate True
n_samples 12
bsdf pbr
denoiser bilateral
denoiser_demodulate True
save_custom 3D/vertical/footwear
vertical Footwear
mtl_override None
dmtet_grid 128
mesh_scale 2.5
envlight None
env_scale 1.0
probe_res 256
learn_lighting True
display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
transparency False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative                                                                                                                                       
laplace_scale 3000.0                                                                                                                                   
pre_load True                                                                                                                                          
no_perturbed_nrm False                                                                                                                                 
decorrelated False                                                                                                                                     
kd_min [0.03, 0.03, 0.03]                                                                                                                              
kd_max [0.8, 0.8, 0.8]
ks_min [0, 0.08, 0]
ks_max [0, 1, 1]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
clip_max_norm 0.0
cam_near_far [0.1, 1000.0]
lambda_kd 0.1
lambda_ks 0.05
lambda_nrm 0.025
lambda_nrm2 0.25
lambda_chroma 0.025
lambda_diffuse 0.15
lambda_specular 0.0025
local_rank 0
multi_gpu True
random_textures True
---------
DatasetLLFF: 92 images with shape [1080, 1920]
DatasetLLFF: auto-centering at [ 0.24934715  0.38134477 -0.13031025]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5527 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5529 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5527 via 15, forcefully exitting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5529 via 15, forcefully exitting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 5528) of binary: /opt/conda/bin/python3.8
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007698535919189453 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-16_11:59:10
  host      : fea5089b80d7
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 5528)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 5528
======================================================
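
The root cause reported by the launcher is a segfault (Signal 11, exitcode -11) in local_rank 1, with no error file written. The log itself suggests decorating the entrypoint with @record so the elastic launcher can record failure details (see https://pytorch.org/docs/stable/elastic/errors.html). A minimal sketch of that change, assuming train.py's top-level logic can be moved into a main() function (the actual structure of train.py may differ):

# Hypothetical sketch for train.py: let torchrun record failures from this rank.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # ... existing argument parsing and training loop from train.py ...
    pass

if __name__ == "__main__":
    main()

Note that @record captures Python exceptions; a SIGSEGV kills the worker process outright, so even with the decorator the traceback may stay empty, and the crash would likely have to be narrowed down from the worker's own stderr (for example by running a single process without torchrun).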