NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

docker CUDA error #114

Open · aiertamundarain opened 11 months ago

aiertamundarain commented 11 months ago

I have launched a training run in the neuralangelo docker image and it raises the following error:

```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

chenhsuanlin commented 11 months ago

Hi @aiertamundarain, could you post the full error log? Thanks!

aiertamundarain commented 11 months ago

Hi @chenhsuanlin, this is the error log:

```
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 93, in main
    trainer.train(cfg,
  File "/workspace/neuralangelo/projects/neuralangelo/trainer.py", line 107, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/workspace/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 504, in train
    self.train_step(data, last_iter_in_epoch=(it == len(data_loader) - 1))
  File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 441, in train_step
    total_loss = self.model_forward(data)
  File "/workspace/neuralangelo/projects/nerf/trainers/base.py", line 92, in model_forward
    output = self.model(data)  # data = self.model(data) will not return the same data in the case of DDP.
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1137, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1091, in _run_ddp_forward
    return self.module(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/neuralangelo/projects/neuralangelo/model.py", line 68, in forward
    output = self.render_pixels(data["pose"], data["intr"], image_size=self.image_size_train,
  File "/workspace/neuralangelo/projects/neuralangelo/model.py", line 121, in render_pixels
    output = self.render_rays(center, ray_unit, sample_idx=sample_idx, stratified=stratified)
  File "/workspace/neuralangelo/projects/neuralangelo/model.py", line 130, in render_rays
    output_background = self.render_rays_background(center, ray_unit, far, app_outside, stratified=stratified)
  File "/workspace/neuralangelo/projects/neuralangelo/model.py", line 197, in render_rays_background
    rgbs, densities = self.background_nerf.forward(points, rays_unit, app_outside)  # [B,R,N,3]
  File "/workspace/neuralangelo/projects/neuralangelo/utils/modules.py", line 273, in forward
    points_enc = self.encode(points_3D)  # [...,4+LD]
  File "/workspace/neuralangelo/projects/neuralangelo/utils/modules.py", line 299, in encode
    points_enc = nerf_util.positional_encoding(points, num_freq_bases=self.cfg_background.encoding.levels)
  File "/workspace/neuralangelo/projects/nerf/utils/nerf_util.py", line 142, in positional_encoding
    freq = 2 ** torch.arange(num_freq_bases, dtype=torch.float32, device=input.device) * np.pi  # [L]
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 40, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 878, in __rpow__
    return torch.tensor(other, dtype=dtype, device=self.device) ** self
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
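The last frames show the error surfacing inside the background NeRF's positional encoding, but as the message itself notes, the faulting kernel may have launched earlier and only been reported here. Below is a minimal, self-contained sketch of the operation the trace ends on (placeholder shapes and values, not the repository's exact code):

```python
import numpy as np
import torch

# Placeholder stand-ins for the real inputs (the background `points` tensor and
# the configured number of frequency bands); values here are illustrative only.
num_freq_bases = 10
input = torch.randn(2, 512, 3, device="cuda")

# The trace ends on this line: `2 ** tensor` dispatches to Tensor.__rpow__, which may
# simply be the next CUDA call to observe an illegal access from an earlier kernel.
freq = 2 ** torch.arange(num_freq_bases, dtype=torch.float32, device=input.device) * np.pi  # [L]
```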

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1092) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

train.py FAILED

chenhsuanlin commented 11 months ago

It doesn't look like the correct stack trace. Could you share the error log with `CUDA_LAUNCH_BLOCKING=1` as suggested in the error message? (It would also be great if the log could be formatted as code in the issue/comments!)
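For example, something along these lines at the very top of train.py (or, equivalently, prefixing your usual launch command with `CUDA_LAUNCH_BLOCKING=1`) makes kernel launches synchronous so the trace points at the kernel that actually faulted; this is a generic sketch, not a documented project workflow:

```python
import os

# Force synchronous CUDA kernel launches so the Python stack trace points at the
# faulting kernel; must run before the first CUDA call in each worker process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```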