BaowenZ / RaDe-GS

RaDe-GS: Rasterizing Depth in Gaussian Splatting

CUDA error when I use my own dataset. #4

Open Liu-SD opened 3 months ago

Liu-SD commented 3 months ago

The resolution of my dataset is 5236x3909. I scale the resolution down by a factor of 4, so the actual render resolution is 1309x977.

Now I get the runtime error as follows:

cameras extent: 381.5180541992188 [19/06 15:31:45]
Loading Training Cameras: 10 . [19/06 15:56:00]
0it [00:00, ?it/s]
Loading Test Cameras: 0 . [19/06 15:56:00]
Number of points at initialisation : 23947 [19/06 15:56:00]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/liu/nerf/RaDe-GS/train.py", line 312, in <module>
    training(dataset=lp.extract(args),
  File "/home/liu/nerf/RaDe-GS/train.py", line 115, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/liu/nerf/RaDe-GS/gaussian_renderer/__init__.py", line 87, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered

What's the reason and how to solve it? Thanks a lot!

brianneoberson commented 3 months ago

Hi, I get this error even when training on the DTU (scan24) dataset. Would also appreciate some help with this. :)

edit: I am using an RTX 6000 with CUDA 11.8

BaowenZ commented 3 months ago

Hi! It seems the error happens in the CUDA part, but I currently have no idea what causes it. I tested the code on two machines with different GPUs (H800 and 4080) and can't reproduce this error. I would appreciate it if you could provide more information. Thank you!

LinzhouLi commented 3 months ago

Hi! I encounter the same issue on an RTX 3090 with CUDA 11.8:

Traceback (most recent call last):
  File "/home/code/RaDe-GS/train.py", line 317, in <module>
    training(dataset=lp.extract(args),
  File "/home/code/RaDe-GS/train.py", line 160, in training
    distortion_loss = torch.tensor([0],dtype=torch.float32,device="cuda")
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
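
To get a traceback that points at the actual failing kernel, the `CUDA_LAUNCH_BLOCKING=1` hint above can be applied like this (a minimal sketch; the variable has to be set before torch initializes CUDA, and substitute your own train.py arguments in the shell form):

    import os

    # Must be set before torch initializes CUDA so kernel launches become synchronous
    # and the reported traceback points at the failing kernel rather than a later API call.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    # Shell alternative: CUDA_LAUNCH_BLOCKING=1 python train.py <your usual arguments>
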
zhouilu commented 3 months ago

Same error. I checked render's inputs: scale, rotation, and opacity contain NaNs. Why?
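
Roughly, the check looks like this (a sketch assuming the model exposes the usual 3DGS-style accessors `get_xyz`/`get_scaling`/`get_rotation`/`get_opacity`; adjust if the names differ in this repo):

    import torch

    def report_nonfinite(gaussians, iteration):
        # Check the tensors that go into render() for NaN/Inf values.
        tensors = {
            "xyz": gaussians.get_xyz,
            "scaling": gaussians.get_scaling,
            "rotation": gaussians.get_rotation,
            "opacity": gaussians.get_opacity,
        }
        for name, t in tensors.items():
            bad = ~torch.isfinite(t)
            if bad.any():
                print(f"[iter {iteration}] {name}: {int(bad.sum())} non-finite values")

Calling this right before render() in the training loop narrows down which iteration first produces NaNs.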

BaowenZ commented 3 months ago

Thank you for the info. This issue seems to be machine-related. Currently, an RTX 4080 with CUDA 12.1 works well. I'm looking for other machines to reproduce this error and fix it.

LinzhouLi commented 3 months ago

I found this issue still exists with CUDA 12.1 and an RTX 3090. It occasionally happens during training.

Training progress:  85%|████████████████████████████████████████████████████▉         | 25630/30000 [25:43<03:22, 21.63it/s, Loss=0.0226, loss_dep=0.0000, loss_normal=0.1220]
Traceback (most recent call last):
  File "/home/code/RaDe-GS/train.py", line 317, in <module>
    training(dataset=lp.extract(args),
  File "/home/code/RaDe-GS/train.py", line 150, in training
    depth_middepth_normal, _ = depth_double_to_normal(viewpoint_cam, rendered_depth, rendered_middepth)
  File "/home/code/RaDe-GS/utils/graphics_utils.py", line 118, in depth_double_to_normal
    points1, points2 = depths_double_to_points(view, depth1, depth2)
  File "/home/code/RaDe-GS/utils/graphics_utils.py", line 105, in depths_double_to_points
    ).float().cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
MrNeRF commented 3 months ago

Can confirm! When it does not crash, it consistently produces results on custom data like the attached rendering output (I did not try any of the official data). I tried deactivating the appearance embedding, but that does not help. Might be due to the distortion loss? Not sure. But apparently there is a bug in the rasterizer implementation. [Screenshot from 2024-06-19 10-24-54]

WUMINGCHAzero commented 3 months ago

Gradients become NaN after the backward pass on custom data. Need help. Thanks! Env: torch 1.13.1+cu117, A800 GPU

A quick test: this gradient error still exists after updating forward.cu from your PR.
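
For anyone else chasing this, a minimal sketch of how the NaN gradients can be located (the `_xyz`/`_scaling`/`_rotation`/`_opacity` tensor names follow the usual 3DGS layout and are an assumption about this repo):

    import torch

    # Debug only: makes backward() raise at the op that produced the first NaN (very slow).
    torch.autograd.set_detect_anomaly(True)

    def check_grads(gaussians, iteration):
        # Call right after loss.backward() to see which parameter gradients went non-finite.
        params = {
            "xyz": gaussians._xyz,
            "scaling": gaussians._scaling,
            "rotation": gaussians._rotation,
            "opacity": gaussians._opacity,
        }
        for name, p in params.items():
            if p.grad is not None and not torch.isfinite(p.grad).all():
                print(f"[iter {iteration}] non-finite gradient in {name}")
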

RongLiu-Leo commented 3 months ago

Same error. It happens only occasionally, e.g. running the experiment 5 times and succeeding once.

MELANCHOLY828 commented 3 months ago

I've encountered the same issue with CUDA 12.1.

zhanghaoyu816 commented 3 months ago

I have also encountered the same issue on an RTX 4090 with CUDA 11.8, PyTorch 2.1.2, and Ubuntu 22.04. As others mentioned earlier, this error occurs randomly during training.

Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/Project/Gaussians/RaDe-GS/train.py", line 312, in <module>
    training(dataset=lp.extract(args),
  File "/home/ubuntu/Project/Gaussians/RaDe-GS/train.py", line 160, in training
    distortion_loss = torch.tensor([0],dtype=torch.float32,device="cuda")
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

One solution that might help is issues/41, but I haven't tried it...

tkuye commented 3 months ago

Same error as well. NaN gradients on two different datasets.

BaowenZ commented 3 months ago

Thank you for the important information. I have fixed the problem. Please update the code.

MrNeRF commented 3 months ago

Thanks, seems to be fixed. However, the quality is similar to the image posted above. Any idea where this might come from?

Li-colonel commented 3 months ago

Thanks, seems to be fixed. However, the quality is similar to the image posted above. Any idea where this might come from?

Have you verified whether it is due to the distortion loss? A similar issue was reported in 2DGS, and they then changed the default value of the corresponding hyperparameter to 0.0.

MrNeRF commented 3 months ago

Hmm, the results are already extremely poor after 7k iterations. The distortion and normal consistency losses kick in at 15k, so that's not the reason. My guess is that something in the rasterizer is broken. Strangely, it reports quite a good PSNR.

image

BaowenZ commented 3 months ago

Hmm, the results are already extremely poor after 7k iterations. The distortion and normal consistency losses kick in at 15k, so that's not the reason. My guess is that something in the rasterizer is broken. Strangely, it reports quite a good PSNR.

image

Are you using the viewer in this Repository?

MrNeRF commented 3 months ago

I printed every 100th image. The images are very good, different from what I see in the viewer. Maybe there is some conversion issue while saving the ply file?

MrNeRF commented 3 months ago

Are you using the viewer in this Repository?

No, that might be the reason. What did you change? Maybe it's caused by the mip filter?

BaowenZ commented 3 months ago

Are you using the viewer in this Repository?

No, that might be the reason. What did you change? Maybe it's caused by the mip filter?

Yes, I made some modifications to the 3D filters. You can use it in the same way as the original viewer. I think we have found the reason, and I'll update the README for the viewer. Looking forward to good news.

MrNeRF commented 3 months ago

That was obviously the issue. The rendering is actually quite nice and confirms the reported PSNR. Thanks for the help.

MELANCHOLY828 commented 3 months ago

[image] The same issue: the Gaussians don't look good, but when I check the rendered images and the extracted mesh, the results are actually very good. Why is that?

BaowenZ commented 3 months ago

[image] The same issue: the Gaussians don't look good, but when I check the rendered images and the extracted mesh, the results are actually very good. Why is that?

Please use the viewer.

WUMINGCHAzero commented 3 months ago

I'm curious why the 3D filter has such a large influence on the rendering results. Could you please explain a bit more? Thanks!

BaowenZ commented 3 months ago

I'm curious why the 3D filter has such a large influence on the rendering results. Could you please explain a bit more? Thanks!

I can't open the .ply files with the original viewer, so I can't reproduce it. But I guess the .ply files are parsed incorrectly because other code doesn't know my format (the meaning or order of the variables in the file).
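
A quick way to see what an external viewer would have to parse is to list the per-vertex properties in the exported file, e.g. with the `plyfile` package (a sketch; the path is just an example following the usual 3DGS output layout):

    from plyfile import PlyData  # pip install plyfile

    # Example path; point this at the point_cloud.ply your training run exported.
    ply = PlyData.read("output/your_run/point_cloud/iteration_30000/point_cloud.ply")

    vertex = ply["vertex"]
    print(vertex.count, "Gaussians stored")
    for prop in vertex.properties:
        # The property names and their order are exactly what an external viewer must understand.
        print(prop.name)
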

Mikael-Spotscale commented 3 months ago

I can confirm that the latest updates fixed the CUDA error for me.

For anybody else in the same situation: don't forget to reinstall the module with `pip uninstall diff-gaussian-rasterization -y && pip install submodules/diff-gaussian-rasterization` after pulling the code.

lala-sean commented 1 month ago

Hi @Liu-SD @zhouilu, may I ask whether you solved this issue? I also encountered it at the first iteration and couldn't work it out by updating the submodule.