CUHK-AIM-Group / EndoGaussian

EndoGaussian: Real-time Gaussian Splatting for Dynamic Endoscopic Scene Reconstruction
https://yifliu3.github.io/EndoGaussian/
MIT License

CUDA error after opacity resets #23

Closed lolwarmaze closed 2 months ago

lolwarmaze commented 2 months ago

Hi, I have been trying to train a stereo scene with left and right camera images. I was getting good rendering quality on the trained cameras; however, when I render from a viewpoint between the two training cameras, I see many floaters. I therefore modified the optimization parameters to reset the opacity every 500 iterations (see the config sketch at the end of this comment). But after an opacity reset, training continues for a few more iterations and then I get a CUDA error like this:

(hustvl4dgs) zomhussa@zjlxq00046:~/4dgs/Stereo4DGS$ python train.py -s data/hanging_rings/ --port 6017 --expname hanging_rings_sopacityreset --configs arguments/default.py 
Optimizing 
Output folder: ./output/hanging_rings_sopacityreset [24/06 09:25:44]
feature_dim: 256 [24/06 09:25:44]
meta data loaded, total left and right image pairs:245 [24/06 09:25:47]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 214/214 [04:03<00:00,  1.14s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:26<00:00,  1.16it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 245/245 [01:43<00:00,  2.36it/s]
Found poses_bounds.py and extra marks with EndoNeRf [24/06 09:33:46]
self.cameras_extent is  1.1 [24/06 09:33:46]
Loading Training Cameras [24/06 09:33:46]
Loading Test Cameras [24/06 09:33:46]
Loading Video Cameras [24/06 09:33:46]
Voxel Plane: set aabb= Parameter containing:
tensor([[ 215.1176,  129.0609,  255.0000],
        [-190.8275, -129.7819,    2.0000]], requires_grad=True) [24/06 09:33:46]
Number of points at initialisation :  30000 [24/06 09:33:47]
Training progress:   0%|                                                                                                                           | 0/1000 [00:00<?, ?it/s]Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off] [24/06 09:33:49]
Loading model from: /home/zomhussa/anaconda3/envs/hustvl4dgs/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth [24/06 09:33:51]
Training progress: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [01:12<00:00, 13.88it/s, Loss=0.0259888, psnr=27.99, point=30122]
reset opacity [24/06 09:35:01]
Training progress:   0%|                                                                                                                           | 0/3999 [00:00<?, ?it/s]Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off] [24/06 09:35:01]
Loading model from: /home/zomhussa/anaconda3/envs/hustvl4dgs/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth [24/06 09:35:04]
Training progress:  13%|█████████▏                                                              | 510/3999 [02:33<12:03,  4.82it/s, Loss=0.1240873, psnr=19.69, point=32174]Traceback (most recent call last):
  File "/home/zomhussa/4dgs/Stereo4DGS/train.py", line 358, in <module>
    training(lp.extract(args), hp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, \
  File "/home/zomhussa/4dgs/Stereo4DGS/train.py", line 249, in training
    scene_reconstruction(dataset, opt, hyper, pipe, testing_iterations, saving_iterations,
  File "/home/zomhussa/4dgs/Stereo4DGS/train.py", line 115, in scene_reconstruction
    gt_image = viewpoint_cam.original_image.cuda().float()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Training progress:  13%|█████████▏                                                              | 510/3999 [02:39<18:08,  3.21it/s, Loss=0.1240873, psnr=19.69, point=32174]

What can I do so that I can use the opacity reset feature without running into errors? Thanks
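
For reference, the change I made is roughly the following. This is only a sketch: the config layout and the `opacity_reset_interval` field name follow the 3DGS/4DGS-style configs this repo builds on and may differ slightly here.

```python
# arguments/default.py (sketch; the dict layout and field name are
# assumptions based on 3DGS/4DGS-style configs, not copied from the repo)
OptimizationParams = dict(
    opacity_reset_interval=500,  # reset Gaussian opacities every 500 iterations
)
```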

yifliu3 commented 2 months ago

Hi, thanks for your interest.

This error can have several causes; here are some potential solutions:

  1. Drop the depth constraints to avoid numerical instability
  2. Use gradient clipping to suppress abnormally large gradients (see the sketch below)
  3. Tune the learning rate

Hope this can solve your problem.
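
A minimal sketch of option 2, assuming the Gaussian model exposes its Adam optimizer as `gaussians.optimizer` as in 3DGS-style code (`torch.nn.utils.clip_grad_norm_` is standard PyTorch; the helper name and the `max_norm` value are mine):

```python
import torch


def clip_gaussian_grads(optimizer: torch.optim.Optimizer, max_norm: float = 1.0) -> None:
    """Clip the gradients of every parameter the optimizer updates.

    Call this after loss.backward() and before optimizer.step();
    max_norm is a tuning knob, not a value taken from the repo.
    """
    params = [p for group in optimizer.param_groups for p in group["params"]]
    torch.nn.utils.clip_grad_norm_(params, max_norm=max_norm)
```

For example, calling `clip_gaussian_grads(gaussians.optimizer)` right after `loss.backward()` in `scene_reconstruction` would cap the gradient norm before the optimizer step.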

lolwarmaze commented 2 months ago

Thank you, I will try these solutions. One more question: I have also had a few cases where, during the "fine" stage, the PSNR suddenly drops from the 20-30 range to between 5 and 6 and stays there until training completes. What could be the reason for this?

yifliu3 commented 2 months ago

I think that is caused by unstable optimization: the network has fallen into a local minimum. Typically this can be solved by tuning the learning rates, so you can give that a try :)
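
For instance, lowering the learning rates of the deformation field and the voxel grid in the config is one place to start. This is only a sketch: the field names follow the 4DGS-style configs this repo builds on and the values are placeholders, so check the OptimizationParams definition in your fork for the exact names and defaults.

```python
# arguments/default.py (sketch; field names and values are assumptions
# based on 4DGS-style configs, not copied from the repo)
OptimizationParams = dict(
    position_lr_init=0.00016,     # learning rate of Gaussian positions
    deformation_lr_init=0.00016,  # learning rate of the deformation network
    grid_lr_init=0.0016,          # learning rate of the voxel/HexPlane grid
)
```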