CUHK-AIM-Group / EndoGaussian

EndoGaussian: Real-time Gaussian Splatting for Dynamic Endoscopic Scene Reconstruction
https://yifliu3.github.io/EndoGaussian/
MIT License

CUDA error after opacity resets #23

Closed lolwarmaze closed 2 months ago

lolwarmaze commented 2 months ago

Hi, I have been trying to train a stereo scene with left and right camera images. I was getting good rendering quality on the trained cameras; however, when I render from a viewpoint between the two training cameras, I see many floaters. I therefore modified the optimization parameters to reset the opacity every 500 iterations (see the config sketch at the end of this comment). But after an opacity reset, training continues for a few more iterations and then I get a CUDA error like this:

(hustvl4dgs) zomhussa@zjlxq00046:~/4dgs/Stereo4DGS$ python train.py -s data/hanging_rings/ --port 6017 --expname hanging_rings_sopacityreset --configs arguments/default.py 
Optimizing 
Output folder: ./output/hanging_rings_sopacityreset [24/06 09:25:44]
feature_dim: 256 [24/06 09:25:44]
meta data loaded, total left and right image pairs:245 [24/06 09:25:47]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 214/214 [04:03<00:00,  1.14s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:26<00:00,  1.16it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 245/245 [01:43<00:00,  2.36it/s]
Found poses_bounds.py and extra marks with EndoNeRf [24/06 09:33:46]
self.cameras_extent is  1.1 [24/06 09:33:46]
Loading Training Cameras [24/06 09:33:46]
Loading Test Cameras [24/06 09:33:46]
Loading Video Cameras [24/06 09:33:46]
Voxel Plane: set aabb= Parameter containing:
tensor([[ 215.1176,  129.0609,  255.0000],
        [-190.8275, -129.7819,    2.0000]], requires_grad=True) [24/06 09:33:46]
Number of points at initialisation :  30000 [24/06 09:33:47]
Training progress:   0%|                                                                                                                           | 0/1000 [00:00<?, ?it/s]Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off] [24/06 09:33:49]
Loading model from: /home/zomhussa/anaconda3/envs/hustvl4dgs/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth [24/06 09:33:51]
Training progress: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [01:12<00:00, 13.88it/s, Loss=0.0259888, psnr=27.99, point=30122]
reset opacity [24/06 09:35:01]
Training progress:   0%|                                                                                                                           | 0/3999 [00:00<?, ?it/s]Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off] [24/06 09:35:01]
Loading model from: /home/zomhussa/anaconda3/envs/hustvl4dgs/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth [24/06 09:35:04]
Training progress:  13%|█████████▏                                                              | 510/3999 [02:33<12:03,  4.82it/s, Loss=0.1240873, psnr=19.69, point=32174]Traceback (most recent call last):
  File "/home/zomhussa/4dgs/Stereo4DGS/train.py", line 358, in <module>
    training(lp.extract(args), hp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, \
  File "/home/zomhussa/4dgs/Stereo4DGS/train.py", line 249, in training
    scene_reconstruction(dataset, opt, hyper, pipe, testing_iterations, saving_iterations,
  File "/home/zomhussa/4dgs/Stereo4DGS/train.py", line 115, in scene_reconstruction
    gt_image = viewpoint_cam.original_image.cuda().float()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Training progress:  13%|█████████▏                                                              | 510/3999 [02:39<18:08,  3.21it/s, Loss=0.1240873, psnr=19.69, point=32174]

What can I do so that I can use the opacity reset feature without running into errors? Thanks
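
For reference, the change I made is roughly the following. This is only a sketch: the config layout and the `opacity_reset_interval` field name follow the 3DGS/4DGS-style configs this repo builds on and may differ slightly here.

```python
# arguments/default.py (sketch; the dict layout and field name are
# assumptions based on 3DGS/4DGS-style configs, not copied from the repo)
OptimizationParams = dict(
    opacity_reset_interval=500,  # reset Gaussian opacities every 500 iterations
)
```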

yifliu3 commented 2 months ago

Hi, thanks for your interest.

This error can have several causes; here are some potential solutions:

  1. Drop the depth constraints to avoid numerical instability
  2. Use gradient clipping to suppress abnormally large gradients (see the sketch below)
  3. Tune the learning rate

Hope this can solve your problem.
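
A minimal sketch of option 2, assuming the Gaussian model exposes its Adam optimizer as `gaussians.optimizer` as in 3DGS-style code (`torch.nn.utils.clip_grad_norm_` is standard PyTorch; the helper name and the `max_norm` value are mine):

```python
import torch


def clip_gaussian_grads(optimizer: torch.optim.Optimizer, max_norm: float = 1.0) -> None:
    """Clip the gradients of every parameter the optimizer updates.

    Call this after loss.backward() and before optimizer.step();
    max_norm is a tuning knob, not a value taken from the repo.
    """
    params = [p for group in optimizer.param_groups for p in group["params"]]
    torch.nn.utils.clip_grad_norm_(params, max_norm=max_norm)
```

For example, calling `clip_gaussian_grads(gaussians.optimizer)` right after `loss.backward()` in `scene_reconstruction` would cap the gradient norm before the optimizer step.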

lolwarmaze commented 2 months ago

Thank you, I will try these solutions. One more question: I have also had a few cases where, during the "fine" stage, the PSNR suddenly drops from the 20-30 range to between 5 and 6 and stays there until training completes. What could be the reason for this?

yifliu3 commented 2 months ago

I think that is caused by unstable optimization: the network has fallen into a local minimum. Typically this can be solved by tuning the learning rates, so you can give that a try :)
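
For instance, lowering the learning rates of the deformation field and the voxel grid in the config is one place to start. This is only a sketch: the field names follow the 4DGS-style configs this repo builds on and the values are placeholders, so check the OptimizationParams definition in your fork for the exact names and defaults.

```python
# arguments/default.py (sketch; field names and values are assumptions
# based on 4DGS-style configs, not copied from the repo)
OptimizationParams = dict(
    position_lr_init=0.00016,     # learning rate of Gaussian positions
    deformation_lr_init=0.00016,  # learning rate of the deformation network
    grid_lr_init=0.0016,          # learning rate of the voxel/HexPlane grid
)
```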