fudan-zvg / 4d-gaussian-splatting

[ICLR 2024] Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting
MIT License

CUDA error in training #21

Closed fishfishson closed 8 months ago

fishfishson commented 8 months ago

Hi author,

Thanks for your great repo on dynamic scene rendering. I tried to train the 'cut_roasted_beef' scene with your code but got the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

The full training log is:

Using /dellnas/home/4dg/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /dellnas/home/4dg/.cache/torch_extensions/py37_cu116/diff_gaussian_rasterization/build.ninja...
Building extension module diff_gaussian_rasterization...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module diff_gaussian_rasterization...
Optimizing output/N3V/cut_roasted_beef
Output folder: output/N3V/cut_roasted_beef [21/02 15:49:42]
Tensorboard not available: not logging progress [21/02 15:49:42]
Found transforms_train.json file, assuming Blender data set! [21/02 15:49:42]
Reading Training Transforms [21/02 15:49:42]
100%|██████████| 5700/5700 [00:01<00:00, 3104.34it/s]
Reading Test Transforms [21/02 15:49:44]
100%|██████████| 300/300 [00:00<00:00, 2471.07it/s]
Loading Training Cameras [21/02 15:49:44]
Loading Test Cameras [21/02 15:49:45]
Number of points at initialisation : 300000 [21/02 15:49:46]
100%|██████████| 5/5 [00:01<00:00, 3.85it/s]
100%|██████████| 5/5 [00:01<00:00, 3.98it/s]
[ITER 500] Evaluating train: L1 0.027832254767417908 PSNR 27.3373046875 [21/02 15:53:44]
100%|██████████| 300/300 [01:07<00:00, 4.43it/s]
100%|██████████| 300/300 [01:07<00:00, 4.75it/s]
[ITER 500] Evaluating test: L1 0.02017407544578115 PSNR 29.758896795908612 [21/02 15:54:52]
[ITER 500] Saving best checkpoint [21/02 15:54:52]
Training progress: 2%|█ | 600/30000 [05:27<3:50:45, 2.12it/s, Loss=0.0107820, PSNR=26.70, Ll1=0.0282, Lssim=0.1027]
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
  File "train.py", line 240, in training
    gaussians.densify_and_prune(opt.densify_grad_threshold, opt.thresh_opa_prune, scene.cameras_extent, size_threshold, opt.densify_grad_t_threshold)
  File "/dellnas/home/4dg/project/4d-gaussian-splatting/scene/gaussian_model.py", line 563, in densify_and_prune
    self.densify_and_split(grads, max_grad, extent, grads_t, max_grad_t)
  File "/dellnas/home/4dg/project/4d-gaussian-splatting/scene/gaussian_model.py", line 516, in densify_and_split
    rots = build_rotation_4d(self._rotation[selected_pts_mask], self._rotation_r[selected_pts_mask]).repeat(N,1,1)
  File "/dellnas/home/4dg/project/4d-gaussian-splatting/utils/general_utils.py", line 131, in build_rotation_4d
    A = M_l @ M_r
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)
Training progress: 2%|█ | 600/30000 [05:29<4:29:03, 1.82it/s, Loss=0.0107820, PSNR=26.70, Ll1=0.0282, Lssim=0.1027]

It seems like the first epoch runs fine, but something goes wrong in the second epoch. Could you please help solve this problem?
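
For reference, the traceback ends in `build_rotation_4d`, which performs a batched 4×4 matrix multiply (`A = M_l @ M_r`) over all selected points. Below is a minimal sketch (not part of the repo; the function name and point count are illustrative) that isolates this batched matmul so cuBLAS can be tested on the machine independently of the training loop:

```python
# Hypothetical standalone check: rebuilds the kind of batched 4x4 matmul that the
# traceback points at, to see whether cuBLAS itself fails on this GPU/driver setup.
import torch

def check_batched_matmul(num_points: int = 300_000, device: str = "cuda"):
    # densify_and_split multiplies two (N, 4, 4) rotation matrices per selected
    # Gaussian; 300k is a guess in the same ballpark as the initial point count.
    M_l = torch.randn(num_points, 4, 4, device=device)
    M_r = torch.randn(num_points, 4, 4, device=device)
    A = M_l @ M_r                 # dispatches to cublasSgemmStridedBatched
    torch.cuda.synchronize()      # force any asynchronous CUDA error to surface here
    return A.shape

if __name__ == "__main__":
    print(check_batched_matmul())  # expected: torch.Size([300000, 4, 4])
```

If this small reproduction also fails, the problem likely lies in the CUDA/cuBLAS installation or driver on the node rather than in the training code. Since CUDA errors are reported asynchronously, running the original training with `CUDA_LAUNCH_BLOCKING=1` can also help confirm that the reported line is the real failure point.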

fishfishson commented 8 months ago

It's my server's problem.

Dionysus326 commented 3 months ago

@fishfishson Hi, may I ask how you solved your problem? And what was the problem with your server? I also encountered the same error after the first epoch.