Open xi-zc opened 3 weeks ago
Hi, thank you for your interest. The CUDA code in this repo targets general 3DGS and is not specific to RT-4DGS, which differs slightly from other 3DGS + deformation-field designs (as you mentioned, RT-4DGS has an additional time_stamp variable). Due to policy issues and other pending submissions, I cannot release the RT-4DGS-specific CUDA code used in our paper's experiments at this time. However, it is fairly straightforward to adapt the code here to RT-4DGS.
Thank you very much for your guidance and suggestions. I have been working on implementing GaussianFlow in RT-4DGS. Currently, the training speed is approximately the same as RT-4DGS, but the PSNR is lower. Additionally, the program consistently encounters an error at iteration=600, with the following error message:
Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
CUDA error: an illegal memory access was encountered
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
self, gradient, retain_graph, create_graph, inputs=inputs
File "/home/xzc/code/gsflow/train.py", line 227, in training
loss.backward()
File "/home/xzc/code/gsflow/train.py", line 479, in <module>
args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 193, in _run_module_as_main (Current frame)
"__main__", mod_spec)
RuntimeError: CUDA error: an illegal memory access was encountered
Exception has occurred: _LinAlgError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf( handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`. This error may appear if the input matrix contains NaN.
File "/home/xzc/code/gsflow/scene/cameras.py", line 79, in get_rays
c2w = torch.linalg.inv(self.world_view_transform.transpose(0, 1))
File "/home/xzc/code/gsflow/gaussian_renderer/__init__.py", line 179, in render
rays_o, rays_d = viewpoint_camera.get_rays()
File "/home/xzc/code/gsflow/train.py", line 112, in training
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
File "/home/xzc/code/gsflow/train.py", line 479, in <module>
args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 193, in _run_module_as_main (Current frame)
"__main__", mod_spec)
torch._C._LinAlgError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf( handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`. This error may appear if the input matrix contains NaN.
I have also sent a more detailed email to your inbox and look forward to your response.
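Both traces may be reported later than the fault that caused them. CUDA kernels run asynchronously, so an illegal memory access is often attributed to a later call such as `loss.backward()`, and the cuSOLVER message itself suggests a NaN in the matrix passed to `torch.linalg.inv`. The following is a minimal sketch of two checks that could help narrow this down. It assumes nothing about the repository beyond the names in the traces: `first_non_finite` is a hypothetical helper, and it takes the matrix as nested lists (for example `self.world_view_transform.tolist()`) so that it runs without PyTorch.

```python
import math
import os

# Make CUDA kernel launches synchronous so the Python traceback points at the
# call that actually failed. This has to be set before CUDA is initialised,
# i.e. before the training script touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def first_non_finite(matrix):
    """Return (row, col) of the first NaN/Inf entry, or None if all are finite.

    `matrix` is a nested list of floats, e.g. the result of `tensor.tolist()`.
    Hypothetical helper, not part of the repository.
    """
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                return (r, c)
    return None


# Illustrative 4x4 matrices standing in for a camera transform.
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
broken = [row[:] for row in identity]
broken[2][3] = float("nan")

print(first_non_finite(identity))
print(first_non_finite(broken))
```

Such a check could run in `get_rays` just before the inversion. If the matrix turns out to be finite, the cuSOLVER failure may simply be a follow-on from the earlier illegal memory access, since a CUDA context generally stays in an error state after such a fault.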
Thank you very much for your response. Following your guidance, I removed the following lines of code:
for (int ch_var = 0; ch_var < 20; ch_var++)
{
    // atomicAdd(&(dummy_gs_per_pixel[global_id * 20 + ch_var]), 0.0);
    atomicAdd(&(dummy_weight_per_gs_pixel[global_id * 20 + ch_var]), alpha * T);
}
This resolved the issue encountered at iteration = 600. However, I am still encountering the following problem at iteration = 1000:
Exception has occurred: OutOfMemoryError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
CUDA out of memory. Tried to allocate 2.79 GiB (GPU 0; 23.65 GiB total capacity; 15.47 GiB already allocated; 2.07 GiB free; 20.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 441, in cat_tensors_to_optimizer
stored_state["exp_avg_sq"] = torch.cat((stored_state["exp_avg_sq"], torch.zeros_like(extension_tensor)), dim=0)
File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 476, in densification_postfix
optimizable_tensors = self.cat_tensors_to_optimizer(d)
File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 563, in densify_and_clone
self.densification_postfix(new_xyz, new_features_dc, new_features_rest, new_opacities, new_scaling, new_rotation, new_t, new_scaling_t, new_rotation_r)
File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 575, in densify_and_prune
self.densify_and_clone(grads, max_grad, extent, grads_t, max_grad_t)
File "/home/xzc/code/gsflow/train.py", line 305, in training
gaussians.densify_and_prune(opt.densify_grad_threshold, opt.thresh_opa_prune, scene.cameras_extent, size_threshold, opt.densify_grad_t_threshold)
File "/home/xzc/code/gsflow/train.py", line 480, in <module>
args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 193, in _run_module_as_main (Current frame)
"__main__", mod_spec)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.79 GiB (GPU 0; 23.65 GiB total capacity; 15.47 GiB already allocated; 2.07 GiB free; 20.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I am using an RTX 4090 GPU. To address the issue of insufficient memory, I modified the following function, but the memory shortage problem persists.
def cat_tensors_to_optimizer(self, tensors_dict):
    optimizable_tensors = {}
    for group in self.optimizer.param_groups:
        assert len(group["params"]) == 1
        extension_tensor = tensors_dict[group["name"]]
        stored_state = self.optimizer.state.get(group['params'][0], None)
        if stored_state is not None:
            torch.cuda.empty_cache()
            new_params = torch.cat((group["params"][0], extension_tensor), dim=0)
            stored_state["exp_avg"] = torch.cat((stored_state["exp_avg"], torch.zeros_like(extension_tensor)), dim=0)
            stored_state["exp_avg_sq"] = torch.cat((stored_state["exp_avg_sq"], torch.zeros_like(extension_tensor)), dim=0)
            del self.optimizer.state[group['params'][0]]
            group["params"][0] = nn.Parameter(new_params.requires_grad_(True))
            self.optimizer.state[group['params'][0]] = stored_state
            optimizable_tensors[group["name"]] = group["params"][0]
        else:
            torch.cuda.empty_cache()
            group["params"][0] = nn.Parameter(torch.cat((group["params"][0], extension_tensor), dim=0).requires_grad_(True))
            optimizable_tensors[group["name"]] = group["params"][0]
    torch.cuda.empty_cache()
    return optimizable_tensors
I look forward to your further guidance. Thank you very much.
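The figures in the out-of-memory message above can be read more closely. The sketch below only parses the reported message with a hypothetical helper, `parse_oom`, and assumes the message format shown in the trace. It computes how much memory the caching allocator holds beyond live tensors, and whether the request fits in the memory reported as free.

```python
import re

# The message reported in the trace above, shortened to its numeric part.
message = (
    "CUDA out of memory. Tried to allocate 2.79 GiB (GPU 0; 23.65 GiB total "
    "capacity; 15.47 GiB already allocated; 2.07 GiB free; 20.24 GiB reserved "
    "in total by PyTorch)"
)


def parse_oom(text):
    """Extract the GiB figures from an OOM message in the format shown above."""
    pattern = (
        r"Tried to allocate ([\d.]+) GiB .*?([\d.]+) GiB total capacity; "
        r"([\d.]+) GiB already allocated; ([\d.]+) GiB free; "
        r"([\d.]+) GiB reserved"
    )
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("unrecognised OOM message format")
    keys = ("requested", "capacity", "allocated", "free", "reserved")
    return dict(zip(keys, map(float, match.groups())))


stats = parse_oom(message)
# Memory held by the caching allocator but not backing any live tensor.
cached_but_unused = stats["reserved"] - stats["allocated"]
print(f"cached but unused: {cached_but_unused:.2f} GiB")
print("request fits in free memory:", stats["requested"] <= stats["free"])
```

Two observations follow, both tentative. First, the single `torch.cat` on `exp_avg_sq` asks for 2.79 GiB, which suggests that one optimizer-state tensor is already of that order, and the old and new copies must coexist while the concatenation runs. Second, `torch.cuda.empty_cache()` only returns unused cached blocks to the driver and does not shrink live tensors, so it may not help when the tensors themselves are this large. The `max_split_size_mb` option of `PYTORCH_CUDA_ALLOC_CONF` mentioned in the message addresses fragmentation rather than total demand.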
The previous issue has been resolved. The cause of the out-of-memory problem was the continuous increase in the number of Gaussian kernels, which exceeded 3,000,000 by iteration 3000, resulting in insufficient GPU memory. I have addressed this by modifying the densify_grad_threshold to 0.35, thereby limiting the number of Gaussian kernels to 1,000,000. Could you please provide information on the parameters you used in RT-4DGS, including flow_thresh, flow_weight, and other settings in config/dynerf/.yaml? Additionally, have there been any modifications to the optimizer or scheduler? I appreciate your guidance and assistance.
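As a rough illustration of why the Gaussian count dominates memory, the sketch below estimates the persistent footprint of the parameters under Adam. The figure of 65 floats per Gaussian is an assumption chosen for illustration: 59 as in vanilla 3DGS with degree-3 SH, plus 1 for `t`, 1 for `scaling_t` and 4 for `rotation_r`. A configuration with 4D spherical harmonics would be considerably larger. Rasterizer buffers and activations are not included.

```python
BYTES_PER_FLOAT32 = 4
# One copy each for the parameter, its gradient, and Adam's exp_avg / exp_avg_sq.
COPIES_WITH_ADAM = 4


def persistent_gib(num_gaussians, floats_per_gaussian):
    """Estimate parameter + gradient + Adam state memory in GiB."""
    total_bytes = (
        num_gaussians * floats_per_gaussian * BYTES_PER_FLOAT32 * COPIES_WITH_ADAM
    )
    return total_bytes / 2**30


# ASSUMPTION: illustrative per-Gaussian size, not taken from the RT-4DGS code.
FLOATS_PER_GAUSSIAN = 65

for count in (1_000_000, 3_000_000):
    estimate = persistent_gib(count, FLOATS_PER_GAUSSIAN)
    print(f"{count:>9,d} Gaussians -> {estimate:.2f} GiB")
```

The estimate scales linearly with the number of Gaussians, so capping the count at 1,000,000 instead of 3,000,000 cuts this part of the footprint to a third, whatever the true per-Gaussian size is.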
Thank you for your outstanding work. I have attempted to apply the method described in your paper to RT-4DGS, specifically in the following segment:
`
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
image, viewspace_point_tensor, visibility_filter, radii = render_pkg["render"], render_pkg["viewspace_points"], render_pkg["visibility_filter"], render_pkg["radii"]
depth = render_pkg["depth"]
alpha = render_pkg["alpha"]
Ll1 = l1_loss(image, gt_image)
Lssim = 1.0 - ssim(image, gt_image)
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * Lssim
# Optical Flow Loss
if opt.lambda_flow > 0:
    start_time = time.time()
    # We detach the variables related to t_1 in GaussianFlow calculation, ensuring the gradient backward
    # Optical Flow Loss
`
As the code illustrates, render calls rasterize_gaussians from RT-4DGS, while flow_render calls rasterize_gaussians from your provided implementation. This distinction is necessary because the rasterize_gaussians in your implementation does not pass a time_stamp, which leads to gaussian._t.grad being None and hence prevents backpropagation.
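The symptom described above, `gaussian._t.grad` being `None`, can be checked for every parameter right after `loss.backward()`. The sketch below is duck-typed so that it runs without PyTorch. The helper `params_without_grad` and the stand-in objects are hypothetical, and `_xyz` is only an illustrative name. With real tensors one would pass pairs such as `("_t", gaussians._t)`.

```python
from types import SimpleNamespace


def params_without_grad(named_params):
    """Return the names of parameters whose `.grad` is None.

    `named_params` is an iterable of (name, obj) pairs where `obj` exposes a
    `.grad` attribute, as torch tensors do after a backward pass.
    """
    return [
        name
        for name, param in named_params
        if getattr(param, "grad", None) is None
    ]


# Stand-ins for tensors: one received a gradient, one did not.
fake_params = [
    ("_xyz", SimpleNamespace(grad=[0.1, 0.2, 0.3])),
    ("_t", SimpleNamespace(grad=None)),
]

print(params_without_grad(fake_params))
```

A parameter reported here took no part in the graph that produced the loss, which matches the case of a rasterizer that never receives the time stamp.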
The issues I am currently facing are as follows:
Traceback (most recent call last):
File "train.py", line 507, in <module>
args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
File "train.py", line 145, in training
render_t_1 = flow_render(viewpoint_cam, gaussians, pipe, background)
File "/home/xzc/code/gsflow/render_flow.py", line 165, in flow_render
rays_o, rays_d = viewpoint_camera.get_rays()
File "/home/xzc/code/gsflow/scene/cameras.py", line 80, in get_rays
c2w = torch.linalg.inv(self.world_view_transform.transpose(0, 1))
torch._C._LinAlgError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf(handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`. This error may appear if the input matrix contains NaN.
This error typically occurs around iteration 1000.
Runtime Errors:
File "train.py", line 507, in <module>
args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
File "train.py", line 261, in training
batch_viewspace_point_grad[visibility_filter] = batch_viewspace_point_grad[visibility_filter] * batch_size / visibility_count[visibility_filter]
RuntimeError: numel: integer multiplication overflow
or
RuntimeError: CUDA error: an illegal memory access was encountered
These errors usually occur between iterations 900 and 1000.
PSNR Performance: The PSNR is currently lower than that of RT-4DGS at the same training stage, and the training time is twice as long.
I am unsure whether the current implementation is correct and would greatly appreciate your guidance.