Zerg-Overmind / GaussianFlow

GaussianFlow: Splatting Gaussian Dynamics for 4D Content Creation
https://zerg-overmind.github.io/GaussianFlow.github.io/

Bugs and errors during training #13

Open xi-zc opened 3 weeks ago

xi-zc commented 3 weeks ago

Thank you for your outstanding work. I have attempted to apply the method described in your paper to RT-4DGS, specifically in the following segment:

```python
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
image, viewspace_point_tensor, visibility_filter, radii = render_pkg["render"], render_pkg["viewspace_points"], render_pkg["visibility_filter"], render_pkg["radii"]
depth = render_pkg["depth"]
alpha = render_pkg["alpha"]

Ll1 = l1_loss(image, gt_image)
Lssim = 1.0 - ssim(image, gt_image)
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * Lssim

### Optical Flow Loss ###
if opt.lambda_flow > 0:
    start_time = time.time()

    # We detach the variables related to t_1 in the GaussianFlow calculation, so that the backward pass
    # only updates the variables at t_2 and leaves the t_1 variables unchanged, as they were already
    # updated at t_1 - 1 using the same logic. This accelerates training by reducing the number of
    # variables to update. Not detaching the t_1 variables does not degrade performance, but it does
    # slow down training.

    # Gaussian parameters at t_1
    render_t_1 = flow_render(viewpoint_cam, gaussians, pipe, background)
    proj_2D_t_1 = render_t_1["proj_2D"]
    gs_per_pixel = render_t_1["gs_per_pixel"].long()
    weight_per_gs_pixel = render_t_1["weight_per_gs_pixel"]
    x_mu = render_t_1["x_mu"]
    cov2D_inv_t_1 = render_t_1["conic_2D"].detach()

    if 1 in viewpoint_cam.apflame:
        viewpoint_cam_t2 = training_dataset.viewpoint_stack[uid_dict[viewpoint_cam.apflame[1]]].cuda()
        render_t_2 = flow_render(viewpoint_cam_t2, gaussians, pipe, background)

        # Gaussian parameters at t_2
        proj_2D_t_2 = render_t_2["proj_2D"]
        cov2D_inv_t_2 = render_t_2["conic_2D"]
        cov2D_t_2 = render_t_2["conic_2D_inv"]

        # Assemble the symmetric 2x2 covariance (t_2) and inverse covariance (t_1) matrices
        cov2D_t_2_mtx = torch.stack([torch.cat([cov2D_t_2[:, 0:1], cov2D_t_2[:, 1:2]], dim=-1),
                                     torch.cat([cov2D_t_2[:, 1:2], cov2D_t_2[:, 2:3]], dim=-1)], dim=-2).cuda()
        cov2D_inv_t_1_mtx = torch.stack([torch.cat([cov2D_inv_t_1[:, 0:1], cov2D_inv_t_1[:, 1:2]], dim=-1),
                                         torch.cat([cov2D_inv_t_1[:, 1:2], cov2D_inv_t_1[:, 2:3]], dim=-1)], dim=-2).cuda()

        # B_t_2
        U_t_2, S_t_2, V_t_2 = torch.svd(cov2D_t_2_mtx)
        B_t_2 = torch.bmm(torch.bmm(U_t_2, torch.diag_embed(S_t_2)**(1/2)), V_t_2.transpose(1, 2))

        # B_t_1^(-1)
        U_inv_t_1, S_inv_t_1, V_inv_t_1 = torch.svd(cov2D_inv_t_1_mtx)
        B_inv_t_1 = torch.bmm(torch.bmm(U_inv_t_1, torch.diag_embed(S_inv_t_1)**(1/2)), V_inv_t_1.transpose(1, 2))

        # Calculate B_t_2 * B_t_1^(-1)
        B_t_2_B_inv_t_1 = torch.bmm(B_t_2, B_inv_t_1)

        # Isotropic version of GaussianFlow
        # predicted_flow_by_gs = (proj_2D_next[gs_per_pixel] - proj_2D[gs_per_pixel].detach()) * weights.detach()

        # Full formulation of GaussianFlow
        cov_multi = (B_t_2_B_inv_t_1[gs_per_pixel] @ x_mu.permute(0, 2, 3, 1).unsqueeze(-1).detach()).squeeze()
        predicted_flow_by_gs = (cov_multi + proj_2D_t_2[gs_per_pixel] - proj_2D_t_1[gs_per_pixel].detach()
                                - x_mu.permute(0, 2, 3, 1).detach()) * weight_per_gs_pixel.unsqueeze(-1).detach()

        # Flow supervision loss
        flow_thresh = 0.1  # 0.1 or another value, to filter out noise in the pre-computed optical flow used as pseudo GT
        if 1 in optical_flows:
            optical_flows_next = optical_flows[1].cuda()
        large_motion_msk = torch.norm(optical_flows_next, p=2, dim=-1) >= flow_thresh
        Lflow = torch.norm((optical_flows_next - predicted_flow_by_gs.sum(0))[large_motion_msk], p=2, dim=-1).mean()
        loss = loss + opt.lambda_flow * Lflow  # flow weight can be adjusted (e.g., 1, 0.1)
### End Optical Flow Loss ###
```

As the code illustrates, `render` calls `rasterize_gaussians` from RT-4DGS, while `flow_render` calls `rasterize_gaussians` from your provided implementation. This distinction is necessary because the `rasterize_gaussians` in your implementation does not pass a `time_stamp`, which leads to `gaussians._t.grad` being `None` and hence prevents backpropagation.
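For reference, my understanding of what this snippet should compute, following the paper's full formulation, is the splatted per-pixel flow

```math
\mathrm{flow}(x) \;=\; \sum_{i} w_i \left[\, \big(\mu_{i,t_2} + B_{i,t_2}\, B_{i,t_1}^{-1}\,(x - \mu_{i,t_1})\big) \;-\; x \,\right]
```

where $\mu_{i,t}$ is the projected 2D mean of Gaussian $i$ at time $t$, $w_i$ is its (detached) splatting weight at pixel $x$, and $B_{i,t}$ is the square root of the projected 2D covariance obtained via SVD (the code builds $B_{i,t_1}^{-1}$ directly from the conic, i.e. the inverse covariance). This corresponds to the `cov_multi + proj_2D_t_2 - proj_2D_t_1 - x_mu` term above, since `x_mu` holds $x - \mu_{i,t_1}$. Please correct me if I have misread the formulation.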

The issues I am currently facing are as follows:

The error below typically occurs around iteration 1000:

```
Traceback (most recent call last):
  File "train.py", line 507, in <module>
    args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
  File "train.py", line 145, in training
    render_t_1 = flow_render(viewpoint_cam, gaussians, pipe, background)
  File "/home/xzc/code/gsflow/render_flow.py", line 165, in flow_render
    rays_o, rays_d = viewpoint_camera.get_rays()
  File "/home/xzc/code/gsflow/scene/cameras.py", line 80, in get_rays
    c2w = torch.linalg.inv(self.world_view_transform.transpose(0, 1))
torch._C._LinAlgError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf(handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`. This error may appear if the input matrix contains NaN.
```

  1. Runtime errors: the following errors usually occur between iterations 900 and 1000:

     ```
     File "train.py", line 507, in <module>
       args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
     File "train.py", line 261, in training
       batch_viewspace_point_grad[visibility_filter] = batch_viewspace_point_grad[visibility_filter] * batch_size / visibility_count[visibility_filter]
     RuntimeError: numel: integer multiplication overflow
     ```

     or

     ```
     RuntimeError: CUDA error: an illegal memory access was encountered
     ```

  2. PSNR performance: the PSNR is currently lower than that of RT-4DGS at the same training stage, and the training time is twice as long.

I am unsure whether the current implementation is correct and would greatly appreciate your guidance.
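In case it helps with debugging the NaN, one way to narrow down which tensor first goes non-finite is a small guard placed just before `loss.backward()`. This is only a hypothetical helper (not from either repo), and the tensor names in the usage comments are the ones from the snippet above:

```python
import torch

def assert_finite(name, tensor):
    # Hypothetical debugging helper: fail fast when a tensor contains NaN/Inf,
    # so the offending quantity is identified before cusolver sees it.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} is non-finite at this iteration")

# Example usage inside the training loop, before loss.backward():
#   assert_finite("conic_2D (t_1)", cov2D_inv_t_1)
#   assert_finite("conic_2D_inv (t_2)", cov2D_t_2)
#   assert_finite("predicted_flow_by_gs", predicted_flow_by_gs)
#   assert_finite("Lflow", Lflow)

# torch.autograd.set_detect_anomaly(True) can also pinpoint the op whose
# backward pass first produces NaNs, at the cost of much slower training.
```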

Zerg-Overmind commented 3 weeks ago

Hi, thank you for your interest. The CUDA code in this repo is for general 3DGS and not specifically for RT-4DGS, which differs slightly from other 3DGS + deformation-field designs (as you mentioned, RT-4DGS has an additional time_stamp variable). Due to policy issues and other ongoing submissions, I cannot release the CUDA code specialized for RT-4DGS that we used in our paper experiments at this time. However, it is fairly straightforward to adapt the code here to RT-4DGS.

xi-zc commented 2 weeks ago

Thank you very much for your guidance and suggestions. I have been working on implementing GaussianFlow in RT-4DGS. Currently, the training speed is approximately the same as RT-4DGS, but the PSNR is lower. Additionally, the program consistently encounters errors at iteration = 600, with the following error messages:

  1. Exception has occurred: RuntimeError (note: the full exception trace is shown, but execution is paused at `_run_module_as_main`):

     ```
     RuntimeError: CUDA error: an illegal memory access was encountered
       File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
       File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
         self, gradient, retain_graph, create_graph, inputs=inputs
       File "/home/xzc/code/gsflow/train.py", line 227, in training
         loss.backward()
       File "/home/xzc/code/gsflow/train.py", line 479, in <module>
         args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
       File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 85, in _run_code
         exec(code, run_globals)
       File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 193, in _run_module_as_main (current frame)
         "__main__", mod_spec)
     ```

  2. Exception has occurred: _LinAlgError (note: the full exception trace is shown, but execution is paused at `_run_module_as_main`):

     ```
     torch._C._LinAlgError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf(handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`. This error may appear if the input matrix contains NaN.
       File "/home/xzc/code/gsflow/scene/cameras.py", line 79, in get_rays
         c2w = torch.linalg.inv(self.world_view_transform.transpose(0, 1))
       File "/home/xzc/code/gsflow/gaussian_renderer/__init__.py", line 179, in render
         rays_o, rays_d = viewpoint_camera.get_rays()
       File "/home/xzc/code/gsflow/train.py", line 112, in training
         render_pkg = render(viewpoint_cam, gaussians, pipe, background)
       File "/home/xzc/code/gsflow/train.py", line 479, in <module>
         args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
       File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 85, in _run_code
         exec(code, run_globals)
       File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 193, in _run_module_as_main (current frame)
         "__main__", mod_spec)
     ```

I have also sent a more detailed email to your inbox and look forward to your response.

xi-zc commented 2 weeks ago

Thank you very much for your response. Following your guidance, I removed the following lines of code:

```cuda
for (int ch_var = 0; ch_var < 20; ch_var++)
{
    // atomicAdd(&(dummy_gs_per_pixel[global_id * 20 + ch_var]), 0.0);
    atomicAdd(&(dummy_weight_per_gs_pixel[global_id * 20 + ch_var]), alpha * T);
}
```

Removing these lines resolved the issue encountered at iteration = 600. However, I am still encountering the following problem at iteration = 1000:

Exception has occurred: OutOfMemoryError (note: the full exception trace is shown, but execution is paused at `_run_module_as_main`):

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.79 GiB (GPU 0; 23.65 GiB total capacity; 15.47 GiB already allocated; 2.07 GiB free; 20.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 441, in cat_tensors_to_optimizer
    stored_state["exp_avg_sq"] = torch.cat((stored_state["exp_avg_sq"], torch.zeros_like(extension_tensor)), dim=0)
  File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 476, in densification_postfix
    optimizable_tensors = self.cat_tensors_to_optimizer(d)
  File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 563, in densify_and_clone
    self.densification_postfix(new_xyz, new_features_dc, new_features_rest, new_opacities, new_scaling, new_rotation, new_t, new_scaling_t, new_rotation_r)
  File "/home/xzc/code/gsflow/scene/gaussian_model.py", line 575, in densify_and_prune
    self.densify_and_clone(grads, max_grad, extent, grads_t, max_grad_t)
  File "/home/xzc/code/gsflow/train.py", line 305, in training
    gaussians.densify_and_prune(opt.densify_grad_threshold, opt.thresh_opa_prune, scene.cameras_extent, size_threshold, opt.densify_grad_t_threshold)
  File "/home/xzc/code/gsflow/train.py", line 480, in <module>
    args.gaussian_dim, args.time_duration, args.num_pts, args.num_pts_ratio, args.rot_4d, args.force_sh_3d, args.batch_size)
  File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xzc/anaconda3/envs/4dgs/lib/python3.7/runpy.py", line 193, in _run_module_as_main (current frame)
    "__main__", mod_spec)
```

I am using an RTX 4090 GPU. To address the issue of insufficient memory, I modified the following function, but the memory shortage problem persists.

    def cat_tensors_to_optimizer(self, tensors_dict):
        optimizable_tensors = {}
        for group in self.optimizer.param_groups:
            assert len(group["params"]) == 1
            extension_tensor = tensors_dict[group["name"]]
            stored_state = self.optimizer.state.get(group['params'][0], None)
            if stored_state is not None:
                torch.cuda.empty_cache()
                new_params = torch.cat((group["params"][0], extension_tensor), dim=0)
                stored_state["exp_avg"] = torch.cat((stored_state["exp_avg"], torch.zeros_like(extension_tensor)), dim=0)
                stored_state["exp_avg_sq"] = torch.cat((stored_state["exp_avg_sq"], torch.zeros_like(extension_tensor)), dim=0)
                del self.optimizer.state[group['params'][0]]
                group["params"][0] = nn.Parameter(new_params.requires_grad_(True))
                self.optimizer.state[group['params'][0]] = stored_state
                optimizable_tensors[group["name"]] = group["params"][0]
            else:
                torch.cuda.empty_cache()
                group["params"][0] = nn.Parameter(torch.cat((group["params"][0], extension_tensor), dim=0).requires_grad_(True))
                optimizable_tensors[group["name"]] = group["params"][0]
            torch.cuda.empty_cache()
        return optimizable_tensors

I look forward to your further guidance. Thank you very much.

xi-zc commented 2 weeks ago

The previous issue has been resolved. The cause of the out-of-memory problem was the continuous growth in the number of Gaussians, which exceeded 3,000,000 by iteration 3000 and exhausted GPU memory. I have addressed this by raising `densify_grad_threshold` to 0.35, thereby limiting the number of Gaussians to around 1,000,000. Could you please share the parameters you used for RT-4DGS, including `flow_thresh`, `flow_weight`, and the other settings in `config/dynerf/*.yaml`? Additionally, did you make any modifications to the optimizer or scheduler? I appreciate your guidance and assistance.
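In case it is useful to others hitting the same memory limit, an alternative to raising `densify_grad_threshold` is to simply stop densifying once a point budget is reached. A minimal sketch, where the budget constant is my own choice and `opt.densification_interval` / `gaussians.get_xyz` are assumed from the standard 3DGS training loop rather than an existing RT-4DGS option:

```python
# Hypothetical cap on the number of Gaussians: skip densification once the
# model already holds more points than the GPU can comfortably optimize.
MAX_NUM_GAUSSIANS = 1_000_000  # assumed budget for a 24 GB card

if iteration % opt.densification_interval == 0 and gaussians.get_xyz.shape[0] < MAX_NUM_GAUSSIANS:
    gaussians.densify_and_prune(opt.densify_grad_threshold, opt.thresh_opa_prune,
                                scene.cameras_extent, size_threshold,
                                opt.densify_grad_t_threshold)
```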