Xharlie / pointnerf

Point-NeRF: Point-based Neural Radiance Fields

Error on inference without per-scene optimization #1

Open lukemelas opened 2 years ago

lukemelas commented 2 years ago

Hello authors, thank you very much for your paper and for releasing code! I am very excited about this work.

I am trying to run inference on a model that has been trained for point initialization (not with per-scene optimization). The README indicates that it should be possible to perform inference on DTU with bash dev_scripts/dtu_test_inf/inftest_scan8.sh.

When I run this script, I get an error:

File "/path/to/pointnerf/run/../models/neural_points_volumetric_model.py", line 270, in forward
    decoded_features, ray_valid, weight, conf_coefficient = self.aggregator(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, grid_vox_sz)
...
File "/path/to/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am using CUDA 11.1 and I believe that everything CUDA-related is installed correctly.

I was also confused by something else: this script calls train_ft.py, not test.py as I would have expected. In fact, there is not a single script in dev_scripts that calls test.py. Am I misunderstanding something here, or is test.py actually never used?

Xharlie commented 2 years ago

I'm not sure whether this error actually comes from pycuda: in CUDA, when an upstream operation fails, the error is often reported later at an unrelated call. I have tried this release on multiple machines and it works well. Could anyone else who sees this error post it here as well? That will help the debugging.
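
For anyone trying to localize it: one generic trick (just a sketch, not code from this repo) is to make CUDA error reporting synchronous, so the stack trace points at the kernel that actually failed instead of a later unrelated call like the linear layer above:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is first initialized

import torch

def run_checked(step_name, fn, *args, **kwargs):
    # Run one step and synchronize, so any asynchronous CUDA failure is
    # attributed to this step instead of whatever runs next.
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    print(f"[ok] {step_name}")
    return out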

lukemelas commented 2 years ago

Hello, thanks for the response!

I've switched to a new machine and installed pycuda from scratch. I'm now getting a different error, which appears to be an index error:

xyz_residual torch.Size([12867612])
min_idx torch.Size([554896])
after voxelize: torch.Size([554896, 3]) torch.Size([554896, 1])
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [123,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index <
sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [123,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index <
sizes[i] && "index out of bounds"` failed.
...

I traced this error back to the source, which is on line 143: https://github.com/Xharlie/pointnerf/blob/master/run/train_ft.py#L143

Here is the relevant code block:

        if opt.vox_res > 0:
            xyz_world_all, sparse_grid_idx, sampled_pnt_idx = mvs_utils.construct_vox_points_closest(xyz_world_all.cuda() if len(xyz_world_all) < 99999999 else xyz_world_all[::(len(xyz_world_all)//99999999+1),...].cuda(), opt.vox_res)
            points_vid = points_vid[sampled_pnt_idx,:]

Here, I find that the returned tensor sampled_pnt_idx has a maximum value equal to the size of points_vid; since valid indices only go up to size - 1, this results in the index error.

Here's my debugger output:

(Pdb++) sampled_pnt_idx.max()
tensor(12867603, device='cuda:0')
(Pdb++) points_vid.shape
torch.Size([12867603, 1])
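
For what it's worth, a small guard like the one below (purely a debugging sketch, not a proposed fix, since the other tensors returned by construct_vox_points_closest would also need to stay consistent) makes the off-by-one explicit:

# Hypothetical debugging guard around the indexing in train_ft.py,
# using the names from the snippet above.
num_points = points_vid.shape[0]
assert int(sampled_pnt_idx.max()) < num_points, (
    f"sampled_pnt_idx.max()={int(sampled_pnt_idx.max())} is out of range "
    f"for points_vid with {num_points} rows"
)
# Or, to keep going while investigating, drop the out-of-range entries:
valid_mask = sampled_pnt_idx < num_points
points_vid = points_vid[sampled_pnt_idx[valid_mask], :]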

Your help would be greatly appreciated!

Best, Luke

Xharlie commented 2 years ago

Hi, I just added torch_scatter to the installation instructions; I guess you have already installed it? This step happens earlier than your first error, so maybe your first machine can get past it. Have you checked the difference between the two machines?

lukemelas commented 2 years ago

Thanks for the quick response. Yes, I've installed torch_scatter. And unfortunately I'm still getting the other error on the other machine -- that machine has had CUDA problems in the past, so it's probably a CUDA thing there.

LeeHW-THU commented 2 years ago

@Xharlie @lukemelas Hi, I'm having the same problem when running dev_scripts/dtu_test_inf/inftest_scan8.sh; is there a solution yet? RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc). I am using PyTorch 1.10.1 + CUDA 11.3 and I believe that everything CUDA-related is installed correctly. I noticed you are using CUDA 10.2; is this a problem caused by CUDA 11? I'm using an RTX 3060 and can only use CUDA 11...
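
For reference, here is the kind of quick environment check I run (plain PyTorch calls, nothing project-specific) to confirm the installed build actually targets my GPU; an RTX 3060 is compute capability 8.6, so it needs a CUDA 11.x build of PyTorch:

import torch

print(torch.__version__, torch.version.cuda)   # PyTorch build and the CUDA toolkit it was built with
print(torch.cuda.get_device_name(0))           # e.g. "NVIDIA GeForce RTX 3060"
print(torch.cuda.get_device_capability(0))     # (8, 6) for consumer Ampere cards
print(torch.cuda.get_arch_list())              # the build's arch list should include 'sm_86'

# A plain matmul exercises the same cuBLAS path that fails above.
a = torch.randn(1024, 1024, device="cuda")
print(torch.mm(a, a).sum().item())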

Xharlie commented 2 years ago

Hi, does this happen only in the inftest_scan* scripts, or also in other scripts? It looks like other people can at least run the per-scene optimization scripts.

LeeHW-THU commented 2 years ago

Thanks for the quick response. In my runs, I have only seen this problem in the inftest_scan* scripts. By the way, when I run bash dev_scripts/w_n360/chair.sh, I get the following output and I am not sure whether it is correct. It looks like a PyCUDA error, but the program doesn't break. Is this error from the video generation step?

`--------------------------------Finish Test Rendering--------------------------------                                                                                                                      
test id_list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 10
0, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 
141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181
, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199]                                                                                                                
../checkpoints/nerfsynth/chair/test_200000/images ../checkpoints/nerfsynth/chair/test_200000/images ../checkpoints/nerfsynth/chair/test_200000/images                                                      
step-%04d-coarse_raycolor.png step-%04d-gt_image.png                                                                                                                                                       
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]                                                                                                                                    
Loading model from: /home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth                                                                                              
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]                                                                                                                                     
Loading model from: /home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/lpips/weights/v0.1/vgg.pth                                                                                               
/home/lee/Desktop/pointNeRF_lab/pointnerf-master/pointnerf/run/../run/evaluate.py:11: FutureWarning: `multichannel` is a deprecated argument name for `structural_similarity`. It will be removed in versio
n 1.0. Please use `channel_axis` instead.                                                                                                                                                                  
  return structural_similarity(gt, img, win_size=win_size, multichannel=multichannel)                                                                                                                      
200 images computed                                                                                                                                                                                        
psnr: 35.615182                                                                                                                                                                                            
ssim: 0.991588                                                                                                                                                                                             
lpips: 0.009299                                                                                                                                                                                            
vgglpips: 0.021933                                                                                                                                                                                         
rmse: 0.016793                                                                                                                                                                                             

--------------------------------Finish Evaluation--------------------------------                                                                                                                          
--------------------------------Finish generating vid--------------------------------                                                                                                                      
test at iter 200000, PSNR: 35.63845443725586, best_PSNR: 35.63845443725586, best_iter: 200000                                                                                                              
end loading                                                                                                                                                                                                
end loading                                                                                                                                                                                                
-------------------------------------------------------------------                                                                                                                                        
PyCUDA ERROR: The context stack was not empty upon module cleanup.                                                                                                                                         
-------------------------------------------------------------------                                                                                                                                        
A context was still active when the context stack was being                                                                                                                                                
cleaned up. At this point in our execution, CUDA may already                                                                                                                                               
have been deinitialized, so there is no way we can finish                                                                                                                                                  
cleanly. The program will be aborted now.                                                                                                                                                                  
Use Context.pop() to avoid this problem.                                                                                                                                                                   
-------------------------------------------------------------------dev_scripts/w_n360/chair.sh: line 162: 1084863 Aborted                 (core dumped) python3 train_ft.py --name $name --scan $scan --data_root $data_root --dataset_name $dataset_name --model $model --whi
ch_render_func $which_render_func --which_blend_func $which_blend_func --out_channels $out_channels --num_pos_freqs $num_pos_freqs --num_viewdir_freqs $num_viewdir_freqs --random_sample $random_sample --
random_sample_size $random_sample_size --batch_size $batch_size --maximum_step $maximum_step --plr $plr --lr $lr --lr_policy $lr_policy --lr_decay_iters $lr_decay_iters --lr_decay_exp $lr_decay_exp --gpu
_ids $gpu_ids --checkpoints_dir $checkpoints_dir --save_iter_freq $save_iter_freq --niter $niter --niter_decay $niter_decay --n_threads $n_threads --pin_data_in_memory $pin_data_in_memory --train_and_tes
t $train_and_test --test_num $test_num --test_freq $test_freq --test_num_step $test_num_step --test_color_loss_items $test_color_loss_items --print_freq $print_freq --bg_color $bg_color --split $split --
which_ray_generation $which_ray_generation --near_plane $near_plane --far_plane $far_plane --dir_norm $dir_norm --which_tonemap_func $which_tonemap_func --load_points $load_points --resume_dir $resume_di
r --resume_iter $resume_iter --feature_init_method $feature_init_method --agg_axis_weight $agg_axis_weight --agg_distance_kernel $agg_distance_kernel --radius_limit_scale $radius_limit_scale --depth_limi
t_scale $depth_limit_scale --vscale $vscale --kernel_size $kernel_size --SR $SR --K $K --P $P --NN $NN --agg_feat_xyz_mode $agg_feat_xyz_mode --agg_alpha_xyz_mode $agg_alpha_xyz_mode --agg_color_xyz_mode
 $agg_color_xyz_mode --save_point_freq $save_point_freq --raydist_mode_unit $raydist_mode_unit --agg_dist_pers $agg_dist_pers --agg_intrp_order $agg_intrp_order --shading_feature_mlp_layer0 $shading_feat
ure_mlp_layer0 --shading_feature_mlp_layer1 $shading_feature_mlp_layer1 --shading_feature_mlp_layer2 $shading_feature_mlp_layer2 --shading_feature_mlp_layer3 $shading_feature_mlp_layer3 --shading_feature
_num $shading_feature_num --dist_xyz_freq $dist_xyz_freq --shpnt_jitter $shpnt_jitter --shading_alpha_mlp_layer $shading_alpha_mlp_layer --shading_color_mlp_layer $shading_color_mlp_layer --which_agg_mod
el $which_agg_model --color_loss_weights $color_loss_weights --num_feat_freqs $num_feat_freqs --dist_xyz_deno $dist_xyz_deno --apply_pnt_mask $apply_pnt_mask --point_features_dim $point_features_dim --co
lor_loss_items $color_loss_items --feedforward $feedforward --trgt_id $trgt_id --depth_vid $depth_vid --ref_vid $ref_vid --manual_depth_view $manual_depth_view --pre_d_est $pre_d_est --depth_occ $depth_o
cc --manual_std_depth $manual_std_depth --visual_items $visual_items --appr_feature_str0 $appr_feature_str0 --init_view_num $init_view_num --feat_grad $feat_grad --conf_grad $conf_grad --dir_grad $dir_gr
ad --color_grad $color_grad --depth_conf_thresh $depth_conf_thresh --bgmodel $bgmodel --vox_res $vox_res --act_type $act_type --geo_cnsst_num $geo_cnsst_num --point_conf_mode $point_conf_mode --point_dir
_mode $point_dir_mode --point_color_mode $point_color_mode --normview $normview --prune_thresh $prune_thresh --prune_iter $prune_iter --full_comb $full_comb --sparse_loss_weight $sparse_loss_weight --def
ault_conf $default_conf --prob_freq $prob_freq --prob_num_step $prob_num_step --prob_thresh $prob_thresh --prob_mul $prob_mul --prob_kernel_size $prob_kernel_size --prob_tiers $prob_tiers --alpha_range $
alpha_range --ranges $ranges --vid $vid --vsize $vsize --wcoord_query $wcoord_query --max_o $max_o --zero_one_loss_items $zero_one_loss_items --zero_one_loss_weights $zero_one_loss_weights --prune_max_it
er $prune_max_iter --far_thresh $far_thresh --debug                                                                                                                                                        
opt.color_loss_items  ['ray_masked_coarse_raycolor', 'ray_miss_coarse_raycolor', 'coarse_raycolor']                                                                                                        
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                                                                                                                                             
Debug Mode                                                                                                                                                                                                 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-
or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.                                                 
  arrays = [asanyarray(arr) for arr in arrays]                                                                                                                                                             
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered i
nternally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)                                                                                                                                               
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]                                                                                                                                     
dataset total: train 100                                                                                                                                                                                   
dataset [NerfSynthFtDataset] was created                                                                                                                                                                   
../checkpoints/nerfsynth/chair/*_net_ray_marching.pth                                                                                                                                                      
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                                                                                                                                             
Continue training from 200000 epoch                                                                                                                                                                        
Iter: 200000                                                                                                                                                                                               
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                                                                                                                                             
opt.act_type!!!!!!!!! LeakyReLU                                                                                                                                                                            
self.points_embeding torch.Size([1, 694340, 32])                                                                                                                                                           
querier device cuda:0 0                                                                                                                                                                                    
neural_params [('module.neural_points.xyz', torch.Size([694340, 3]), False), ('module.neural_points.points_embeding', torch.Size([1, 694340, 32]), True), ('module.neural_points.points_conf', torch.Size([
1, 694340, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 694340, 3]), True), ('module.neural_points.points_color', torch.Size([1, 694340, 3]), True), ('module.neural_points.Rw2c', torch.
Size([3, 3]), False)]                                                                                                                                                                                      
model [MvsPointsVolumetricModel] was created                                                                                                                                                               
opt.resume_iter!!!!!!!!! 200000                                                                                                                                                                            
loading ray_marching  from  ../checkpoints/nerfsynth/chair/200000_net_ray_marching.pth                                                                                                                     
------------------- Networks -------------------                                                                                                                                                           
[Network ray_marching] Total number of parameters: 29.504M                                                                                                                                                 
------------------------------------------------                                                                                                                                                           
# training images = 100                                                                                                                                                                                    
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, y
ou should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more det
ails at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate                                                                                                                             
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
chair: End of stepts 200000 / 200000     Time Taken: 0.2245779037475586 sec
saving model (chair, epoch 1999, total_steps 200000)
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-
or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  arrays = [asanyarray(arr) for arr in arrays]
dataset total: test 200
dataset [NerfSynthFtDataset] was created
full datasets test:
-----------------------------------Testing-----------------------------------
Xharlie commented 2 years ago

Yeah, I have an implementation issue with pycuda, but it's not an error; the pycuda device-handle release problem is confusing to me. It seems your chair results are even better than mine. The pycuda/PyTorch integration is tricky. I think I have seen the cublasSgemm error before, but I can't reproduce it on any of my machines now. Stack Overflow suggests adding line 20 in the query_point_indices.py file, but I found no need for it, so I commented it out. You can also uncomment torch.cuda.synchronize() in its query_grid_point_index function to see at which step pycuda breaks.
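
For the context-stack warning specifically, what PyCUDA is asking for is roughly the following cleanup (only a sketch; the querier in this repo may create and manage its context differently):

import atexit
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()  # assuming the querier makes its own context like this

def _release_context():
    # This is what "Use Context.pop() to avoid this problem." refers to:
    # pop the context off the stack before the interpreter tears CUDA down.
    ctx.pop()
    ctx.detach()

atexit.register(_release_context)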

LeeHW-THU commented 2 years ago

I have uncommented torch.cuda.synchronize() in the query_grid_point_index function, but it doesn't seem to solve the problem; is this normal? When I run the bash dev_scripts/w_n360/chair.sh command, its output is shown below:

`++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Debug Mode
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
dataset total: train 100
dataset [NerfSynthFtDataset] was created
../checkpoints/nerfsynth/drums/*_net_ray_marching.pth
-----------------------------------Generate Points-----------------------------------
loading model ../checkpoints/MVSNet/model_000014.ckpt
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! best
loading mvs  from  ../checkpoints/init/dtu_dgt_d012_img0123_conf_agg2_32_dirclr20/best_net_mvs.pth
------------------- Networks -------------------
[Network mvs] Total number of parameters: 0.382M
0%|          | 1/555 [00:01<13:52,  1.50s/it]
100%|██████████| 555/555 [03:31<00:00,  2.62it/s]
 0%|          | 0/555 [00:00<?, ?it/s]
100%|██████████| 555/555 [17:36<00:00,  1.90s/it]
xyz_world_all torch.Size([97493176, 3]) torch.Size([97493176, 1]) torch.Size([97493176])
%%%%%%%%%%%%%  getattr(dataset, spacemin, None) None
vishull_mask torch.Size([97493176])
alpha masking xyz_world_all torch.Size([34014530, 3]) torch.Size([34014530, 1])
xyz_residual torch.Size([34014530])
min_idx torch.Size([538111])
after voxelize: torch.Size([538111, 3]) torch.Size([538111, 1])

  0%|          | 0/555 [00:00<?, ?it/s]
100%|██████████| 555/555 [00:07<00:00, 72.08it/s]
self.model_names ['mvs']
opt.act_type!!!!!!!!! LeakyReLU
querier device cuda:0 0
no neural points as nn.Parameter
model [MvsPointsVolumetricModel] was created
neural_params [('module.neural_points.xyz', torch.Size([538111, 3]), False), ('module.neural_points.points_conf', torch.Size([1, 538111, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_color', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_embeding', torch.Size([1, 538111, 32]), True), ('module.neural_points.Rw2c', torch.Size([3, 3]), False)]
opt.resume_iter!!!!!!!!! best
loading ray_marching  from  ../checkpoints/init/dtu_dgt_d012_img0123_conf_agg2_32_dirclr20/best_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 22.942M
------------------------------------------------
# training images = 100
saving model (drums, epoch 0, total_steps 0)

Traceback (most recent call last):
  File "train_ft.py", line 1084, in <module>
    main()
  File "train_ft.py", line 940, in main
    model.optimize_parameters(total_steps=total_steps)
  File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 215, in optimize_parameters
    self.forward()
  File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/mvs_points_volumetric_model.py", line 126, in forward
    self.output = self.run_network_models()
  File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 85, in run_network_models
    return self.fill_invalid(self.net_ray_marching(**self.input), self.input)
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 270, in forward
    decoded_features, ray_valid, weight, conf_coefficient = self.aggregator(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, grid_vox_sz)
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/aggregators/point_aggregators.py", line 811, in forward
    output, _ = getattr(self, self.which_agg_model, None)(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, weight * conf_coefficient, pnt_mask_flat, pts, viewdirs, total_len, ray_valid, in_shape, dists)
  File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/aggregators/point_aggregators.py", line 605, in viewmlp
    alpha = self.raw2out_density(self.alpha_branch(alpha_in))
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
end loading
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
dev_scripts/w_n360/drums.sh: line 164: 1184025 Aborted                 (core dumped) python3 train_ft.py --name $name --scan $scan --data_root $data_root --dataset_name $dataset_name --model $model --which_render_func $which_render_func --which_blend_func $which_blend_func --out_channels $out_channels --num_pos_freqs $num_pos_freqs --num_viewdir_freqs $num_viewdir_freqs --random_sample $random_sample --random_sample_size $random_sample_size --batch_size $batch_size --maximum_step $maximum_step --plr $plr --lr $lr --lr_policy $lr_policy --lr_decay_iters $lr_decay_iters --lr_decay_exp $lr_decay_exp --gpu_ids $gpu_ids --checkpoints_dir $checkpoints_dir --save_iter_freq $save_iter_freq --niter $niter --niter_decay $niter_decay --n_threads $n_threads --pin_data_in_memory $pin_data_in_memory --train_and_test $train_and_test --test_num $test_num --test_freq $test_freq --test_num_step $test_num_step --test_color_loss_items $test_color_loss_items --print_freq $print_freq --bg_color $bg_color --split $split --which_ray_generation $which_ray_generation --near_plane $near_plane --far_plane $far_plane --dir_norm $dir_norm --which_tonemap_func $which_tonemap_func --load_points $load_points --resume_dir $resume_dir --resume_iter $resume_iter --feature_init_method $feature_init_method --agg_axis_weight $agg_axis_weight --agg_distance_kernel $agg_distance_kernel --radius_limit_scale $radius_limit_scale --depth_limit_scale $depth_limit_scale --vscale $vscale --kernel_size $kernel_size --SR $SR --K $K --P $P --NN $NN --agg_feat_xyz_mode $agg_feat_xyz_mode --agg_alpha_xyz_mode $agg_alpha_xyz_mode --agg_color_xyz_mode $agg_color_xyz_mode --save_point_freq $save_point_freq --raydist_mode_unit $raydist_mode_unit --agg_dist_pers $agg_dist_pers --agg_intrp_order $agg_intrp_order --shading_feature_mlp_layer0 $shading_feature_mlp_layer0 --shading_feature_mlp_layer1 $shading_feature_mlp_layer1 --shading_feature_mlp_layer2 $shading_feature_mlp_layer2 --shading_feature_mlp_layer3 $shading_feature_mlp_layer3 --shading_feature_num $shading_feature_num --dist_xyz_freq $dist_xyz_freq --shpnt_jitter $shpnt_jitter --shading_alpha_mlp_layer $shading_alpha_mlp_layer --shading_color_mlp_layer $shading_color_mlp_layer --which_agg_model $which_agg_model --color_loss_weights $color_loss_weights --num_feat_freqs $num_feat_freqs --dist_xyz_deno $dist_xyz_deno --apply_pnt_mask $apply_pnt_mask --point_features_dim $point_features_dim --color_loss_items $color_loss_items --feedforward $feedforward --trgt_id $trgt_id --depth_vid $depth_vid --ref_vid $ref_vid --manual_depth_view $manual_depth_view --pre_d_est $pre_d_est --depth_occ $depth_occ --manual_std_depth $manual_std_depth --visual_items $visual_items --appr_feature_str0 $appr_feature_str0 --init_view_num $init_view_num --feat_grad $feat_grad --conf_grad $conf_grad --dir_grad $dir_grad --color_grad $color_grad --depth_conf_thresh $depth_conf_thresh --bgmodel $bgmodel --vox_res $vox_res --act_type $act_type --geo_cnsst_num $geo_cnsst_num --point_conf_mode $point_conf_mode --point_dir_mode $point_dir_mode --point_color_mode $point_color_mode --normview $normview --prune_thresh $prune_thresh --prune_iter $prune_iter --full_comb $full_comb --sparse_loss_weight $sparse_loss_weight --default_conf $default_conf --prob_freq $prob_freq --prob_num_step $prob_num_step --prob_thresh $prob_thresh --prob_mul $prob_mul --prob_kernel_size $prob_kernel_size --prob_tiers $prob_tiers --alpha_range $alpha_range --ranges $ranges --vid $vid --vsize $vsize --wcoord_query $wcoord_query --max_o $max_o 
--zero_one_loss_items $zero_one_loss_items --zero_one_loss_weights $zero_one_loss_weights --prune_max_iter $prune_max_iter --far_thresh $far_thresh --debug
opt.color_loss_items  ['ray_masked_coarse_raycolor', 'ray_miss_coarse_raycolor', 'coarse_raycolor']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Debug Mode
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
dataset total: train 100
dataset [NerfSynthFtDataset] was created
../checkpoints/nerfsynth/drums/*_net_ray_marching.pth
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Continue training from 0 epoch
Iter: 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
opt.act_type!!!!!!!!! LeakyReLU
self.points_embeding torch.Size([1, 538111, 32])
querier device cuda:0 0
neural_params [('module.neural_points.xyz', torch.Size([538111, 3]), False), ('module.neural_points.points_embeding', torch.Size([1, 538111, 32]), True), ('module.neural_points.points_conf', torch.Size([1, 538111, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_color', torch.Size([1, 538111, 3]), True), ('module.neural_points.Rw2c', torch.Size([3, 3]), False)]
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! 0
loading ray_marching  from  ../checkpoints/nerfsynth/drums/0_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 22.942M
------------------------------------------------
# training images = 100
saving model (drums, epoch 0, total_steps 0)
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  arrays = [asanyarray(arr) for arr in arrays]
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
optimizer 1, learning rate = 0.0005000
optimizer 2, learning rate = 0.0019998
End of iteration 40      Number of batches 40    Time taken: 7.17s
[Average Loss] total: 0.0422415882   ray_masked_coarse_raycolor: 0.0424671248   ray_masked_coarse_raycolor_psnr: 13.9493732452   ray_miss_coarse_raycolor: 0.0365000255   ray_miss_coarse_raycolor_psnr: inf   coarse_raycolor: 0.0117533728   coarse_raycolor_psnr: 19.4890022278   conf_coefficient: -2.2853794098   
optimizer 1, learning rate = 0.0004999
optimizer 2, learning rate = 0.0019996
End of iteration 80      Number of batches 40    Time taken: 6.70s
[Average Loss] total: 0.0286709573   ray_masked_coarse_raycolor: 0.0291084927   ray_masked_coarse_raycolor_psnr: 15.4500093460   ray_miss_coarse_raycolor: 0.0000000000   ray_miss_coarse_raycolor_psnr: inf   coarse_raycolor: 0.0080277314   coarse_raycolor_psnr: 21.0282936096   conf_coefficient: -4.4054217339   
optimizer 1, learning rate = 0.0004999
optimizer 2, learning rate = 0.0019994
End of iteration 120     Number of batches 40    Time taken: 7.11s
[Average Loss] total: 0.0262186974   ray_masked_coarse_raycolor: 0.0267978553   ray_masked_coarse_raycolor_psnr: 15.8921508789   ray_miss_coarse_raycolor: 0.0000593065   ray_miss_coarse_raycolor_psnr: inf   coarse_raycolor: 0.0075231828   coarse_raycolor_psnr: 21.3510837555   conf_coefficient: -5.8215517998   `
Xharlie commented 2 years ago

I'm not sure; it looks like a pycuda thing. But the optimization afterwards still seems to be working well.

LeeHW-THU commented 2 years ago

Thank you for your careful reply, I will ignore this PyCUDA message if it has no impact. Also, if you have time, could you update the description of each parser argument? Some of the 'help' seems to be a direct copy and paste. This will help us to explore PointNeRF more deeply and to make changes in the code. Thank you!

Xharlie commented 2 years ago

Sure, good suggestions

huiqing-su commented 2 years ago

Hi. Have you solved this problem yet? I have met the same one: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc). I am using PyTorch 1.10.0 + CUDA 11.3, and I don't know whether it was caused by PyTorch.

Xharlie commented 2 years ago

Hi, I think pycuda is the cause of this error. You can try building it from source.

huiqing-su commented 2 years ago

Thanks for your reply. When I replaced my graphics card and installed PyTorch==1.8.1 and CUDA==10.2, it worked well. By the way, I have another question: I want to use a dataset I collected myself for testing, but I don't know how to write the /data/dtu_configs/pairs.th used by your program. I am looking forward to your reply.

huiqing-su commented 2 years ago

Also, I really want to know how you get these self.pair_idx values:

> /home/pth-algo/Desktop/pointnerf-master/data/dtu_ft_dataset.py(110)initialize()
-> print("dtu_ft train id", self.pair_idx[0])
(Pdb) self.pair_idx
[[25, 21, 33, 22, 14, 15, 26, 30, 31, 35, 34, 43, 46, 29, 16, 36], [32, 24, 23, 44]]

Xharlie commented 2 years ago

Hi, these pairs are from MVSNeRF / MVSNet. I have no idea why they use these; you can ask them about it.

huiqing-su commented 2 years ago

Thanks for your quick reply. In MVSNeRF, I have found a "Pairs generation" section in "renderer.ipynb" that handles the pair generation.
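
For anyone else preparing a custom dataset: pairs.th is just a torch-serialized object, so a self-made one can be written along these lines. The key names below ('<scan>_train' / '<scan>_test') are an assumption; check how data/dtu_ft_dataset.py (or your own dataset class) indexes the loaded object before relying on them.

import torch

# Hypothetical pairs file for a custom scan called "myscan":
# lists of view indices used for training and for testing.
pairs = {
    "myscan_train": [25, 21, 33, 22, 14, 15, 26, 30, 31, 35, 34, 43, 46, 29, 16, 36],
    "myscan_test": [32, 24, 23, 44],
}
torch.save(pairs, "data/dtu_configs/my_pairs.th")

print(torch.load("data/dtu_configs/my_pairs.th")["myscan_train"])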

ttaa9 commented 2 years ago

I am also getting a similar error to the others above:

Traceback (most recent call last):
  File "train_ft.py", line 1094, in <module>
    main()
  File "train_ft.py", line 826, in main
    test(model, test_dataset, Visualizer(test_opt), test_opt, test_bg_info, test_steps=total_steps)
  File "train_ft.py", line 315, in test
    model.test()
  File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/mvs_points_volumetric_model.py", line 335, in test
    self.output = self.run_network_models()
  File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/neural_points_volumetric_model.py", line 88, in run_network_models
    return self.fill_invalid(self.net_ray_marching(**self.input), self.input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/neural_points_volumetric_model.py", line 419, in forward
    sample_ray_dirs, vsize, grid_vox_sz)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/aggregators/point_aggregators.py", line 824, in forward
    output, _ = getattr(self, self.which_agg_model, None)(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, weight * conf_coefficient, pnt_mask_flat, pts, viewdirs, total_len, ray_valid, in_shape, dists)
  File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/aggregators/point_aggregators.py", line 605, in viewmlp
    temp = self.color_branch(color_in)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I have tried with pytorch 1.10.0 + cuda 11.3, and also with pytorch 1.8.1 + cuda 10.2, but get the same error in the same place. I have also, in both cases, built and installed pycuda manually.

This only happens to me with the DTU scripts. I can run the w_n360 scripts just fine.

Any suggestions?

QiukuZ commented 2 years ago

> I have uncommented torch.cuda.synchronize() in the query_grid_point_index function, but it doesn't seem to solve the problem, is this normal? When I run the bash dev_scripts/w_n360/chair.sh command, its output is shown below: [log identical to the one posted above]

I also encountered the same problem. I tried changing the torch/CUDA version and changing machines, and I also checked the shape of the input, but nothing solved it. In my tests the error occurs randomly: re-running the same computation on sub-slices of the input usually succeeds. I therefore replaced the original code with the workaround below; it loses some speed, but it is guaranteed to keep running.

In models/aggregators/point_aggregators.py, the original code is:

color_output = self.raw2out_color(self.color_branch(color_in))

After:

try:
    color_output = self.raw2out_color(self.color_branch(color_in))
    cal_fail = False
except RuntimeError:
    cal_fail = True

while cal_fail:
    # pick a random chunk size; keep it >= 1 so range() never gets a zero step
    max_iter = max(1, int(color_in.shape[0] * torch.rand(1)))
    try:
        color_output_part = []
        # run the color branch chunk by chunk and stitch the results back together
        for i_start in range(0, color_in.shape[0], max_iter):
            i_end = min(color_in.shape[0], i_start + max_iter)
            color_output_part.append(self.raw2out_color(self.color_branch(color_in[i_start:i_end])))
        color_output = torch.cat(color_output_part, dim=0)
        cal_fail = False
    except RuntimeError:
        cal_fail = True
Since this error (cublasSgemm) occurs randomly, I think the same workaround can be applied at every place where it occurs. Of course, this is probably not the most elegant solution; I'm looking forward to a better one.
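
Since the same retry-in-chunks pattern is needed at several call sites, it could also be factored into one helper, roughly like this (a sketch only; the function and parameter names are placeholders, not repo code):

import torch

def forward_in_chunks(fn, x, chunk=65536, max_retries=8):
    # Apply fn to x along dim 0; if cuBLAS raises a RuntimeError, retry with
    # progressively smaller chunks instead of failing the whole iteration.
    for _ in range(max_retries):
        try:
            if chunk >= x.shape[0]:
                return fn(x)
            parts = [fn(x[i:i + chunk]) for i in range(0, x.shape[0], chunk)]
            return torch.cat(parts, dim=0)
        except RuntimeError:
            chunk = max(1, chunk // 2)
    raise RuntimeError("forward_in_chunks: still failing after retries")

# e.g. at the failing call site:
# color_output = forward_in_chunks(lambda t: self.raw2out_color(self.color_branch(t)), color_in)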

zhangchuanyi96 commented 2 years ago

I also ran into the CUBLAS_STATUS_EXECUTION_FAILED problem. I really need help.

Although I tried the workaround code posted above, the error only moved to another layer:

File "/home/zhangchuanyi/pointnerf/run/../models/aggregators/point_aggregators.py", line 844, in forward
    output, _ = getattr(self, self.which_agg_model, None)(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, weight * conf_coefficient, pnt_mask_flat, pts, viewdirs, total_len, ray_valid, in_shape, dists)
  File "/home/zhangchuanyi/pointnerf/run/../models/aggregators/point_aggregators.py", line 547, in viewmlp
    feat = self.block1(feat)
  File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
zhanghm1995 commented 1 year ago

@zhangchuanyi96 I still encounter this problem; have you solved it?