Open lukemelas opened 2 years ago
I'm not sure whether this error is caused by pycuda, since in CUDA an error in an upstream operation is usually only reported later. I have tried this release on multiple machines and it works well. Can anyone else who sees the same error post it here? It will help with debugging.
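Because CUDA kernels are launched asynchronously, the Python line named in a traceback is not always the one that actually failed. A minimal sketch of one common way to narrow this down, assuming a PyTorch-based script: set the `CUDA_LAUNCH_BLOCKING` environment variable before CUDA is initialized, so that kernel launches run synchronously and the error surfaces closer to its origin. The helper name below is hypothetical and not part of this repo.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Ask the CUDA runtime to run kernel launches synchronously.

    Must be called before the process initializes CUDA (for example,
    before the first torch.cuda call); afterwards it has no effect.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


if __name__ == "__main__":
    enable_blocking_cuda_launches()
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same setting can be applied from the shell by prefixing the command, e.g. `CUDA_LAUNCH_BLOCKING=1 bash dev_scripts/w_n360/chair.sh`. Runs will be slower, so this is only meant for debugging.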
Hello, thanks for the response!
I've switched to a new machine and installed pycuda from scratch. I'm now getting a different error, which appears to be an index error:
xyz_residual torch.Size([12867612])
min_idx torch.Size([554896])
after voxelize: torch.Size([554896, 3]) torch.Size([554896, 1])
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [123,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [123,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
I traced this error back to the source, which is on line 143: https://github.com/Xharlie/pointnerf/blob/master/run/train_ft.py#L143
Here is the relevant code block:
if opt.vox_res > 0:
    xyz_world_all, sparse_grid_idx, sampled_pnt_idx = mvs_utils.construct_vox_points_closest(xyz_world_all.cuda() if len(xyz_world_all) < 99999999 else xyz_world_all[::(len(xyz_world_all)//99999999+1),...].cuda(), opt.vox_res)
    points_vid = points_vid[sampled_pnt_idx,:]
Here, I find that the returned tensor sampled_pnt_idx has a maximum value equal to the size of points_vid, which results in the index error.
Here's my debugger output:
(Pdb++) sampled_pnt_idx.max()
tensor(12867603, device='cuda:0')
(Pdb++) points_vid.shape
torch.Size([12867603, 1])
Your help would be greatly appreciated!
Best, Luke
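A small sketch of a pre-check that would catch this condition on the host before the CUDA kernel asserts. It mirrors the rule in the assertion message (`index >= -size` and `index < size`). The function name is hypothetical, and `min_index=0` in the example call is an assumption, since only the maximum was printed in the debugger. Note also that the log reports `xyz_residual` with 12867612 elements while `points_vid` has 12867603 rows, so the two inputs may simply differ in length; that is an observation from the log, not a confirmed diagnosis.

```python
def check_index_bounds(min_index, max_index, size):
    """Raise IndexError unless every index in [min_index, max_index]
    can address a dimension of length `size` (negative indices allowed)."""
    if max_index >= size or min_index < -size:
        raise IndexError(
            f"index range [{min_index}, {max_index}] is out of bounds "
            f"for a dimension of size {size} (valid: {-size}..{size - 1})"
        )


# Values taken from the debugger session above; min_index=0 is assumed.
try:
    check_index_bounds(0, 12867603, 12867603)
except IndexError as err:
    print(err)

# With PyTorch tensors the call might look like this (not run here):
# check_index_bounds(int(sampled_pnt_idx.min()), int(sampled_pnt_idx.max()),
#                    points_vid.shape[0])
```

Running such a check right before `points_vid[sampled_pnt_idx,:]` gives a readable Python exception instead of a device-side assert.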
Hi, I just added a step to the installation instructions for installing torch_scatter; I guess you have already installed it? This step comes before your first error, so maybe your first machine gets past it? Have you checked what differs between the two machines?
Thanks for the quick response. Yes, I've installed torch_scatter. And unfortunately I'm still getting the other error on the other machine -- that machine has had CUDA problems in the past, so it's probably a CUDA thing there.
@Xharlie @lukemelas
Hi, I'm having the same problem when running 'dev_scripts/dtu_test_inf/inftest_scan8.sh'.
Is there a solution yet?
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
I am using PyTorch 1.10.1 + CUDA 11.3 and I believe that everything CUDA-related is installed correctly.
I noticed you are using CUDA 10.2. Is this problem caused by CUDA 11? I'm using an RTX 3060 and can only use CUDA 11...
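When comparing setups like this, it helps to post the exact versions PyTorch itself reports. The sketch below gathers them with standard `torch` calls; the function name is hypothetical. For context, the RTX 3060 is an Ampere card (compute capability 8.6), which is why CUDA 10.2 builds are not an option on it.

```python
def describe_cuda_env():
    """Collect PyTorch/CUDA version details useful for a bug report."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
        # Architectures this PyTorch binary was compiled for.
        info["arch_list"] = torch.cuda.get_arch_list()
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_env().items():
        print(f"{key}: {value}")
```

If the device's capability does not appear in `arch_list`, the installed PyTorch binary was not built for that GPU, which is one possible source of CUDA runtime failures.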
Hi, does it only happen with the inftest_scan* scripts, or also with other scripts? It looks like other people can at least run the per-scene optimization scripts.
Thanks for the quick response.
So far, I have only seen this problem with the inftest_scan* scripts.
By the way, when I run bash dev_scripts/w_n360/chair.sh, I get the following message and I am not sure whether this is correct. It looks like a PyCUDA error, but the program doesn't break. Is this an error from generating the vid?
--------------------------------Finish Test Rendering--------------------------------
test id_list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 10
0, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181
, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199]
../checkpoints/nerfsynth/chair/test_200000/images ../checkpoints/nerfsynth/chair/test_200000/images ../checkpoints/nerfsynth/chair/test_200000/images
step-%04d-coarse_raycolor.png step-%04d-gt_image.png
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
Loading model from: /home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/lpips/weights/v0.1/vgg.pth
/home/lee/Desktop/pointNeRF_lab/pointnerf-master/pointnerf/run/../run/evaluate.py:11: FutureWarning: `multichannel` is a deprecated argument name for `structural_similarity`. It will be removed in versio
n 1.0. Please use `channel_axis` instead.
return structural_similarity(gt, img, win_size=win_size, multichannel=multichannel)
200 images computed
psnr: 35.615182
ssim: 0.991588
lpips: 0.009299
vgglpips: 0.021933
rmse: 0.016793
--------------------------------Finish Evaluation--------------------------------
--------------------------------Finish generating vid--------------------------------
test at iter 200000, PSNR: 35.63845443725586, best_PSNR: 35.63845443725586, best_iter: 200000
end loading
end loading
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------dev_scripts/w_n360/chair.sh: line 162: 1084863 Aborted (core dumped) python3 train_ft.py --name $name --scan $scan --data_root $data_root --dataset_name $dataset_name --model $model --whi
ch_render_func $which_render_func --which_blend_func $which_blend_func --out_channels $out_channels --num_pos_freqs $num_pos_freqs --num_viewdir_freqs $num_viewdir_freqs --random_sample $random_sample --
random_sample_size $random_sample_size --batch_size $batch_size --maximum_step $maximum_step --plr $plr --lr $lr --lr_policy $lr_policy --lr_decay_iters $lr_decay_iters --lr_decay_exp $lr_decay_exp --gpu
_ids $gpu_ids --checkpoints_dir $checkpoints_dir --save_iter_freq $save_iter_freq --niter $niter --niter_decay $niter_decay --n_threads $n_threads --pin_data_in_memory $pin_data_in_memory --train_and_tes
t $train_and_test --test_num $test_num --test_freq $test_freq --test_num_step $test_num_step --test_color_loss_items $test_color_loss_items --print_freq $print_freq --bg_color $bg_color --split $split --
which_ray_generation $which_ray_generation --near_plane $near_plane --far_plane $far_plane --dir_norm $dir_norm --which_tonemap_func $which_tonemap_func --load_points $load_points --resume_dir $resume_di
r --resume_iter $resume_iter --feature_init_method $feature_init_method --agg_axis_weight $agg_axis_weight --agg_distance_kernel $agg_distance_kernel --radius_limit_scale $radius_limit_scale --depth_limi
t_scale $depth_limit_scale --vscale $vscale --kernel_size $kernel_size --SR $SR --K $K --P $P --NN $NN --agg_feat_xyz_mode $agg_feat_xyz_mode --agg_alpha_xyz_mode $agg_alpha_xyz_mode --agg_color_xyz_mode
$agg_color_xyz_mode --save_point_freq $save_point_freq --raydist_mode_unit $raydist_mode_unit --agg_dist_pers $agg_dist_pers --agg_intrp_order $agg_intrp_order --shading_feature_mlp_layer0 $shading_feat
ure_mlp_layer0 --shading_feature_mlp_layer1 $shading_feature_mlp_layer1 --shading_feature_mlp_layer2 $shading_feature_mlp_layer2 --shading_feature_mlp_layer3 $shading_feature_mlp_layer3 --shading_feature
_num $shading_feature_num --dist_xyz_freq $dist_xyz_freq --shpnt_jitter $shpnt_jitter --shading_alpha_mlp_layer $shading_alpha_mlp_layer --shading_color_mlp_layer $shading_color_mlp_layer --which_agg_mod
el $which_agg_model --color_loss_weights $color_loss_weights --num_feat_freqs $num_feat_freqs --dist_xyz_deno $dist_xyz_deno --apply_pnt_mask $apply_pnt_mask --point_features_dim $point_features_dim --co
lor_loss_items $color_loss_items --feedforward $feedforward --trgt_id $trgt_id --depth_vid $depth_vid --ref_vid $ref_vid --manual_depth_view $manual_depth_view --pre_d_est $pre_d_est --depth_occ $depth_o
cc --manual_std_depth $manual_std_depth --visual_items $visual_items --appr_feature_str0 $appr_feature_str0 --init_view_num $init_view_num --feat_grad $feat_grad --conf_grad $conf_grad --dir_grad $dir_gr
ad --color_grad $color_grad --depth_conf_thresh $depth_conf_thresh --bgmodel $bgmodel --vox_res $vox_res --act_type $act_type --geo_cnsst_num $geo_cnsst_num --point_conf_mode $point_conf_mode --point_dir
_mode $point_dir_mode --point_color_mode $point_color_mode --normview $normview --prune_thresh $prune_thresh --prune_iter $prune_iter --full_comb $full_comb --sparse_loss_weight $sparse_loss_weight --def
ault_conf $default_conf --prob_freq $prob_freq --prob_num_step $prob_num_step --prob_thresh $prob_thresh --prob_mul $prob_mul --prob_kernel_size $prob_kernel_size --prob_tiers $prob_tiers --alpha_range $
alpha_range --ranges $ranges --vid $vid --vsize $vsize --wcoord_query $wcoord_query --max_o $max_o --zero_one_loss_items $zero_one_loss_items --zero_one_loss_weights $zero_one_loss_weights --prune_max_it
er $prune_max_iter --far_thresh $far_thresh --debug
opt.color_loss_items ['ray_masked_coarse_raycolor', 'ray_miss_coarse_raycolor', 'coarse_raycolor']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Debug Mode
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-
or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
arrays = [asanyarray(arr) for arr in arrays]
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered i
nternally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
dataset total: train 100
dataset [NerfSynthFtDataset] was created
../checkpoints/nerfsynth/chair/*_net_ray_marching.pth
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Continue training from 200000 epoch
Iter: 200000
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
opt.act_type!!!!!!!!! LeakyReLU
self.points_embeding torch.Size([1, 694340, 32])
querier device cuda:0 0
neural_params [('module.neural_points.xyz', torch.Size([694340, 3]), False), ('module.neural_points.points_embeding', torch.Size([1, 694340, 32]), True), ('module.neural_points.points_conf', torch.Size([
1, 694340, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 694340, 3]), True), ('module.neural_points.points_color', torch.Size([1, 694340, 3]), True), ('module.neural_points.Rw2c', torch.
Size([3, 3]), False)]
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! 200000
loading ray_marching from ../checkpoints/nerfsynth/chair/200000_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 29.504M
------------------------------------------------
# training images = 100
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, y
ou should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more det
ails at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
chair: End of stepts 200000 / 200000 Time Taken: 0.2245779037475586 sec
saving model (chair, epoch 1999, total_steps 200000)
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-
or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
arrays = [asanyarray(arr) for arr in arrays]
dataset total: test 200
dataset [NerfSynthFtDataset] was created
full datasets test:
-----------------------------------Testing-----------------------------------
Yeah, I have some implementation issues with pycuda, but it's not an error. The pycuda device handle release problem is confusing to me. It seems your chair results are even better than mine. The pycuda-PyTorch integration is tricky. I think I have seen the cublasSgemm error before, but I can't reproduce it on any of my machines now. Stack Overflow suggests adding line 20 in the query_point_indices.py file, but I found no need for it, so I commented it out. You can also uncomment torch.cuda.synchronize() in its query_grid_point_index function to see at which step pycuda breaks.
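The PyCUDA message in the log above says to use `Context.pop()`. A minimal sketch of that general pattern follows: a context manager that pushes a context and pops it again even if the body raises. This thread does not show how the repo creates its PyCUDA context, so whether this fits the actual code is unverified. `pushed_context` and `FakeContext` are hypothetical names; `FakeContext` only stands in for an object with `push()` and `pop()` methods, as `pycuda.driver.Context` has, so the sketch runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def pushed_context(ctx):
    """Make `ctx` current for the duration of the block, then pop it.

    `ctx` is expected to offer push() and pop().
    """
    ctx.push()
    try:
        yield ctx
    finally:
        # Runs even if the body raises, so the stack is left as it was found.
        ctx.pop()


class FakeContext:
    """Hypothetical stand-in for a PyCUDA context (no GPU needed)."""

    def __init__(self):
        self.depth = 0

    def push(self):
        self.depth += 1

    def pop(self):
        self.depth -= 1


if __name__ == "__main__":
    ctx = FakeContext()
    try:
        with pushed_context(ctx):
            raise RuntimeError("simulated CUBLAS failure")
    except RuntimeError:
        pass
    print("context stack depth after failure:", ctx.depth)
```

The point of the `finally` clause is that a crash inside the block (such as the cublasSgemm failure) does not leave a context on the stack at interpreter shutdown.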
I have uncommented torch.cuda.synchronize() in the query_grid_point_index function, but it doesn't seem to solve the problem. Is this normal? When I run the bash dev_scripts/w_n360/chair.sh command, its output is shown below:
`++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Debug Mode
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
dataset total: train 100
dataset [NerfSynthFtDataset] was created
../checkpoints/nerfsynth/drums/*_net_ray_marching.pth
-----------------------------------Generate Points-----------------------------------
loading model ../checkpoints/MVSNet/model_000014.ckpt
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! best
loading mvs from ../checkpoints/init/dtu_dgt_d012_img0123_conf_agg2_32_dirclr20/best_net_mvs.pth
------------------- Networks -------------------
[Network mvs] Total number of parameters: 0.382M
0%| | 1/555 [00:01<13:52, 1.50s/it]
100%|██████████| 555/555 [03:31<00:00, 2.62it/s]
0%| | 0/555 [00:00<?, ?it/s]
100%|██████████| 555/555 [17:36<00:00, 1.90s/it]
xyz_world_all torch.Size([97493176, 3]) torch.Size([97493176, 1]) torch.Size([97493176])
%%%%%%%%%%%%% getattr(dataset, spacemin, None) None
vishull_mask torch.Size([97493176])
alpha masking xyz_world_all torch.Size([34014530, 3]) torch.Size([34014530, 1])
xyz_residual torch.Size([34014530])
min_idx torch.Size([538111])
after voxelize: torch.Size([538111, 3]) torch.Size([538111, 1])
0%| | 0/555 [00:00<?, ?it/s]
100%|██████████| 555/555 [00:07<00:00, 72.08it/s]
self.model_names ['mvs']
opt.act_type!!!!!!!!! LeakyReLU
querier device cuda:0 0
no neural points as nn.Parameter
model [MvsPointsVolumetricModel] was created
neural_params [('module.neural_points.xyz', torch.Size([538111, 3]), False), ('module.neural_points.points_conf', torch.Size([1, 538111, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_color', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_embeding', torch.Size([1, 538111, 32]), True), ('module.neural_points.Rw2c', torch.Size([3, 3]), False)]
opt.resume_iter!!!!!!!!! best
loading ray_marching from ../checkpoints/init/dtu_dgt_d012_img0123_conf_agg2_32_dirclr20/best_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 22.942M
------------------------------------------------
# training images = 100
saving model (drums, epoch 0, total_steps 0)
Traceback (most recent call last):
File "train_ft.py", line 1084, in <module>
main()
File "train_ft.py", line 940, in main
model.optimize_parameters(total_steps=total_steps)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 215, in optimize_parameters
self.forward()
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/mvs_points_volumetric_model.py", line 126, in forward
self.output = self.run_network_models()
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 85, in run_network_models
return self.fill_invalid(self.net_ray_marching(**self.input), self.input)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 270, in forward
decoded_features, ray_valid, weight, conf_coefficient = self.aggregator(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, grid_vox_sz)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/aggregators/point_aggregators.py", line 811, in forward
output, _ = getattr(self, self.which_agg_model, None)(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, weight * conf_coefficient, pnt_mask_flat, pts, viewdirs, total_len, ray_valid, in_shape, dists)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/aggregators/point_aggregators.py", line 605, in viewmlp
alpha = self.raw2out_density(self.alpha_branch(alpha_in))
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
end loading
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
dev_scripts/w_n360/drums.sh: line 164: 1184025 Aborted (core dumped) python3 train_ft.py --name $name --scan $scan --data_root $data_root --dataset_name $dataset_name --model $model --which_render_func $which_render_func --which_blend_func $which_blend_func --out_channels $out_channels --num_pos_freqs $num_pos_freqs --num_viewdir_freqs $num_viewdir_freqs --random_sample $random_sample --random_sample_size $random_sample_size --batch_size $batch_size --maximum_step $maximum_step --plr $plr --lr $lr --lr_policy $lr_policy --lr_decay_iters $lr_decay_iters --lr_decay_exp $lr_decay_exp --gpu_ids $gpu_ids --checkpoints_dir $checkpoints_dir --save_iter_freq $save_iter_freq --niter $niter --niter_decay $niter_decay --n_threads $n_threads --pin_data_in_memory $pin_data_in_memory --train_and_test $train_and_test --test_num $test_num --test_freq $test_freq --test_num_step $test_num_step --test_color_loss_items $test_color_loss_items --print_freq $print_freq --bg_color $bg_color --split $split --which_ray_generation $which_ray_generation --near_plane $near_plane --far_plane $far_plane --dir_norm $dir_norm --which_tonemap_func $which_tonemap_func --load_points $load_points --resume_dir $resume_dir --resume_iter $resume_iter --feature_init_method $feature_init_method --agg_axis_weight $agg_axis_weight --agg_distance_kernel $agg_distance_kernel --radius_limit_scale $radius_limit_scale --depth_limit_scale $depth_limit_scale --vscale $vscale --kernel_size $kernel_size --SR $SR --K $K --P $P --NN $NN --agg_feat_xyz_mode $agg_feat_xyz_mode --agg_alpha_xyz_mode $agg_alpha_xyz_mode --agg_color_xyz_mode $agg_color_xyz_mode --save_point_freq $save_point_freq --raydist_mode_unit $raydist_mode_unit --agg_dist_pers $agg_dist_pers --agg_intrp_order $agg_intrp_order --shading_feature_mlp_layer0 $shading_feature_mlp_layer0 --shading_feature_mlp_layer1 $shading_feature_mlp_layer1 --shading_feature_mlp_layer2 $shading_feature_mlp_layer2 --shading_feature_mlp_layer3 $shading_feature_mlp_layer3 
--shading_feature_num $shading_feature_num --dist_xyz_freq $dist_xyz_freq --shpnt_jitter $shpnt_jitter --shading_alpha_mlp_layer $shading_alpha_mlp_layer --shading_color_mlp_layer $shading_color_mlp_layer --which_agg_model $which_agg_model --color_loss_weights $color_loss_weights --num_feat_freqs $num_feat_freqs --dist_xyz_deno $dist_xyz_deno --apply_pnt_mask $apply_pnt_mask --point_features_dim $point_features_dim --color_loss_items $color_loss_items --feedforward $feedforward --trgt_id $trgt_id --depth_vid $depth_vid --ref_vid $ref_vid --manual_depth_view $manual_depth_view --pre_d_est $pre_d_est --depth_occ $depth_occ --manual_std_depth $manual_std_depth --visual_items $visual_items --appr_feature_str0 $appr_feature_str0 --init_view_num $init_view_num --feat_grad $feat_grad --conf_grad $conf_grad --dir_grad $dir_grad --color_grad $color_grad --depth_conf_thresh $depth_conf_thresh --bgmodel $bgmodel --vox_res $vox_res --act_type $act_type --geo_cnsst_num $geo_cnsst_num --point_conf_mode $point_conf_mode --point_dir_mode $point_dir_mode --point_color_mode $point_color_mode --normview $normview --prune_thresh $prune_thresh --prune_iter $prune_iter --full_comb $full_comb --sparse_loss_weight $sparse_loss_weight --default_conf $default_conf --prob_freq $prob_freq --prob_num_step $prob_num_step --prob_thresh $prob_thresh --prob_mul $prob_mul --prob_kernel_size $prob_kernel_size --prob_tiers $prob_tiers --alpha_range $alpha_range --ranges $ranges --vid $vid --vsize $vsize --wcoord_query $wcoord_query --max_o $max_o --zero_one_loss_items $zero_one_loss_items --zero_one_loss_weights $zero_one_loss_weights --prune_max_iter $prune_max_iter --far_thresh $far_thresh --debug
opt.color_loss_items ['ray_masked_coarse_raycolor', 'ray_miss_coarse_raycolor', 'coarse_raycolor']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Debug Mode
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
dataset total: train 100
dataset [NerfSynthFtDataset] was created
../checkpoints/nerfsynth/drums/*_net_ray_marching.pth
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Continue training from 0 epoch
Iter: 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
opt.act_type!!!!!!!!! LeakyReLU
self.points_embeding torch.Size([1, 538111, 32])
querier device cuda:0 0
neural_params [('module.neural_points.xyz', torch.Size([538111, 3]), False), ('module.neural_points.points_embeding', torch.Size([1, 538111, 32]), True), ('module.neural_points.points_conf', torch.Size([1, 538111, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_color', torch.Size([1, 538111, 3]), True), ('module.neural_points.Rw2c', torch.Size([3, 3]), False)]
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! 0
loading ray_marching from ../checkpoints/nerfsynth/drums/0_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 22.942M
------------------------------------------------
# training images = 100
saving model (drums, epoch 0, total_steps 0)
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
arrays = [asanyarray(arr) for arr in arrays]
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
optimizer 1, learning rate = 0.0005000
optimizer 2, learning rate = 0.0019998
End of iteration 40 Number of batches 40 Time taken: 7.17s
[Average Loss] total: 0.0422415882 ray_masked_coarse_raycolor: 0.0424671248 ray_masked_coarse_raycolor_psnr: 13.9493732452 ray_miss_coarse_raycolor: 0.0365000255 ray_miss_coarse_raycolor_psnr: inf coarse_raycolor: 0.0117533728 coarse_raycolor_psnr: 19.4890022278 conf_coefficient: -2.2853794098
optimizer 1, learning rate = 0.0004999
optimizer 2, learning rate = 0.0019996
End of iteration 80 Number of batches 40 Time taken: 6.70s
[Average Loss] total: 0.0286709573 ray_masked_coarse_raycolor: 0.0291084927 ray_masked_coarse_raycolor_psnr: 15.4500093460 ray_miss_coarse_raycolor: 0.0000000000 ray_miss_coarse_raycolor_psnr: inf coarse_raycolor: 0.0080277314 coarse_raycolor_psnr: 21.0282936096 conf_coefficient: -4.4054217339
optimizer 1, learning rate = 0.0004999
optimizer 2, learning rate = 0.0019994
End of iteration 120 Number of batches 40 Time taken: 7.11s
[Average Loss] total: 0.0262186974 ray_masked_coarse_raycolor: 0.0267978553 ray_masked_coarse_raycolor_psnr: 15.8921508789 ray_miss_coarse_raycolor: 0.0000593065 ray_miss_coarse_raycolor_psnr: inf coarse_raycolor: 0.0075231828 coarse_raycolor_psnr: 21.3510837555 conf_coefficient: -5.8215517998 `
I'm not sure; it looks like a pycuda thing. But the optimization afterwards still seems to be working well.
Thank you for your careful reply, I will ignore this PyCUDA message if it has no impact. Also, if you have time, could you update the description of each parser argument? Some of the 'help' seems to be a direct copy and paste. This will help us to explore PointNeRF more deeply and to make changes in the code. Thank you!
Sure, good suggestions
Hi. Have you solved this problem yet? I have run into the same problem: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc). I am using PyTorch 1.10.0 + CUDA 11.3, and I don't know whether it is caused by PyTorch.
Hi, I think pycuda is the cause of this error. You can try building from source.
Thanks for your reply. After I replaced my graphics card and installed PyTorch 1.8.1 and CUDA 10.2, it worked well. By the way, I have another question: I want to use a dataset I collected myself for testing, but I don't know how to create /data/dtu_configs/pairs.th for your program. I am looking forward to your reply.
And I really want to know how you obtain self.pair_idx.
> /home/pth-algo/Desktop/pointnerf-master/data/dtu_ft_dataset.py(110)initialize()
-> print("dtu_ft train id", self.pair_idx[0])
(Pdb) self.pair_idx
[[25, 21, 33, 22, 14, 15, 26, 30, 31, 35, 34, 43, 46, 29, 16, 36], [32, 24, 23, 44]]
Hi, these pairs are from MVSNeRF / MVSNet. I have no idea why they use these; you can ask them about it.
Thanks for your quick reply. In MVSNeRF, I found a "Pairs generation" section in "renderer.ipynb" that generates the pairs.
I am also getting a similar error to the others above:
Traceback (most recent call last):
File "train_ft.py", line 1094, in <module>
main()
File "train_ft.py", line 826, in main
test(model, test_dataset, Visualizer(test_opt), test_opt, test_bg_info, test_steps=total_steps)
File "train_ft.py", line 315, in test
model.test()
File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/mvs_points_volumetric_model.py", line 335, in test
self.output = self.run_network_models()
File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/neural_points_volumetric_model.py", line 88, in run_network_models
return self.fill_invalid(self.net_ray_marching(**self.input), self.input)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/neural_points_volumetric_model.py", line 419, in forward
sample_ray_dirs, vsize, grid_vox_sz)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/aggregators/point_aggregators.py", line 824, in forward
output, _ = getattr(self, self.which_agg_model, None)(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, weight * conf_coefficient, pnt_mask_flat, pts, viewdirs, total_len, ray_valid, in_shape, dists)
File "/group-volume/Neural-Implicit-Functions/ttaa/point-nerf/pointnerf/run/../models/aggregators/point_aggregators.py", line 605, in viewmlp
temp = self.color_branch(color_in)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
I have tried with pytorch 1.10.0 + cuda 11.3, and also with pytorch 1.8.1 + cuda 10.2, but get the same error in the same place. I have also, in both cases, built and installed pycuda manually.
This only happens to me with the DTU scripts. I can run the w_n360 scripts just fine.
Any suggestions?
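One thing that may help narrow this down: CUDA kernels are launched asynchronously, so the Python frame shown in the traceback (here the cublasSgemm call inside F.linear) is not necessarily where the fault actually happened. Setting the CUDA_LAUNCH_BLOCKING=1 environment variable makes launches synchronous, so the traceback should point at the failing call. Below is a minimal sketch of launching a script that way; the helper names blocking_env and run_blocking are made up for illustration and are not part of this repository.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1 every kernel launch waits for completion,
    # so an error is raised at the call that caused it (at a speed cost).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd, base_env=None):
    """Run cmd (a list of arguments) with CUDA_LAUNCH_BLOCKING=1; return its exit code."""
    return subprocess.run(cmd, env=blocking_env(base_env)).returncode


if __name__ == "__main__":
    # Example: python this_file.py bash dev_scripts/w_n360/chair.sh
    sys.exit(run_blocking(sys.argv[1:]) if len(sys.argv) > 1 else 0)
```

If the traceback then moves to an earlier indexing or query step, the cuBLAS error was probably only a downstream symptom.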
I have uncommented torch.cuda.synchronize() in the query_grid_point_index function, but it doesn't seem to solve the problem. Is this normal? When I run the bash dev_scripts/w_n360/chair.sh
command, its output is shown below:
`++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Debug Mode ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
dataset total: train 100
dataset [NerfSynthFtDataset] was created
../checkpoints/nerfsynth/drums/*_net_ray_marching.pth
-----------------------------------Generate Points-----------------------------------
loading model ../checkpoints/MVSNet/model_000014.ckpt
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! best
loading mvs from ../checkpoints/init/dtu_dgt_d012_img0123_conf_agg2_32_dirclr20/best_net_mvs.pth
------------------- Networks -------------------
[Network mvs] Total number of parameters: 0.382M
0%| | 1/555 [00:01<13:52, 1.50s/it]
100%|██████████| 555/555 [03:31<00:00, 2.62it/s]
0%| | 0/555 [00:00<?, ?it/s]
100%|██████████| 555/555 [17:36<00:00, 1.90s/it]
xyz_world_all torch.Size([97493176, 3]) torch.Size([97493176, 1]) torch.Size([97493176])
%%%%%%%%%%%%% getattr(dataset, spacemin, None) None
vishull_mask torch.Size([97493176])
alpha masking
xyz_world_all torch.Size([34014530, 3]) torch.Size([34014530, 1])
xyz_residual torch.Size([34014530])
min_idx torch.Size([538111])
after voxelize: torch.Size([538111, 3]) torch.Size([538111, 1])
0%| | 0/555 [00:00<?, ?it/s]
100%|██████████| 555/555 [00:07<00:00, 72.08it/s]
self.model_names ['mvs']
opt.act_type!!!!!!!!! LeakyReLU
querier device cuda:0 0
no neural points as nn.Parameter
model [MvsPointsVolumetricModel] was created
neural_params [('module.neural_points.xyz', torch.Size([538111, 3]), False), ('module.neural_points.points_conf', torch.Size([1, 538111, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_color', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_embeding', torch.Size([1, 538111, 32]), True), ('module.neural_points.Rw2c', torch.Size([3, 3]), False)]
opt.resume_iter!!!!!!!!! best
loading ray_marching from ../checkpoints/init/dtu_dgt_d012_img0123_conf_agg2_32_dirclr20/best_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 22.942M
------------------------------------------------
# training images = 100
saving model (drums, epoch 0, total_steps 0)
Traceback (most recent call last):
File "train_ft.py", line 1084, in <module>
main()
File "train_ft.py", line 940, in main
model.optimize_parameters(total_steps=total_steps)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 215, in optimize_parameters
self.forward()
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/mvs_points_volumetric_model.py", line 126, in forward
self.output = self.run_network_models()
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 85, in run_network_models
return self.fill_invalid(self.net_ray_marching(**self.input), self.input)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/neural_points_volumetric_model.py", line 270, in forward
decoded_features, ray_valid, weight, conf_coefficient = self.aggregator(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, grid_vox_sz)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/aggregators/point_aggregators.py", line 811, in forward
output, _ = getattr(self, self.which_agg_model, None)(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, weight * conf_coefficient, pnt_mask_flat, pts, viewdirs, total_len, ray_valid, in_shape, dists)
File "/home/lee/Desktop/pointNeRF_lab/pointnerf-master/run/../models/aggregators/point_aggregators.py", line 605, in viewmlp
alpha = self.raw2out_density(self.alpha_branch(alpha_in))
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
end loading
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being cleaned up. At this point in our execution, CUDA may already have been deinitialized, so there is no way we can finish cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
dev_scripts/w_n360/drums.sh: line 164: 1184025 Aborted (core dumped) python3 train_ft.py --name $name --scan $scan --data_root $data_root --dataset_name $dataset_name --model $model --which_render_func $which_render_func --which_blend_func $which_blend_func --out_channels $out_channels --num_pos_freqs $num_pos_freqs --num_viewdir_freqs $num_viewdir_freqs --random_sample $random_sample --random_sample_size $random_sample_size --batch_size $batch_size --maximum_step $maximum_step --plr $plr --lr $lr --lr_policy $lr_policy --lr_decay_iters $lr_decay_iters --lr_decay_exp $lr_decay_exp --gpu_ids $gpu_ids --checkpoints_dir $checkpoints_dir --save_iter_freq $save_iter_freq --niter $niter --niter_decay $niter_decay --n_threads $n_threads --pin_data_in_memory $pin_data_in_memory --train_and_test $train_and_test --test_num $test_num --test_freq $test_freq --test_num_step $test_num_step --test_color_loss_items $test_color_loss_items --print_freq $print_freq --bg_color $bg_color --split $split --which_ray_generation $which_ray_generation --near_plane $near_plane --far_plane $far_plane --dir_norm $dir_norm --which_tonemap_func $which_tonemap_func --load_points $load_points --resume_dir $resume_dir --resume_iter $resume_iter --feature_init_method $feature_init_method --agg_axis_weight $agg_axis_weight --agg_distance_kernel $agg_distance_kernel --radius_limit_scale $radius_limit_scale --depth_limit_scale $depth_limit_scale --vscale $vscale --kernel_size $kernel_size --SR $SR --K $K --P $P --NN $NN --agg_feat_xyz_mode $agg_feat_xyz_mode --agg_alpha_xyz_mode $agg_alpha_xyz_mode
--agg_color_xyz_mode $agg_color_xyz_mode --save_point_freq $save_point_freq --raydist_mode_unit $raydist_mode_unit --agg_dist_pers $agg_dist_pers --agg_intrp_order $agg_intrp_order --shading_feature_mlp_layer0 $shading_feature_mlp_layer0 --shading_feature_mlp_layer1 $shading_feature_mlp_layer1 --shading_feature_mlp_layer2 $shading_feature_mlp_layer2 --shading_feature_mlp_layer3 $shading_feature_mlp_layer3 --shading_feature_num $shading_feature_num --dist_xyz_freq $dist_xyz_freq --shpnt_jitter $shpnt_jitter --shading_alpha_mlp_layer $shading_alpha_mlp_layer --shading_color_mlp_layer $shading_color_mlp_layer --which_agg_model $which_agg_model --color_loss_weights $color_loss_weights --num_feat_freqs $num_feat_freqs --dist_xyz_deno $dist_xyz_deno --apply_pnt_mask $apply_pnt_mask --point_features_dim $point_features_dim --color_loss_items $color_loss_items --feedforward $feedforward --trgt_id $trgt_id --depth_vid $depth_vid --ref_vid $ref_vid --manual_depth_view $manual_depth_view --pre_d_est $pre_d_est --depth_occ $depth_occ --manual_std_depth $manual_std_depth --visual_items $visual_items --appr_feature_str0 $appr_feature_str0 --init_view_num $init_view_num --feat_grad $feat_grad --conf_grad $conf_grad --dir_grad $dir_grad --color_grad $color_grad --depth_conf_thresh $depth_conf_thresh --bgmodel $bgmodel --vox_res $vox_res --act_type $act_type --geo_cnsst_num $geo_cnsst_num --point_conf_mode $point_conf_mode --point_dir_mode $point_dir_mode --point_color_mode $point_color_mode --normview $normview --prune_thresh $prune_thresh --prune_iter $prune_iter --full_comb $full_comb --sparse_loss_weight $sparse_loss_weight --default_conf $default_conf --prob_freq $prob_freq --prob_num_step $prob_num_step --prob_thresh $prob_thresh --prob_mul $prob_mul --prob_kernel_size $prob_kernel_size --prob_tiers $prob_tiers --alpha_range $alpha_range --ranges $ranges --vid $vid --vsize $vsize --wcoord_query $wcoord_query --max_o $max_o --zero_one_loss_items $zero_one_loss_items 
--zero_one_loss_weights $zero_one_loss_weights --prune_max_iter $prune_max_iter --far_thresh $far_thresh --debug
opt.color_loss_items ['ray_masked_coarse_raycolor', 'ray_miss_coarse_raycolor', 'coarse_raycolor']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Debug Mode ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
dataset total: train 100
dataset [NerfSynthFtDataset] was created
../checkpoints/nerfsynth/drums/*_net_ray_marching.pth
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Continue training from 0 epoch
Iter: 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
opt.act_type!!!!!!!!! LeakyReLU
self.points_embeding torch.Size([1, 538111, 32])
querier device cuda:0 0
neural_params [('module.neural_points.xyz', torch.Size([538111, 3]), False), ('module.neural_points.points_embeding', torch.Size([1, 538111, 32]), True), ('module.neural_points.points_conf', torch.Size([1, 538111, 1]), True), ('module.neural_points.points_dir', torch.Size([1, 538111, 3]), True), ('module.neural_points.points_color', torch.Size([1, 538111, 3]), True), ('module.neural_points.Rw2c', torch.Size([3, 3]), False)]
model [MvsPointsVolumetricModel] was created
opt.resume_iter!!!!!!!!! 0
loading ray_marching from ../checkpoints/nerfsynth/drums/0_net_ray_marching.pth
------------------- Networks -------------------
[Network ray_marching] Total number of parameters: 22.942M
------------------------------------------------
# training images = 100
saving model (drums, epoch 0, total_steps 0)
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
arrays = [asanyarray(arr) for arr in arrays]
/home/lee/miniconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
optimizer 1, learning rate = 0.0005000
optimizer 2, learning rate = 0.0019998
End of iteration 40 Number of batches 40 Time taken: 7.17s
[Average Loss] total: 0.0422415882 ray_masked_coarse_raycolor: 0.0424671248 ray_masked_coarse_raycolor_psnr: 13.9493732452 ray_miss_coarse_raycolor: 0.0365000255 ray_miss_coarse_raycolor_psnr: inf coarse_raycolor: 0.0117533728 coarse_raycolor_psnr: 19.4890022278 conf_coefficient: -2.2853794098
optimizer 1, learning rate = 0.0004999
optimizer 2, learning rate = 0.0019996
End of iteration 80 Number of batches 40 Time taken: 6.70s
[Average Loss] total: 0.0286709573 ray_masked_coarse_raycolor: 0.0291084927 ray_masked_coarse_raycolor_psnr: 15.4500093460 ray_miss_coarse_raycolor: 0.0000000000 ray_miss_coarse_raycolor_psnr: inf coarse_raycolor: 0.0080277314 coarse_raycolor_psnr: 21.0282936096 conf_coefficient: -4.4054217339
optimizer 1, learning rate = 0.0004999
optimizer 2, learning rate = 0.0019994
End of iteration 120 Number of batches 40 Time taken: 7.11s
[Average Loss] total: 0.0262186974 ray_masked_coarse_raycolor: 0.0267978553 ray_masked_coarse_raycolor_psnr: 15.8921508789 ray_miss_coarse_raycolor: 0.0000593065 ray_miss_coarse_raycolor_psnr: inf coarse_raycolor: 0.0075231828 coarse_raycolor_psnr: 21.3510837555 conf_coefficient: -5.8215517998 `
I also encountered the same problem. I tried changing the torch/CUDA version and switching machines, and I also checked the shapes of the inputs, but none of this solved it. In my tests the error occurs randomly, and running the same layers on a subsequence (a slice) of the input tends to succeed. So I replaced the original code with the following. It costs some speed, but it keeps the code running.
in models/point raw:
color_output = self.raw2out_color(self.color_branch(color_in))
After:
try:
    color_output = self.raw2out_color(self.color_branch(color_in))
    cal_fail = False
except RuntimeError:
    cal_fail = True
while cal_fail:
    # Retry in chunks of a random size (at least 1, so range() gets a valid step).
    max_iter = max(1, int(color_in.shape[0] * torch.rand(1)))
    try:
        color_output_part = []
        for i_start in range(0, color_in.shape[0], max_iter):
            i_end = min(color_in.shape[0], i_start + max_iter)
            color_output_part.append(self.raw2out_color(self.color_branch(color_in[i_start:i_end])))
        color_output = torch.cat(color_output_part, dim=0)
        cal_fail = False
    except RuntimeError:
        cal_fail = True
Since this error (cublasSgemm) occurs randomly, I think the same workaround can be applied everywhere the error shows up. Of course, this is not the most elegant solution, and I'm looking forward to a better one.
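For reuse at several call sites, the retry-on-smaller-chunks idea can be factored into a helper. The following is only a sketch of a deterministic variant that halves the input on failure instead of picking a random chunk size; apply_with_halving and its parameters are hypothetical names, not part of the repository, and it assumes (as reported above) that the failure is recoverable and that the layers act on each row independently.

```python
def apply_with_halving(fn, x, concat, min_size=1):
    """Apply fn to x; on RuntimeError, retry on the two halves and join the results.

    fn      -- callable applied to a batch (for example an MLP branch)
    x       -- any batch that supports len() and slicing along its first axis
    concat  -- callable that joins a list of partial outputs
    """
    try:
        return fn(x)
    except RuntimeError:
        n = len(x)
        if n <= min_size:
            # Cannot split any further: surface the original error.
            raise
        mid = n // 2
        left = apply_with_halving(fn, x[:mid], concat, min_size)
        right = apply_with_halving(fn, x[mid:], concat, min_size)
        return concat([left, right])
```

With PyTorch tensors this could be called as apply_with_halving(lambda c: self.raw2out_color(self.color_branch(c)), color_in, lambda parts: torch.cat(parts, dim=0)). Unlike the random-size loop, it stops and re-raises once a single row still fails, rather than retrying forever.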
I also ran into the CUBLAS_STATUS_EXECUTION_FAILED problem and really need help.
I tried this code:
try:
    color_output = self.raw2out_color(self.color_branch(color_in))
    cal_fail = False
except:
    cal_fail = True
while cal_fail:
    max_iter = int(color_in.shape[0] * torch.rand(1))
    try:
        color_output_part = []
        for i_start in range(0, color_in.shape[0], max_iter):
            i_end = min(color_in.shape[0], i_start + max_iter)
            color_output_part.append(self.raw2out_color(self.color_branch(color_in[i_start:i_end])))
        color_output = torch.cat(color_output_part, dim=0)
        cal_fail = False
    except:
        cal_fail = True
Then the error became
File "/home/zhangchuanyi/pointnerf/run/../models/aggregators/point_aggregators.py", line 844, in forward
output, _ = getattr(self, self.which_agg_model, None)(sampled_color, sampled_Rw2c, sampled_dir, sampled_conf, sampled_embedding, sampled_xyz_pers, sampled_xyz, sample_pnt_mask, sample_loc, sample_loc_w, sample_ray_dirs, vsize, weight * conf_coefficient, pnt_mask_flat, pts, viewdirs, total_len, ray_valid, in_shape, dists)
File "/home/zhangchuanyi/pointnerf/run/../models/aggregators/point_aggregators.py", line 547, in viewmlp
feat = self.block1(feat)
File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zhangchuanyi/anaconda3/envs/pointnerf/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
@zhangchuanyi96 I still encounter this problem, have you solved this issue?
Hello authors, thank you very much for your paper and for releasing code! I am very excited about this work.
I am trying to run inference on a model that has been trained for point initialization (not with per-scene optimization). The README indicates that it should be possible to perform inference on DTU with bash dev_scripts/dtu_test_inf/inftest_scan8.sh. When I run this script, I get an error:
I am using CUDA 11.1 and I believe that everything CUDA-related is installed correctly.
I was also confused by something else: this script calls train_ft.py, not test.py as I would have expected. In fact, there is not a single script in dev_scripts that calls test.py. Am I misunderstanding something here, or is test.py actually never used?