cudaCheckError() failed: unspecified launch failure

eneserdo commented 2 years ago

Hi, when I run ./experiments/scripts/demo.sh, I am getting the following error:

...
object 0, class 019_pitcher_base, z 0.5498372912406921, z new 0.6564772725105286
object 1, class 008_pudding_box, z 0.6858921647071838, z new 0.7096845507621765
object 2, class 002_master_chef_can, z 0.5724276304244995, z new 0.6095401048660278
object 3, class 052_extra_large_clamp, z 0.6460237503051758, z new 0.6376593708992004
object 4, class 011_banana, z 0.6999950408935547, z new 0.7429261207580566
/opt/conda/conda-bld/pytorch_1591914742272/work/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
sdf 27338 points for object 0, class 10 019_pitcher_base
sdf 6888 points for object 1, class 6 008_pudding_box
sdf 16038 points for object 2, class 0 002_master_chef_can
sdf 11314 points for object 3, class 18 052_extra_large_clamp
sdf 3385 points for object 4, class 9 011_banana
sdf with 64963 points
cudaCheckError() failed: unspecified launch failure

I tried on multiple machines. Here is the full error log

My setup: Ubuntu 20.04, CUDA 10.1, PyTorch 1.4

Any help will be appreciated.

tsrobcvai commented 1 year ago

I also ran into this problem on Ubuntu 20.04, CUDA 11.1, PyTorch 1.8. Have you solved it?

eneserdo commented 1 year ago

Nope

wetoo-cando commented 11 months ago

Same problem here when I run ./experiments/scripts/dex_ycb_test_s0.sh 0 with Ubuntu 20.04, CUDA 11.1, torch 1.10.1+cu111.

I am in a python-venv inside a docker container based on https://hub.docker.com/r/nvidia/cudagl.

@eneserdo @mcgilltaosun were you able to solve this?

eneserdo commented 11 months ago

I dropped my job because of this error. Please do not tag me anymore. Why, NVIDIA, why is your code not reproducible?

wetoo-cando commented 10 months ago

A little more print debugging shows the exact location of the error:

object 0, class 025_mug, z 0.7601078152656555, z new 0.8151350021362305
object 1, class 003_cracker_box, z 0.9675762057304382, z new 1.0729904174804688
object 2, class 002_master_chef_can, z 0.7824445962905884, z new 0.7986501455307007
sdf 5599 points for object 0, class 13 025_mug
sdf 10896 points for object 1, class 1 003_cracker_box
sdf 8007 points for object 2, class 0 002_master_chef_can
sdf with 24502 points
sdf_matching_loss_kernel.cu: cudaCheckError() failed (cudaDeviceSynchronize): unspecified launch failure

It happens inside the function sdf_loss_cuda_forward() at line 276 in the sdf_matching_loss_kernel.cu file.

No idea what to look for / how to debug further though. Any help would be appreciated.
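
Since the failure is reported at cudaDeviceSynchronize, the faulting kernel may have been launched earlier than the line that prints the error. A minimal sketch of one way to narrow it down, assuming the standard CUDA_LAUNCH_BLOCKING environment variable and, optionally, a memory checker on the PATH (cuda-memcheck on older toolkits, compute-sanitizer from CUDA 11). The helper name build_debug_run is hypothetical and not part of the repository:

```python
import os


def build_debug_run(script_args, base_env=None, sanitizer=None):
    """Return (command, env) for a synchronous, optionally sanitized run.

    script_args: the normal command as a list,
                 e.g. ["./experiments/scripts/demo.sh"].
    base_env:    environment to start from (defaults to os.environ).
    sanitizer:   optional tool name to prefix the command with,
                 e.g. "compute-sanitizer" or "cuda-memcheck" (assumed installed).
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make kernel launches synchronous so an error is reported at the launch
    # that caused it instead of at a later synchronization point.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    command = list(script_args)
    if sanitizer is not None:
        command = [sanitizer] + command
    return command, env


if __name__ == "__main__":
    command, env = build_debug_run(["./experiments/scripts/demo.sh"])
    # Only print here; pass both values to subprocess.run(command, env=env)
    # to actually launch the run.
    print(command, "CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"])
```

The sanitizer prefix is most useful when it wraps the Python process directly rather than a shell script, and it slows the run down considerably, so it may be worth trying the environment variable alone first.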

namGGG commented 6 months ago

I'm stuck in the middle... Setup: RTX 3090 GPU, Ubuntu 20.04, CUDA 11.1, PyTorch 1.8.2 LTS

/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:3454: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  warnings.warn(
cudaGraphicsGLRegisterImage failed: 304
cudaGraphicsMapResources failed: 400
cudaGraphicsSubResourceGetMappedArray failed: 400
cudaMemcpy2DFromArray failed: 709
cudaGraphicsUnmapResources failed: 400
cudaGraphicsGLRegisterImage failed: 304
cudaGraphicsMapResources failed: 400
cudaGraphicsSubResourceGetMappedArray failed: 400
cudaMemcpy2DFromArray failed: 709
cudaGraphicsUnmapResources failed: 400
cudaGraphicsGLRegisterImage failed: 304
cudaGraphicsMapResources failed: 400
cudaGraphicsSubResourceGetMappedArray failed: 400
cudaMemcpy2DFromArray failed: 709
cudaGraphicsUnmapResources failed: 400
object 0, class 019_pitcher_base, z 0.5521460175514221, z new -0.2002505660057068
object 1, class 008_pudding_box, z 0.6852722764015198, z new -0.021693646907806396
object 2, class 002_master_chef_can, z 0.5711051225662231, z new -0.11909496784210205
object 3, class 052_extra_large_clamp, z 0.6500653028488159, z new 0.3591681122779846
object 4, class 011_banana, z 0.7000908255577087, z new 0.8758471608161926
sdf 0 points for object 0, class 10 019_pitcher_base, no refinement
sdf 0 points for object 1, class 6 008_pudding_box, no refinement
sdf 0 points for object 2, class 0 002_master_chef_can, no refinement
sdf 0 points for object 3, class 18 052_extra_large_clamp, no refinement
sdf 499 points for object 4, class 9 011_banana
sdf with 499 points
cudaCheckError() failed: unspecified launch failure
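
The log above has three things worth separating: numeric codes from the OpenGL interop calls, negative "z new" values, and objects with 0 sdf points. A small sketch for summarizing a saved log, assuming the printed numbers are cudaError_t values (numbering as used since CUDA 10.1) and assuming a non-positive "z new" is suspicious. The function name summarize and the regular expressions are illustrative, and nothing here establishes which of these findings causes the launch failure:

```python
import re

# Subset of the CUDA runtime cudaError_t enum (numbering since CUDA 10.1).
CUDA_ERRORS = {
    304: "cudaErrorOperatingSystem",
    400: "cudaErrorInvalidResourceHandle",
    700: "cudaErrorIllegalAddress",
    709: "cudaErrorContextIsDestroyed",
    719: "cudaErrorLaunchFailure",
}

# Line shapes taken from the log in this thread.
CALL_FAILED = re.compile(r"^(cuda\w+) failed: (\d+)$")
Z_LINE = re.compile(r"^object (\d+), class (\S+), z (\S+), z new (\S+)$")
SDF_LINE = re.compile(r"^sdf (\d+) points for object (\d+),")


def summarize(log_text):
    """Return a list of findings; repeated API failures are reported once."""
    findings = []
    seen = set()
    for raw in log_text.splitlines():
        line = raw.strip()
        m = CALL_FAILED.match(line)
        if m:
            call, code = m.group(1), int(m.group(2))
            if (call, code) not in seen:
                seen.add((call, code))
                name = CUDA_ERRORS.get(code, "unknown code")
                findings.append(f"{call}: {code} ({name})")
            continue
        m = Z_LINE.match(line)
        if m:
            z_new = float(m.group(4))
            if z_new <= 0:
                findings.append(
                    f"object {m.group(1)} ({m.group(2)}): "
                    f"z new = {z_new:.3f} is not positive"
                )
            continue
        m = SDF_LINE.match(line)
        if m and int(m.group(1)) == 0:
            findings.append(f"object {m.group(2)}: 0 sdf points")
    return findings
```

Run on this log, the summary would list the interop failures (304, 400, 709) before any sdf output, which suggests checking whether the OpenGL context is available to CUDA in this environment before looking at the sdf kernel itself.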

NVlabs / PoseCNN-PyTorch

cudaCheckError() failed: unspecified launch failure #28