Open barikata1984 opened 4 months ago
I have looked into the issue and have some findings.
This error happens when you run the app with the following settings:
`tracer.raymarch-type`: `uniform`
`interactive`: `True`
The code that terminates the app is https://github.com/NVIDIAGameWorks/kaolin-wisp/blob/931707e50f1511fdb4af55eeb4aed4df23b7c2b1/wisp/csrc/ops/uniform_sample_cuda.cu#L95
While investigating the code, I noticed that `cudaGetLastError` sometimes returns a non-zero enum value, mostly 9, which means `cudaErrorInvalidConfiguration`, and then `AT_CUDA_CHECK` triggers the termination. A new finding is that the invalid configuration error is actually raised even with CUDA 11.7. Fortunately or not, `AT_CUDA_CHECK`, which is actually a wrapper of `C10_CUDA_CHECK`, does not handle the error code properly in PyTorch 1.* and does not terminate the app even when the error value is given. However, `C10_CUDA_CHECK` has been implemented differently since PyTorch 2.* and now terminates the app.
I also ran the app without the interactive viewer and noticed that the code runs to completion in that case. Combined with the above, maybe something goes wrong on the interactive viewer side, and the error is caught by the `AT_CUDA_CHECK` in `uniform_sample_cuda.cu`.
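One reason the failing check and the failing code can sit in different places: `cudaGetLastError` returns the most recent error recorded by the runtime in the calling host thread and then resets it, so an error left unchecked after an earlier launch is reported by whichever check runs next. The toy model below illustrates only that bookkeeping in plain Python. `FakeRuntime`, `launch`, and `get_last_error` are hypothetical names, not CUDA or kaolin-wisp APIs, and the sketch does not show where the error in this issue actually originates.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_CONFIGURATION = 9  # the value reported in this issue


class FakeRuntime:
    """Toy stand-in (hypothetical) for the runtime's 'last error' slot."""

    def __init__(self):
        self._last_error = CUDA_SUCCESS

    def launch(self, name, grid_size):
        # A launch with an empty grid is rejected. The failure is only
        # recorded; nothing is raised at the call site.
        if grid_size <= 0:
            self._last_error = CUDA_ERROR_INVALID_CONFIGURATION
        # A successful launch does not clear a previously recorded error.

    def get_last_error(self):
        # Return the recorded error and reset the slot to success.
        err, self._last_error = self._last_error, CUDA_SUCCESS
        return err


runtime = FakeRuntime()
runtime.launch("earlier_kernel", grid_size=0)  # fails, nobody checks
runtime.launch("later_kernel", grid_size=4)    # valid launch
first_check = runtime.get_last_error()         # check placed after the later launch
second_check = runtime.get_last_error()        # slot was reset by the first check
print(first_check, second_check)
```

In this model the check after `later_kernel` reports the error left behind by `earlier_kernel`, which is the kind of situation the guess about the interactive viewer would imply.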
As a workaround, `AT_CUDA_CHECK` can be commented out. As is the case with PyTorch 1.*, the app then works, though I am not sure that is a good state to be in.
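A less invasive alternative to removing the check might be to skip the sampling call when its inputs are empty, since the original post in this thread reports that the failure coincides with empty `ridx`, `pidx`, and `depth`. This is a minimal sketch under that assumption, not a verified fix. `call_unless_empty` is a hypothetical helper, not part of kaolin-wisp, and what `fallback` should be depends on what the calling code expects, which is not verified here.

```python
def call_unless_empty(sampler, *inputs, fallback=None):
    """Call sampler(*inputs) unless any input has zero elements.

    Hypothetical helper for illustration only.
    """
    for x in inputs:
        # Tensors expose numel(); plain sequences fall back to len().
        size = x.numel() if hasattr(x, "numel") else len(x)
        if size == 0:
            return fallback
    return sampler(*inputs)


# Usage with plain lists standing in for tensors:
def count_elements(a, b):
    return len(a) + len(b)


skipped = call_unless_empty(count_elements, [], [1])
called = call_unless_empty(count_elements, [1, 2], [3])
print(skipped, called)
```

Unlike commenting out the check, a guard of this kind would leave error checking in place for non-empty inputs.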
I may investigate the issue further, but I have no experience with GUI app coding at all. So, if someone could join in solving this issue, I would really appreciate it.
Hi, everyone
I am trying to switch my env from `torch1.13.1` and `cuda117` to `torch2.1.1` and `cuda121`. After installation, I trained a nerf with `--tracer.raymarch-type uniform`
but it failed with an error message like the one below:
I looked into `_raymarch_uniform` and found out that `uniform_sample_cuda` fails when `spc_render.unbatched_raytrace` returns empty tensors for `ridx`, `pidx`, and `depth`, as you can see in the earlier half of the error message. I also confirmed that `ridx`, `pidx`, and `depth` can be empty with `torch1.13.1` and `cuda117` too, although I did not get the error there. Besides, I got the error with `torch1.13.1` and `cuda118`. So, I believe that the behaviour of `uniform_sample_cuda` differs between `cuda117` and later versions. If I had experience in CUDA coding, I could debug the method, but I do not know how to write CUDA code right now. So, could anybody debug it?
Thanks in advance!
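A possible link between the empty tensors and error code 9, offered as an assumption rather than a confirmed diagnosis: many 1D kernel launches size their grid by ceil-division of the element count, and CUDA rejects a launch whose grid dimension is zero with `cudaErrorInvalidConfiguration`. The sketch below only reproduces that arithmetic in Python. Whether `uniform_sample_cuda.cu` computes its grid this way has not been confirmed, and `num_blocks`, `is_valid_grid`, and the value 512 are illustrative choices.

```python
def num_blocks(num_items, threads_per_block=512):
    """Ceil-division grid size, as commonly used for 1D kernel launches."""
    return (num_items + threads_per_block - 1) // threads_per_block


def is_valid_grid(blocks):
    # A grid dimension of zero is not a valid launch configuration.
    return blocks >= 1


for n in (0, 1, 512, 513):
    blocks = num_blocks(n)
    print(n, blocks, is_valid_grid(blocks))
```

With zero input elements the computed grid size is zero, so an empty `ridx` would lead to an invalid launch under this assumption.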