NVlabs / instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more
https://nvlabs.github.io/instant-ngp

cutlass_matmul.h:332 status failed with error Error Internal #1101

Closed. ookey closed this issue 1 year ago

ookey commented 1 year ago

Hello, thanks for your valuable work. When I set density_network's n_output_dims to 17 or above, I get the following error message:

ERROR    Uncaught exception: /matt_disk/3d/dev/instant-ngp/dependencies/tiny-cuda-nn/include/tiny-cuda-nn/cutlass_matmul.h:332 status failed with error Error Internal

My config.json:

{
    "parent": "small.json",
    "network": {
        "n_output_dims": 17
    }
}

As I understand it, going from 16 to 17 output dims is what switches on the use of cutlass_matmul's fc_multiply, but I can't figure out why it's failing.

Here is the call stack: (screenshot, not included)

sxs4337 commented 1 year ago

Same error here with a custom dataset. The fox example did work on my setup, though.

Update: for the custom dataset, it worked for me after rebuilding and regenerating the dataset (with COLMAP).

roey1rg commented 1 year ago

Same problem here (with the fox example), running on a GeForce 1060 (is it possible?)

RajaeeKh commented 1 year ago

Any solution?

RajaeeKh commented 1 year ago

Ok, I managed to solve it. The problem on my end happened because I built ngp inside a conda env. Since the conda env has its own separate version of CUDA (because of PyTorch), ngp tries to use it. The solution is to remove all conda env CUDA paths from $PATH before building ngp, so that ngp uses the OS's original version of CUDA.
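
For anyone who wants to try the PATH cleanup described above, here is a minimal sketch. It assumes the conda entries can be recognized by the substring "conda" in the path, which may not hold for every install, and `strip_conda_from_path` is a hypothetical helper name, not something shipped with instant-ngp.

```shell
# Hypothetical helper: print a PATH-style string with every entry that
# contains "conda" (case-insensitive) removed. Assumption: conda env
# directories have "conda" somewhere in their path.
strip_conda_from_path() {
    printf '%s' "$1" | tr ':' '\n' | grep -iv 'conda' | paste -sd ':' -
}

# Show what the cleaned PATH would look like without changing the current shell.
clean_path="$(strip_conda_from_path "$PATH")"
echo "$clean_path"

# To build with the cleaned PATH, something along these lines should work
# (check first that nvcc resolves to the OS-level CUDA install):
#   PATH="$clean_path" command -v nvcc
#   PATH="$clean_path" cmake -B build .
#   PATH="$clean_path" cmake --build build -j
```

Configuring into a fresh build directory is probably safer than reusing an old one, since CMake caches the compiler it found on the first run.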

roey1rg commented 1 year ago

I don't have conda on my system and I'm still having this problem...

Yanbin360 commented 1 year ago

Same issue. It might be related to aabb_scale in transform.json: the error happens when aabb_scale is set to 8 or higher.

Tom94 commented 1 year ago

Could you try again with the latest code from master / latest binaries?

ookey commented 1 year ago

Hello, it's working! Since I slightly changed my config, here is how I tested successfully today:

$ git log --oneline -n 1
a0090e4 (HEAD, origin/master, origin/HEAD) NeRF: fix broken training on some scenes
$ nvidia-smi
Mon Jan 16 10:03:19 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:15:00.0  On |                  N/A |
| 30%   30C    P8    31W / 350W |    877MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$ TCNN_CUDA_ARCHITECTURES=86 cmake -B allbuilds/lin-rel-86-cu118/ .
$ cmake --build allbuilds/lin-rel-86-cu118/ -j
$ cat configs/nerf/test.json 
{
    "parent": "small.json",
    "network": {
        "n_output_dims": 17
    }
}
$ ./allbuilds/lin-rel-86-cu118/instant-ngp --scene /data/captures/sony/renaud-stephane/ --config configs/nerf/test.json
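
The build above passes TCNN_CUDA_ARCHITECTURES=86 by hand. As a rough sketch, the value could also be derived from what the GPU reports. This assumes a driver recent enough for nvidia-smi to support the `compute_cap` query field, and `cap_to_arch` is a hypothetical helper name, not part of instant-ngp or tiny-cuda-nn.

```shell
# Hypothetical helper: turn a compute capability such as "8.6" into the
# digits-only form "86" by deleting dots and spaces.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '. '
}

# If nvidia-smi is present, query the first GPU and print the matching setting.
# Assumption: the installed driver supports the compute_cap query field.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)"
    if [ -n "$cap" ]; then
        echo "TCNN_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
    fi
fi
```

The printed value can then be placed in front of the cmake configure command, the same way 86 is used above.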

ookey commented 1 year ago

closing it

leonlenk commented 1 year ago

I tried nearly everything here to fix this bug on my Nvidia GPU, and maybe I didn't apply the suggestions correctly. In case someone finds this thread in the future, the fix that worked for me was this: I went into the header file cutlass_matmul.h (it's in the dependencies and header-file folders) and commented out line 332, where it calls the error thrower. Now it runs consistently. I guess whatever function is erroring out isn't actually used in the product and can probably be ignored, but just in case, use this as a last resort.

weijielyu commented 1 year ago

I also commented out the line (now it's 330) and the problem was solved. Thanks, leon! Is there a more proper way to do this?

renwuli commented 10 months ago

same issue here

Anji-Builds commented 6 months ago

@ookey what changes did you make in your config?