Unable to execute the program, error during cpp_extension.py running at start

RiversJohn commented 2 years ago

(dmodel) D:\NVIDIATools\nvdiffmodeling-main>python train.py --config configs/spot.json
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5'
Using C:\Users\redacted\AppData\Local\torch_extensions\torch_extensions\Cache\py36_cpu as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file C:\Users\redacted\AppData\Local\torch_extensions\torch_extensions\Cache\py36_cpu\renderutils_plugin\build.ninja...
Traceback (most recent call last):
  File "train.py", line 20, in <module>
    import src.renderutils as ru
  File "D:\NVIDIATools\nvdiffmodeling-main\src\renderutils\__init__.py", line 9, in <module>
    from .ops import xfm_points, xfm_vectors, image_loss, prepare_shading_normal, lambert, pbr_specular, pbr_bsdf, _fresnel_shlick, _ndf_ggx, _lambda_ggx, _masking_smith
  File "D:\NVIDIATools\nvdiffmodeling-main\src\renderutils\ops.py", line 61, in <module>
    torch.utils.cpp_extension.load(name='renderutils_plugin', sources=source_paths, extra_ldflags=ldflags, with_cuda=True, verbose=True)
  File "C:\Users\redacted\anaconda3\envs\dmodel\lib\site-packages\torch\utils\cpp_extension.py", line 1136, in load
    keep_intermediates=keep_intermediates)
  File "C:\Users\redacted\anaconda3\envs\dmodel\lib\site-packages\torch\utils\cpp_extension.py", line 1347, in _jit_compile
    is_standalone=is_standalone)
  File "C:\Users\redacted\anaconda3\envs\dmodel\lib\site-packages\torch\utils\cpp_extension.py", line 1445, in _write_ninja_file_and_build_library
    is_standalone=is_standalone)
  File "C:\Users\redacted\anaconda3\envs\dmodel\lib\site-packages\torch\utils\cpp_extension.py", line 1834, in _write_ninja_file_to_build_library
    cuda_flags = common_cflags + COMMON_NVCC_FLAGS + _get_cuda_arch_flags()
  File "C:\Users\redacted\anaconda3\envs\dmodel\lib\site-packages\torch\utils\cpp_extension.py", line 1606, in _get_cuda_arch_flags
    arch_list[-1] += '+PTX'
IndexError: list index out of range

(dmodel) D:\NVIDIATools\nvdiffmodeling-main>

I installed the prerequisites as instructed; the only difference is that I'm using CUDA 11.5 (and I changed the version in the pip install line to match). I get the error above when trying to run the spot cow training example.

Am I doing something wrong, is this a version incompatibility, or something else entirely?

1LOVESJohnny commented 2 years ago

Hi @RiversJohn, I met exactly the same problem as you did. Here is my solution.

First, I checked the cuDNN installation in my environment and found that cuDNN was not installed at all. I downloaded the cuDNN build matching Windows + CUDA 11.5 from the NVIDIA developer site and copied its files into the corresponding CUDA directories. I then made sure the cuDNN installation was successful by following the first answer.

Next, I re-activated the conda environment 'dmodel'. If it is still not working, check which PyTorch build is installed. When CUDA is not set up correctly, the installer automatically picks a CPU-only version of PyTorch. Uninstall the CPU version with:

conda uninstall pytorch

This may take a while. Afterwards you can re-install the GPU version of PyTorch. Your CUDA version is the latest, 11.5 (same as mine), but PyTorch has not caught up with it yet, so a CUDA 11.3 build of PyTorch should be compatible:

conda install pytorch torchvision cudatoolkit=11.3 -c pytorch

Then you may have a perfect environment. That's all. Hope it helps :)

JHnvidia commented 2 years ago

Hi,

I haven't seen this error before, but to me it looks like a clash between CUDA and PyTorch. The cpp_extension.py file is part of PyTorch's functionality for compiling C plugins, which makes me suspect that it can't find your CUDA installation. As mentioned above, PyTorch currently supports CUDA up to 11.3, so it might be worth downgrading CUDA if you can't get things working.
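One plausible reading of the traceback, assuming the installed PyTorch is a CPU-only build (the `py36_cpu` cache folder in the log points that way): the list of GPU architectures ends up empty because no CUDA device is visible, so `arch_list[-1]` has nothing to index. The sketch below is a simplified illustration of that failure mode, not the actual PyTorch source; `arch_flags_sketch` is a hypothetical name used only here.

```python
# Simplified illustration only; NOT the real torch.utils.cpp_extension code.
def arch_flags_sketch(device_capabilities):
    """Build nvcc -gencode flags from a list of (major, minor) capabilities."""
    arch_list = [f"{major}.{minor}" for major, minor in sorted(device_capabilities)]
    # With zero visible CUDA devices the list is empty, so the next line
    # raises IndexError, like the last frame of the traceback above.
    arch_list[-1] += "+PTX"
    flags = []
    for arch in arch_list:
        num = arch.replace("+PTX", "").replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if arch.endswith("+PTX"):
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags


# One visible GPU with compute capability 8.6: flags are produced normally.
print(arch_flags_sketch([(8, 6)]))

# No visible GPU (for example a CPU-only PyTorch build): same IndexError.
try:
    arch_flags_sketch([])
except IndexError as exc:
    print("IndexError:", exc)
```

If this reading is right, the fix is to get PyTorch to see a CUDA device at all, which is what the reinstall suggested in this thread aims at.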

Also note the additional message No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5', which could indicate that PyTorch is trying to use the 11.5 CUDA path.

I would suggest creating a fresh Anaconda environment and installing only PyTorch:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

Then test that it installed correctly. The output should look similar to this (you can also run the same check in your current environment):

c:\> python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.__version__
'1.10.1+cu113'
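The same check can be put in a small script that also reports which CUDA version PyTorch was built against and how many devices it sees. This is a minimal sketch, not part of nvdiffmodeling; `describe_torch_build` is a hypothetical helper name, and the messages it returns are only suggestions.

```python
import importlib.util


def describe_torch_build(build_cuda, cuda_available, device_count):
    """Turn three PyTorch properties into a short diagnosis string.

    build_cuda is torch.version.cuda (None on CPU-only builds).
    """
    if build_cuda is None:
        return "CPU-only PyTorch build: reinstall a CUDA-enabled build"
    if not cuda_available or device_count == 0:
        return f"PyTorch built for CUDA {build_cuda}, but no usable GPU/driver was found"
    return f"OK: CUDA {build_cuda} build with {device_count} visible device(s)"


if importlib.util.find_spec("torch") is None:
    print("PyTorch is not installed in this environment")
else:
    import torch

    print("torch", torch.__version__)
    print(
        describe_torch_build(
            torch.version.cuda,
            torch.cuda.is_available(),
            torch.cuda.device_count(),
        )
    )
```

Running it inside the activated conda environment should make it clear whether the problem is a CPU-only build or a GPU that PyTorch cannot reach.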

Once that is working, make sure the MS Visual C++ / CUDA installation requirements are met. I would recommend CUDA 11.3, but it should be possible to make it work with 11.5. It might be worthwhile to try running a smaller example using nvdiffrast: https://github.com/NVlabs/nvdiffrast. Nvdiffrast uses the same method of compiling C code, so once that is working you should have all the pieces in place.

Good luck, hope this helps.

RiversJohn commented 2 years ago

Thank you for the replies. I will likely have time to try this again next week and will report back after that. The solutions certainly match what I guessed was wrong.

NVlabs / nvdiffmodeling

Unable to execute the program, error during cpp_extension.py running at start #12