facebookresearch / pytorch3d

PyTorch3D is FAIR's library of reusable components for deep learning with 3D data
https://pytorch3d.org/

Unable to compile and install with CUDA 11.7 on Windows - undefined identifier errors from nvcc #1593

Closed walt-jones closed 1 year ago

walt-jones commented 1 year ago

I'm unable to compile and install pytorch3d with CUDA on Windows. Seems to be a conflict or other issue with CUDA 11.7 itself somewhere in the chain of libraries.

With torch 2.0.1 and torchvision 0.15.2 installed without CUDA, pytorch3d builds and installs without issue. With the same versions installed with CUDA using "--index-url https://download.pytorch.org/whl/cu117" on pip (which itself compiles and installs without issue), building and installing pytorch3d (with ninja disabled so I can see what's really going on) throws the following errors from nvcc:

      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(418): error: identifier "__clusterGridDimInClusters" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(427): error: identifier "__clusterIdx" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(521): error: identifier "__clusterDimIsSpecified" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(526): error: identifier "__cluster_barrier_arrive" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(531): error: identifier "__cluster_barrier_wait" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(542): error: identifier "__cluster_query_shared_rank" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(553): error: identifier "__clusterRelativeBlockIdx" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(558): error: identifier "__clusterRelativeBlockRank" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(568): error: identifier "__clusterDim" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(573): error: identifier "__clusterSizeInBlocks" is undefined

All of the CUDA cooperative group libraries are coming from "conda install -c "nvidia/label/cuda-11.7.0" cuda" which torch and torchvision seem to be happy with. I've also installed the matching CUDA toolkit 11.7 from https://developer.nvidia.com/cuda-toolkit-archive.

I've gone through a pile of issues, suggestions and solutions and have even tried rebuilding the environment from scratch to no avail, including trying nearly everything mentioned in https://stackoverflow.com/questions/62304087/installing-pytorch3d-fails-with-anaconda-and-pip-on-windows-10.

Now at a total loss as to how to get pytorch3d compiling with CUDA 11.7 on this system. Any ideas?

bottler commented 1 year ago

I don't know. It looks from here like those failing functions require compute capability 9.0 or greater. Which compute capability (e.g. NVCC_ARCH) are you trying to build for? Perhaps you could use 8.0 even if you have a 9.0 GPU.

(I think nothing in PyTorch3D uses cooperative groups btw, but that doesn't stop the imports happening.)

walt-jones commented 1 year ago

Not sure which compute capability it’s building for - everything’s run as-is from https://download.pytorch.org/whl/cu117 without modification aside from disabling ninja. Is the compute capability set in the setup.py or somewhere else we have control over? I’m not near the build machine right now so can’t check myself.

bottler commented 1 year ago

By default, I think it will choose the compute capability of the GPU you have on the machine. To override, you could set

NVCC_FLAGS="-gencode=arch=compute_80,code=sm_80"

walt-jones commented 1 year ago

Right. The GPU on that machine is a T4, which is compute capability 7.5, so nvcc shouldn’t be trying to compile for 9.0. I’ll try running with the env variable to see if it makes a difference, or try on another machine with a GPU that’s at least compute capability 8.0.
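
For reference, the mapping from a GPU's compute capability to the flag suggested above can be sketched in a few lines of Python. This is only an illustration, not part of the PyTorch3D build: the helper names `gencode_flag` and `detect_capability` are made up here, and the detection step assumes a CUDA-enabled torch is importable in the build environment.

```python
from typing import Optional, Tuple


def gencode_flag(capability: Tuple[int, int]) -> str:
    """Format an nvcc -gencode flag for a (major, minor) compute capability."""
    major, minor = capability
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be queried."""
    try:
        import torch  # assumption: torch is installed in this environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible to torch; cannot detect compute capability.")
    else:
        # Print the Windows cmd line that would pin the build to this GPU.
        print(f"set NVCC_FLAGS={gencode_flag(cap)}")
```

For a T4 the capability is (7, 5), so `gencode_flag((7, 5))` gives the `compute_75`/`sm_75` form of the flag.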

walt-jones commented 1 year ago

Ran pip install after each of the following:

      set NVCC_FLAGS=-gencode=arch=compute_80,code=sm_80
      set NVCC_FLAGS=-gencode=arch=compute_75,code=sm_75

Still getting the same undefined identifier errors with both:

      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(418): error: identifier "__clusterGridDimInClusters" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(427): error: identifier "__clusterIdx" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(521): error: identifier "__clusterDimIsSpecified" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(526): error: identifier "__cluster_barrier_arrive" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(531): error: identifier "__cluster_barrier_wait" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(542): error: identifier "__cluster_query_shared_rank" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(553): error: identifier "__clusterRelativeBlockIdx" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(558): error: identifier "__clusterRelativeBlockRank" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(568): error: identifier "__clusterDim" is undefined
      C:\Users\walt\.conda\envs\emoca\include\cooperative_groups/details/helpers.h(573): error: identifier "__clusterSizeInBlocks" is undefined

walt-jones commented 1 year ago

On the suggestion of a colleague, I tried using the CUDA 11.8 libraries instead of 11.7, leaving the CUDA toolkit itself on 11.7, and pytorch3d compiled and installed without any issues. No more undefined identifier errors!
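
The thread does not pin down a root cause, but one plausible reading is that the nvcc being run and the CUDA headers in the conda environment came from different CUDA releases. Under that assumption, a quick check is to compare the release reported by `nvcc --version` with the `CUDA_VERSION` macro in the `cuda.h` that sits alongside the `cooperative_groups` headers. The sketch below only parses text; the function names and the sample strings are illustrative, not taken from the reporter's machine.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def nvcc_release(version_output: str) -> Optional[Version]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def header_release(cuda_h_text: str) -> Optional[Version]:
    """Extract (major, minor) from the CUDA_VERSION macro in cuda.h.

    CUDA_VERSION is encoded as major * 1000 + minor * 10, e.g. 11070 for 11.7.
    """
    match = re.search(
        r"^\s*#\s*define\s+CUDA_VERSION\s+(\d+)", cuda_h_text, re.MULTILINE
    )
    if match is None:
        return None
    value = int(match.group(1))
    return value // 1000, (value % 1000) // 10


def same_release(version_output: str, cuda_h_text: str) -> bool:
    """True only when both versions parse and agree on major and minor."""
    compiler = nvcc_release(version_output)
    headers = header_release(cuda_h_text)
    return compiler is not None and compiler == headers


if __name__ == "__main__":
    # Illustrative inputs: an 11.7 compiler paired with 11.8 headers.
    sample_nvcc = "Cuda compilation tools, release 11.7, V11.7.64"
    sample_header = "#define CUDA_VERSION 11080\n"
    print("compiler:", nvcc_release(sample_nvcc))
    print("headers: ", header_release(sample_header))
    print("match:   ", same_release(sample_nvcc, sample_header))
```

In practice the two inputs would come from running `nvcc --version` and reading the `cuda.h` from whichever include directory the build actually uses. A mismatch would not prove this was the cause here, but it would narrow things down.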