NVlabs / stylegan2-ada-pytorch

StyleGAN2-ADA - Official PyTorch implementation
https://arxiv.org/abs/2006.06676

upfirdn2d_plugin Problem #39

Closed ghost closed 3 years ago

ghost commented 3 years ago

Describe the bug

Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

Please stop closing people's issues without a confirmed fix for this problem. The suggestion in #2 (comment) does not work, and that issue was closed without a confirmed fix.

Please be serious about it and let's work together on a fix instead of ignoring the problem and referring people to a closed topic that does not offer any solution to their problem.

We tried everything proposed. We also tried both CUDA 11.0 and 11.1, with different versions of PyTorch just in case. We are a team of 5 people and we all had the same problem on both Windows and Linux machines, and even in Google Colab, which tells me that this is more than just a configuration problem.

And no, %pip install ninja did not solve the problem on any of the machines we have in our lab. Also, using verbosity = 'full' does not seem to include any additional helpful information.

Desktop (please complete the following information):

Those are the two machines I used

Machine 1

Machine 2

nurpax commented 3 years ago

Ok, thanks for filing a separate bug. I’ll keep this one open. There are multiple different problems filed into separate bugs with comments about separate issues added into the same bug. So it gets messy.

Trying to use class AugmentPipe in my project. Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

Can you give a bit of detail about your project structure? How are you making use of AugmentPipe in your project?

Is unmodified stylegan2-ada-pytorch project working for you?

ghost commented 3 years ago

Hi @nurpax, please disregard the first line; it was written for something else. I updated my issue with more details.

nurpax commented 3 years ago

Just double checking: your version of stylegan2-ada-pytorch is unmodified and it still does not work?

If you run it in Docker, does it work then? Most users have no issue when running in Docker so you should check if that works and report here. (I understand some people don’t like using Docker but it’s good debug info to check if it works or not.)

Clearly one of the key problems with these custom extensions is that when something goes wrong in their build or first use, the error message throws away too much information about what exactly went wrong.

ghost commented 3 years ago

Yes, correct, I haven't made any changes to it. Just this morning I cleaned out my driver and did a fresh install, created a new Anaconda env and downloaded a fresh copy from this repo, but the same problem happens. I don't know why.

nurpax commented 3 years ago

I think you've done this step but I'm adding it here for completeness, even if it may sound like I'm just repeating the same thing over and over.

The simplest form of getting this error:

"Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!"

is when there's no ninja installed with pip install ninja or conda install ninja. The error message unfortunately doesn't give any indication that ninja is missing.

I'm mentioning this here as %pip install ninja from the bug description seems to refer to Colab.
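Since a missing ninja is easy to overlook, it can help to check from the very same Python interpreter (or notebook kernel) that runs the project. The following is a minimal sketch using only the standard library; the helper name find_ninja is hypothetical and not part of the repository.

```python
import shutil
import subprocess
import sys


def find_ninja():
    """Return (path, version) of the ninja executable on PATH, or (None, None)."""
    path = shutil.which("ninja")
    if path is None:
        return None, None
    try:
        # `ninja --version` prints the version number on stdout.
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
        version = result.stdout.strip() or None
    except (OSError, subprocess.SubprocessError):
        version = None
    return path, version


path, version = find_ninja()
print("Python executable:", sys.executable)
print("ninja path:", path)
print("ninja version:", version)
```

If the path prints as None inside a notebook but not in a terminal, the notebook kernel is probably running in a different environment than the one where ninja was installed. With PyTorch importable, torch.utils.cpp_extension.is_ninja_available() should give the same answer from PyTorch's point of view.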

Also: can you please confirm that it works for you in Docker?

ghost commented 3 years ago

I actually tried both pip install ninja and conda install ninja, with the same outcome. As for Docker, no, I haven't tried it.

SofianeBenkara commented 3 years ago

@nurpax

I have been dealing with the same problem.

When I try to generate, it works but it is slow, and this is my output:

Loading networks from "../../Data/ffhq.pkl"...
Generating image for seed 8201 (0/1) ...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

Then, at the end, it does generate an image successfully.

For training and projecting, I get this:

Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Evaluating metrics...

It then gets stuck at Evaluating metrics... and then the kernel dies.

When I try to project, I get this:

Loading networks from "..\..\Data\ffhq.pkl"...
Computing W midpoint and stddev using 10000 samples...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metrics/vgg16.pt ... done
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

This continues for a few minutes, then the kernel dies as well.

I hope this helps. I have tried all the solutions proposed in the other issues and was not able to get this working. I have read on Reddit that other people are having the same problem, and no one is sure what the problem is.

nurpax commented 3 years ago

What seems to be happening is that either the extension build somehow fails or the built extension is not able to run somehow. The PyTorch code will then try to fall back to a reference implementation that is slower. It looks like this fallback mechanism is not working all too well, as it's trying to build on every invocation. This probably explains why it's so super slow.

I'd prefer if we'd find a real fix for this, of course, but here's one thing you could try. You could force the custom ops to always use the slower reference path. This will be slower but it should work.

I haven't tried this in a while, but I think you can force the reference implementation by editing the below function (and all the other similar _init functions in that folder):

https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/torch_utils/ops/bias_act.py#L41

def _init():
    global _inited, _plugin
    if not _inited:
        _inited = True
        sources = ['bias_act.cpp', 'bias_act.cu']
        sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
        try:
            _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
        except:
            warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
    return _plugin is not None

to just:

def _init():
    return False

@SBenkara is your repro on Docker or native installation of PyTorch and CUDA? What about the folks on Reddit?

SofianeBenkara commented 3 years ago

@nurpax I haven't tried Linux or Docker. I am using Windows 10 with an RTX 3090 GPU and a native installation of PyTorch and CUDA 11.1, and I followed all the steps in the README.

I will try to find the Reddit post and link it, but most people there were using Windows/Linux and I don't remember seeing any Docker-related issues.

Any idea why the extension build is failing? Are there any logs I can get that would help?

I will try to make the changes you suggested for now until we fix this issue.

SofianeBenkara commented 3 years ago

I have some updates that hopefully can help in pinpointing the problem.

I forgot to mention that I was using Jupyter notebook. I am not sure what difference it makes but I didn't have any of those issues when I tried using a command line or PyCharm, I just did a pip install and everything started working flawlessly.

The problem might be related to either the Jupyter notebook or Anaconda. I made sure to create more environments to make sure that was not a problem with my anaconda env, but they all failed.

So I made the changes you suggested. It printed fewer lines of Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!, but I still could not run a projection or training: the kernel kept dying, and I was still getting the same error from other parts of the code, mainly from \upfirdn2d.py.

No module named 'upfirdn2d_plugin'
  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Edit: after working just fine from my command line for a few minutes, it's now back to throwing the same error message without me making any changes.

UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools fir

No module named 'upfirdn2d_plugin'
  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
C:\Users\admin\Google Drive\stylegan2-ada-pytorch-main\stylegan2-ada-pytorch-main\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

nurpax commented 3 years ago

@SBenkara @DarXT3mpla4 can you try patching your stylegan2-ada-pytorch code as follows:

diff --git a/torch_utils/ops/bias_act.py b/torch_utils/ops/bias_act.py
index b092c7f..b6190f8 100755
--- a/torch_utils/ops/bias_act.py
+++ b/torch_utils/ops/bias_act.py
@@ -44,10 +44,7 @@ def _init():
         _inited = True
         sources = ['bias_act.cpp', 'bias_act.cu']
         sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
-        try:
-            _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
-        except:
-            warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
+        _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
     return _plugin is not None

 #----------------------------------------------------------------------------
diff --git a/torch_utils/ops/upfirdn2d.py b/torch_utils/ops/upfirdn2d.py
index f768b2c..76ac2d6 100755
--- a/torch_utils/ops/upfirdn2d.py
+++ b/torch_utils/ops/upfirdn2d.py
@@ -28,10 +28,7 @@ def _init():
     if not _inited:
         sources = ['upfirdn2d.cpp', 'upfirdn2d.cu']
         sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
-        try:
-            _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
-        except:
-            warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
+        _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
     return _plugin is not None

 def _parse_scaling(scaling):

I.e., remove try/excepts from around the custom_ops.get_plugin() call.

It looks like some exception info is getting lost with the way try/except is written. For example, if I rename my ninja executable in my anaconda3 dirs and rerun with this change, I get a more informative stacktrace. With some luck, maybe this will reveal some new information about the error you are seeing.

Generating image for seed 85 (0/4) ...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py:50: UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:

Ninja is required to load C++ extensions
  warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Traceback (most recent call last):
  File "generate.py", line 127, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "generate.py", line 119, in generate_images
    img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 491, in forward
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 463, in forward
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 397, in forward
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 291, in forward
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/misc.py", line 101, in decorator
    return fn(*args, **kwargs)
  File "<string>", line 72, in modulated_conv2d
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/misc.py", line 101, in decorator
    return fn(*args, **kwargs)
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/conv2d_resample.py", line 139, in conv2d_resample
    x = upfirdn2d.upfirdn2d(x=x, f=f, padding=[px0+pxt,px1+pxt,py0+pyt,py1+pyt], gain=up**2, flip_filter=flip_filter)
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 159, in upfirdn2d
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 31, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/home/janne/dev/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 986, in load
    return _jit_compile(
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1193, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1268, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/home/janne/anaconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1323, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
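The difference between the two styles of error reporting can be sketched in isolation. In the example below, build_plugin is a hypothetical stand-in for a failing custom_ops.get_plugin() call, and the chained ImportError is an illustrative assumption rather than a claim about what get_plugin does internally. It only shows that str(sys.exc_info()[1]) keeps the final message, while traceback.format_exc() keeps the whole chain.

```python
import sys
import traceback


def build_plugin():
    """Hypothetical stand-in for a plugin build that fails (not the real get_plugin)."""
    try:
        raise RuntimeError("nvcc fatal: Unsupported gpu architecture 'compute_86'")
    except RuntimeError as err:
        # A second error is raised on top of the first one; the first is kept as its cause.
        raise ImportError("No module named 'upfirdn2d_plugin'") from err


try:
    build_plugin()
except Exception:
    # Message of the last exception only, as in the original warnings.warn(...) call.
    message_only = str(sys.exc_info()[1])
    # Full traceback text, including the chained root cause.
    full_report = traceback.format_exc()

print("--- message only ---")
print(message_only)
print("--- full traceback ---")
print(full_report)
```

With the message-only form, the root cause never reaches the user, which matches the uninformative "No module named 'upfirdn2d_plugin'" detail seen earlier in this thread.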
SofianeBenkara commented 3 years ago

This is what I am getting now. Also, it just crashes without any output.

C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Error building extension 'upfirdn2d_plugin': [1/2] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=upfirdn2d_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --use_fast_math -c "C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.cu" -o upfirdn2d.cuda.o 
FAILED: upfirdn2d.cuda.o 
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=upfirdn2d_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --use_fast_math -c "C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.cu" -o upfirdn2d.cuda.o 
nvcc fatal   : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
C:\Users\admin\Google Drive\PyTorch\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

No module named 'upfirdn2d_plugin'
  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))

nurpax commented 3 years ago

@SBenkara I guess you left the warnings.warn line there? My patch above had that taken out too.

Nevertheless, the error is a little more apparent now (emphasis mine):

Error building extension 'upfirdn2d_plugin': [1/2] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\ v11.0 \bin nvcc fatal : Unsupported gpu architecture 'compute_86'

CUDA 11.0 does not support compiling for the compute_86 architecture; to build for compute_86, you need at least CUDA 11.1. You can see from the output above that it's building with the CUDA 11.0 nvcc.
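The version requirement can be checked mechanically against the text that nvcc --version prints. This is a small sketch under the assumption stated above (compute_86 needs CUDA 11.1 or newer); the function names are hypothetical, and the sample string follows the usual format of the "Cuda compilation tools" line.

```python
import re

# Minimum CUDA toolkit release for compute_86, per the comment above (CUDA 11.1).
MIN_RELEASE_FOR_COMPUTE_86 = (11, 1)


def parse_nvcc_release(version_output):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_compute_86(version_output):
    """True if the reported toolkit release is new enough for compute_86."""
    release = parse_nvcc_release(version_output)
    return release is not None and release >= MIN_RELEASE_FOR_COMPUTE_86


sample = "Cuda compilation tools, release 11.1, V11.1.105"
print(parse_nvcc_release(sample), supports_compute_86(sample))
```

Note that this only tells you about the nvcc whose output you feed in. The nvcc picked up by the extension build may be a different one, which is why the build command line itself is the more reliable evidence.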

To verify which compiler versions and flags are actually used, you can also check the build.ninja files under ~/.cache/torch_extensions/ (e.g., bias_act_plugin/build.ninja). I'm not sure where exactly this file resides on Windows. Please attach or copy&paste the full contents of one of these files here.
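Pulling the relevant settings out of those files can be scripted. The sketch below assumes the Linux cache location mentioned above and the plain "name = value" layout that build.ninja files use for top-level variables; the helper names are hypothetical.

```python
import re
from pathlib import Path


def read_ninja_settings(text):
    """Collect top-level `name = value` assignments from build.ninja text."""
    settings = {}
    for line in text.splitlines():
        # Indented lines (inside `rule` blocks) and `build ...:` lines do not match.
        match = re.match(r"^(\w+)\s*=\s*(.*)$", line)
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


def summarize(text):
    """Return the nvcc path and the -gencode flags recorded in build.ninja text."""
    settings = read_ninja_settings(text)
    arch_flags = re.findall(r"-gencode=\S+", settings.get("cuda_cflags", ""))
    return settings.get("nvcc"), arch_flags


# Assumed default cache location on Linux; it may differ on other setups.
cache_root = Path.home() / ".cache" / "torch_extensions"
if cache_root.is_dir():
    for ninja_file in sorted(cache_root.glob("**/build.ninja")):
        nvcc, arch_flags = summarize(ninja_file.read_text(errors="replace"))
        print(ninja_file, "| nvcc:", nvcc, "| arch:", arch_flags)
else:
    print("No cache directory found at", cache_root)
```

An nvcc path that points at an older toolkit than the architecture flags require is the mismatch to look for.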

SofianeBenkara commented 3 years ago

ninja_required_version = 1.3
cxx = cl
nvcc = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc

cflags = -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /EHsc
post_cflags = 
cuda_cflags = -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\TH -IC:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\admin\anaconda3\envs\ptx\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --use_fast_math
cuda_post_cflags = 
ldflags = /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\anaconda3\envs\ptx\libs /LIBPATH:C:\Users\admin\anaconda3\envs\ptx\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\lib/x64" cudart.lib

rule compile
  command = cl /showIncludes $cflags -c $in /Fo$out $post_cflags
  deps = msvc

rule cuda_compile
  command = $nvcc $cuda_cflags -c $in -o $out $cuda_post_cflags

rule link
  command = "C$:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\bin\Hostx64\x64/link.exe" $in /nologo $ldflags /out:$out

build bias_act.o: compile C$:\Users\admin\Google$ Drive\stylegan2-ada-pytorch-main\torch_utils\ops\bias_act.cpp
build bias_act.cuda.o: cuda_compile C$:\Users\admin\Google$ Drive\stylegan2-ada-pytorch-main\torch_utils\ops\bias_act.cu

build bias_act_plugin.pyd: link bias_act.o bias_act.cuda.o

default bias_act_plugin.pyd

nurpax commented 3 years ago

Yes, definitely confirms that 11.0 is being used instead of 11.1.

What you will need is to install the CUDA 11.1 toolkit from NVIDIA and make sure that you set it up so that the 11.1 version comes up first in PATH. E.g., try running "nvcc --version" and check that it's the right version. On my computer this reports something like this:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

SofianeBenkara commented 3 years ago

my nvcc --version returns

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:12:04_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.relgpu_drvr455TC455_06.29069683_0

All my environment variables are pointing to cuda_11.1.

I don't understand where 11.0 is coming from. I used it before but then switched to 11.1.

I deleted the bias_act_plugin/build.ninja and tried again and indeed it shows 11.0

I will keep you posted

SofianeBenkara commented 3 years ago

@nurpax you were 100% right. Even though my nvcc --version was returning 11.1, somehow 11.0 was being used.

I had both versions installed on my computer, but my environment was only pointing to 11.1. After uninstalling the 11.0 version and rebooting the computer, everything is working great, without any issue!

Thank you so much!

nurpax commented 3 years ago

I pushed change https://github.com/NVlabs/stylegan2-ada-pytorch/commit/25063950fc72acaa9740ba936379b36df1c5e456 that improves error reporting. Hopefully custom extension build errors get correctly reported now and root causing these problems will be easier.

ghost commented 3 years ago

@nurpax I had three versions of CUDA installed (10, 11.1 and 11.2), as I was using them for other projects. I had to delete the other versions to make the project work. Thanks for helping. I wonder if it has anything to do with this.

nurpax commented 3 years ago

Great!

I wonder if it has anything to do with this

I can’t tell without seeing logs with exception info or build.ninja files for failed attempts.

At least in SBenkara’s case, a wrong version of nvcc was chosen. I assume there were multiple CUDA versions in PATH. I don’t know if there are bugs in CUDA tools discovery code in PyTorch.
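To see whether several CUDA toolkits are competing, one can list every nvcc reachable through PATH in search order, along with the environment variables that commonly point at a CUDA install. This is a minimal sketch; find_all_on_path is a hypothetical helper, and which of these variables a given tool actually honors is an assumption that depends on the tool and its version.

```python
import os
from pathlib import Path


def find_all_on_path(names, path_value):
    """Return every match for `names` found in the PATH entries, in search order."""
    hits = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        for name in names:
            candidate = Path(directory) / name
            if candidate.is_file():
                hits.append(str(candidate))
    return hits


# Environment variables commonly used to point at a CUDA install (assumption:
# not every tool reads both of them).
for var in ("CUDA_HOME", "CUDA_PATH"):
    print(var, "=", os.environ.get(var))

nvcc_names = ("nvcc", "nvcc.exe")
for hit in find_all_on_path(nvcc_names, os.environ.get("PATH", "")):
    print("nvcc on PATH:", hit)
```

More than one hit, or an environment variable that disagrees with the first hit, would be consistent with the situation described in this thread, where nvcc --version reported 11.1 while the build used 11.0.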

avshalomman commented 3 years ago

@nurpax Jumping on this thread since I think I'm experiencing something related, hope it's ok...

I'm training on Colab, using the following command: python train.py --outdir=training-runs --data=/dataset --gpus=1 --cfg=paper256 --mirror=1 --resume=ffhq256 --snap=1

The dataset contains only 10 photos, so I'm basically trying transfer learning with small data.

I encountered the "Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!" issue at first, which I resolved by installing ninja.

However, training speed is still super-slow, and the issue seems to be in the "Evaluating metrics" part. I did as you suggested and edited the init methods of the custom cuda ops, and it didn't help.

These are the evaluation stats: "metric": "fid50k_full", "total_time": 813.027854681015, "total_time_str": "13m 33s", which seems awfully slow for a 10-image dataset.

I'm running on Colab, CUDA version 11.2, T4 GPU.

Thanks in advance!

nurpax commented 3 years ago

@avshalomman Please file separate bugs for separate issues. You can try with --metrics=none; most likely it's computing metrics that's taking a long time for you.

Closing this bug as both plugin issues seem to have been resolved.

cunicode commented 3 years ago

Installing gcc on the Linux machine solved the "No module named 'upfirdn2d_plugin'" error for me.

Check if you have gcc with gcc --version. If not, install it with sudo apt install build-essential.

lucky7323 commented 3 years ago

It worked for me after just installing ninja: pip3 install ninja

metaphorz commented 3 years ago

I am running on a CentOS platform and got the stylegan2-ada-pytorch notebook to work fine except when it reaches the training stage "python train.py ....". I am getting errors for both bias_act_plugin and upfirdn2d_plugin. I have tried some of the suggestions here but wonder if there is a resolution? Perhaps I am not using the right version of CUDA or PyTorch? My PyTorch is 1.7.1. Here is where the errors and tracebacks begin:

Constructing networks...
starting G epochs: 0.0 starting G epochs: starting G epochs: 0.00.0 starting G epochs: 0.0
Resuming from "./pretrained/wikiart.pkl"
Setting up PyTorch plugin "bias_act_plugin"... Failed!
....deleted path...orch_utils/ops/bias_act.py:50: UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation.

thusinh1969 commented 3 years ago

Remove ~/.cache/torch_extensions/* if you have installed a new version of torch or torchvision or whatever in between two runs. Re-running train.py will rebuild those plugins.

Took me a couple of hours!

Steve

youngjae-git commented 3 years ago

Remove ~/.cache/torch_extensions/* if you have installed some new version of torch or torch vision or whatever in between 2 run. Re-run train.py will rebuild those plugins.

Took me a couple of hours!

Steve

Thank you, Steve! That finally solved the problem.

alirezag commented 3 years ago

Simply installing ninja solved this for me. I'm on cuda 11.1.

stossenbrink commented 3 years ago

Hope this helps someone: I solved this issue by installing nvidia-cuda-toolkit (via apt), then removing ninja from my pipenv and installing it again. After restarting my Jupyter Python kernel, the modules were built.

lennysunreal commented 3 years ago

Hope this helps someone: I solved this issue by installing nvidia-cuda-toolkit (via apt), removed ninja from my pipenv and installed it again. After restarting my jupyter python kernel, the modules where built.

Sorry, I'm a complete noob. How do I uninstall and reinstall ninja? Also, when you say "installing nvidia-cuda-toolkit (via apt)", do you mean just download the latest Windows toolkit exe and install it, or do you mean install it via the command line in PowerShell?

alirezag commented 3 years ago

Are you familiar with pip? pip uninstall ninja should do it.

darrelfrancis commented 3 years ago

Summary of steps I carried out that worked

  1. pip uninstall ninja
  2. pip install ninja
  3. rm -rf ~/.cache/torch_extensions/*

I actually think it is #3 that worked for me. Next time I ran the python code, it reported that it was installing those two extensions, and all went well.

Feywell commented 3 years ago

Remove ~/.cache/torch_extensions/* if you have installed some new version of torch or torch vision or whatever in between 2 run. Re-run train.py will rebuild those plugins.

Took me a couple of hours!

Steve

Thank you! It is the truth.

colt18 commented 3 years ago

Summary of steps I carried out that worked

1. pip uninstall ninja

2. pip install ninja

3. rm -rf ~/.cache/torch_extensions/*

I actually think it is #3 that worked for me. Next time I ran the python code, it reported that it was installing those two extensions, and all went well.

Can I get the Windows path for "~/.cache/torch_extensions/*"?

nurpax commented 3 years ago

Try C:\Users\<username>\AppData\Local\torch_extensions\torch_extensions\Cache.
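The cache-clearing step discussed above can be scripted in a cross-platform way. This is a hedged sketch: the two default locations are simply the Linux and Windows paths quoted in this thread, TORCH_EXTENSIONS_DIR is the override variable read by PyTorch's extension loader, and the helper names are hypothetical. It defaults to a dry run so that nothing is deleted by accident.

```python
import os
import shutil
from pathlib import Path


def candidate_cache_dirs(env=os.environ, home=None):
    """List places where the torch_extensions build cache may live."""
    home = Path(home) if home is not None else Path.home()
    candidates = []
    if env.get("TORCH_EXTENSIONS_DIR"):  # explicit override, if set
        candidates.append(Path(env["TORCH_EXTENSIONS_DIR"]))
    # Linux path quoted in this thread.
    candidates.append(home / ".cache" / "torch_extensions")
    # Windows path quoted in this thread.
    candidates.append(
        home / "AppData" / "Local" / "torch_extensions" / "torch_extensions" / "Cache"
    )
    return candidates


def clear_cache(cache_dir, dry_run=True):
    """Delete each entry in cache_dir; with dry_run=True only report the entries."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    entries = sorted(cache_dir.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return [str(entry) for entry in entries]


for directory in candidate_cache_dirs():
    print(directory, "->", clear_cache(directory, dry_run=True))
```

Calling clear_cache(directory, dry_run=False) performs the actual deletion; the plugins should then be rebuilt on the next run, as described in the comments above.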

zzningxp commented 2 years ago

My problem is: when I use ONE GPU to train, there aren't any problems. When I use TWO GPUs to train, I get the problem below. I have tried the methods above, but none of them worked. Ubuntu 18.04, nvcc = 10.1, V10.1.105

Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
/home//stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "/home//stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/home//stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1091, in load
    keep_intermediates=keep_intermediates)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1302, in _jit_compile
    is_standalone=is_standalone)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1378, in _write_ninja_file_and_build_library
    check_compiler_abi_compatibility(compiler)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 282, in check_compiler_abi_compatibility
    if not check_compiler_ok_for_platform(compiler):
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 242, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/miniconda3/envs/py37torch17/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc()) 

NickDienemann commented 1 year ago

Remove ~/.cache/torch_extensions/* if you have installed a new version of torch, torchvision, or anything else between two runs.

I have been working with StyleGAN2-ADA for a couple of weeks and everything worked perfectly fine. However, this morning, the upfirdn2d_plugin was not able to build the CUDA kernels anymore and got stuck at the "Setting up PyTorch plugin "upfirdn2d_plugin"... " prompt.

Deleting the cache files as Steve @thusinh1969 has proposed fixed the issue, thanks a lot :)

I am very confused, though, about how this problem arises after having a working implementation and not changing anything. Maybe the build fails with a low probability, and when it fails, it causes subsequent builds to fail as well?

MeieiShaw commented 1 year ago

I do not know if this will help or not, but I added my PyTorch version to the version check in the file conv2d_gradfix.py:

def _should_use_custom_op(input):
    assert isinstance(input, torch.Tensor)
    if (not enabled) or (not torch.backends.cudnn.enabled):
        return False
    if input.device.type != 'cuda':
        return False
    # Replace the last entry with the prefix of your installed PyTorch version, e.g. '1.10.'.
    if any(torch.__version__.startswith(x) for x in ['1.7.', '1.8.', '1.9', 'YOUR OWN VERSION HERE']):
        return True
    warnings.warn(f'conv2d_gradfix not supported on PyTorch {torch.__version__}. Falling back to torch.nn.functional.conv2d().')
    return False  
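
One thing to watch when adding your own entry: the check is a plain prefix match via str.startswith, so a short prefix can match more versions than intended. A small illustration follows. The helper name and the sample version string are made up for the example, and it says nothing about whether the custom op actually works on a given PyTorch release.

```python
def matches_any_prefix(version, prefixes):
    """Mimic the startswith-based version check from the snippet above."""
    return any(version.startswith(prefix) for prefix in prefixes)


# '1.1' without a trailing dot also matches 1.10.x, 1.11.x, and so on.
print(matches_any_prefix('1.10.0+cu111', ['1.1']))    # True
# With the trailing dot, only the 1.1.x series matches.
print(matches_any_prefix('1.10.0+cu111', ['1.1.']))   # False
# An explicit prefix for the series you actually run is the least surprising.
print(matches_any_prefix('1.10.0+cu111', ['1.10.']))  # True
```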

mhgenerate commented 1 year ago

Hopefully this can help someone, but this is how I fixed my error.

First, I deleted the cached plugin folders: torch_extensions/cache/bias_act_plugin and upfirdn2d_plugin.

I had multiple CUDA toolkits in PATH (11.2 and 11.8). I had to delete 11.8, then ran the code again and it worked perfectly. It might be different for you, but if you have multiple CUDA paths, that could be the issue.
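
A quick way to check whether you are in the same situation is to list every nvcc that is reachable through PATH. This is only a sketch: `find_nvcc_on_path` is a made-up helper, and it only looks for files literally named nvcc or nvcc.exe.

```python
import os
from pathlib import Path


def find_nvcc_on_path(path_value=None):
    """Return every nvcc executable found in the PATH entries, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        if not entry:
            continue
        for name in ("nvcc", "nvcc.exe"):
            candidate = Path(entry) / name
            if candidate.is_file() and candidate not in hits:
                hits.append(candidate)
    return hits


for nvcc in find_nvcc_on_path():
    print(nvcc)
```

More than one line of output means more than one toolkit is visible; the first one listed is the one a plain `nvcc` call would pick up.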

OP: https://github.com/NVlabs/stylegan2-ada-pytorch/issues/67#issuecomment-798766230

philadias commented 1 year ago

In case it helps someone, in my case I actually just had to run it twice for things to work.

I ended up here while trying to configure a project that builds upon this repo (https://github.com/voletiv/mcvd-pytorch), and kept hitting the same old torch_extensions/[...]/upfirdn2d.so: cannot open shared object file: No such file or directory.

In my case, the fix (or maybe more of a workaround?) was to run it twice. The first time it would throw the error, but the .so was actually generated in the folder, so the second run worked fine. Since I'm running with multiple parallel devices (GPUs), my takeaway is that during the first run the lack of synchronization led some worker to look for the .so file while it was still being generated. From the second run onwards, all workers are able to find it.
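
If the race described above really is the cause, one common workaround is to let a single worker build the extension first and make the others wait. Below is a minimal sketch of that idea, not something this repo is confirmed to do. `build_rank0_first`, `barrier`, and `build_fn` are hypothetical names; in a real multi-GPU run, `barrier` could be `torch.distributed.barrier` and `build_fn` whatever call triggers the plugin build.

```python
def build_rank0_first(rank, barrier, build_fn):
    """Let rank 0 compile (and populate the cache) before the other ranks load it."""
    if rank == 0:
        build_fn()   # first build: compiles and writes the shared library
    barrier()        # every rank waits here until rank 0 has finished building
    if rank != 0:
        build_fn()   # later ranks should find the already-built library
```

The point of the ordering is that no worker tries to open the .so while another one is still writing it.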

LiangSylar commented 1 year ago

Try !pip install ninja==1.10.2 instead of !pip install ninja. This solves the problem for me. I had the same issue before, but specifying the ninja version definitely solved the problem in my case.
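
To see which ninja the build would actually pick up before and after pinning, something like the following works. It is a small sketch: `ninja_version` is a made-up helper that simply runs `ninja --version` if a ninja executable is on PATH.

```python
import shutil
import subprocess


def ninja_version():
    """Return the version string printed by `ninja --version`, or None if unavailable."""
    exe = shutil.which("ninja")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


print("ninja:", ninja_version() or "not found on PATH")
```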

xingyouxin commented 11 months ago

@nurpax

I have been dealing with the same problem.

when I try to generate it works fine but it is slow and this is my output

Loading networks from "../../Data/ffhq.pkl"...
Generating image for seed 8201 (0/1) ...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

then at the end it does generate an image successfully,

for training and projecting, I get this

Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
...
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Evaluating metrics...

It then gets stuck at Evaluating metrics... and then the kernel dies.

when I try to project I get this

Loading networks from "..\..\Data\ffhq.pkl"...
Computing W midpoint and stddev using 10000 samples...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metrics/vgg16.pt ... done
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
..
..
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

This continues for a few minutes, then the kernel dies as well.

I hope this helps. I have tried all the solutions proposed in the other open issues and was not able to get this working. I have read that other people are having the same problem on Reddit, and no one is sure what the problem is.

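I ran into the same problem. I am on Linux, and what finally fixed it was:

  1. Completely uninstall everything CUDA-related;
  2. Install the same CUDA version both inside the virtual environment (regular user) and outside it (root user);
  3. Make sure the CUDA version meets the GPU's requirements; for example, I use an RTX 4090 with CUDA 11.7;
  4. Note: outside the virtual environment, install the NVIDIA driver first (the CUDA version reported by nvidia-smi does not conflict with the CUDA version installed afterwards: one is the driver's CUDA version, the other is the runtime CUDA version); any recent driver that meets the system requirements will do. Then install CUDA, skipping the driver component during installation. Inside the virtual environment, use the install command for the matching CUDA version from the PyTorch website.

This approach runs successfully for me. As for why "Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!" appears, my understanding is that the CUDA versions inside and outside the virtual environment did not match. Even after the external CUDA was cleanly uninstalled, the same error still appeared, so running CUDA inside the virtual environment may still involve calls to the external CUDA. I therefore took the blunt approach that solves the problem: make sure the external and internal CUDA versions are the same.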
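
The advice above boils down to keeping the system CUDA toolkit and the CUDA version PyTorch was built against in step. A rough way to compare the two is sketched below. The helper names are made up, the sample text only imitates the usual "release X.Y" line of `nvcc --version`, and comparing major.minor only is my own simplification.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract 'X.Y' from text such as 'Cuda compilation tools, release 11.7, V11.7.99'."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_cuda_release(nvcc_output, torch_cuda):
    """Compare the toolkit release with PyTorch's CUDA version (major.minor only)."""
    toolkit = parse_nvcc_release(nvcc_output)
    if toolkit is None or not torch_cuda:
        return False
    return toolkit == ".".join(torch_cuda.split(".")[:2])


sample = "Cuda compilation tools, release 11.7, V11.7.99"  # illustrative text, not a real log
print(parse_nvcc_release(sample))         # 11.7
print(same_cuda_release(sample, "11.7"))  # True
print(same_cuda_release(sample, "12.1"))  # False
```

In practice you would pass the real output of `nvcc --version` as the first argument and `torch.version.cuda` as the second.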

yusufbtanriverdi commented 10 months ago

Speaking from the future, I have this problem with:

CUDA 12.1, Windows 10, Python 3.9, PyTorch 2.1.0+cu121

I will change the CUDA version to see if I can make it work.

meyurtsever commented 10 months ago

After reverting to Ubuntu 20.04 LTS, I managed to make it work without any problems: CUDA 11.2, Nvidia driver 460.27.04, torch 1.10.0+cu111, torchvision 0.11.1+cu111, Python 3.7.

I also applied the changes in this PR: https://github.com/NVlabs/stylegan2-ada-pytorch/pull/197

For installing Nvidia drivers and CUDA, I followed this: https://yakhyo.medium.com/cuda-11-2-installation-on-ubuntu-20-04-e83f7561ccc1

aA13142968398 commented 4 weeks ago

What seems to be happening is that either the extension build somehow fails or the built extension is not able to run somehow. The PyTorch code will then try to fall back to a reference implementation that is slower. It looks like this fallback mechanism is not working all too well, as it's trying to build on every invocation. This probably explains why it's so super slow.

I'd prefer if we'd find a real fix for this, of course, but here's one thing you could try. You could force the custom ops to always use the slower reference path. This will be slower but it should work.

I haven't tried this in a while, but I think you can force the reference implementation by editing the below function (and all the other similar _init functions in that folder):

https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/torch_utils/ops/bias_act.py#L41

def _init():
    global _inited, _plugin
    if not _inited:
        _inited = True
        sources = ['bias_act.cpp', 'bias_act.cu']
        sources = [os.path.join(os.path.dirname(__file__), s) for s in sources]
        try:
            _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
        except:
            warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
    return _plugin is not None

to just:

def _init():
    return False

@SBenkara is your repro on Docker or native installation of PyTorch and CUDA? What about the folks on Reddit?

It works! Maybe this means "if you can't use it, you close it"