dorarad / gansformer

Generative Adversarial Transformers
MIT License

kernel error in generate.py #5

Closed · yaseryacoob closed this issue 3 years ago

yaseryacoob commented 3 years ago

In a Python 3.7 environment with tensorflow-gpu==1.15.0, CUDA 10.0, and cuDNN 7.5, I get the error below in generate.py (which appeared to require cuDNN 7.6.5, which in turn produces a different error; see the second part). Any advice?

... Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file

........... Total 35894608

Generate images... 0%| | 0/8 [00:01<?, ?image (1 batches of 8 images)/s]
Traceback (most recent call last):
  File "/vulcanscratch/yaser/miniconda3/envs/yygentransformer/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/vulcanscratch/yaser/miniconda3/envs/yygentransformer/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/vulcanscratch/yaser/miniconda3/envs/yygentransformer/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'FusedBiasAct' used by {{node Gs/_Run/Gs/G_mapping/AttLayer_0/FusedBiasAct}} with these attrs: [gain=1, T=DT_FLOAT, axis=1, alpha=0, grad=0, act=1]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'; T in [DT_HALF]
  device='GPU'; T in [DT_FLOAT]

     [[Gs/_Run/Gs/G_mapping/AttLayer_0/FusedBiasAct]]

cuDNN 7.6.5 error: ........... Total 35894608

Generate images... 0%| | 0/8 [00:01<?, ?image (1 batches of 8 images)/s]
Traceback (most recent call last):
  File "/vulcanscratch/yaser/miniconda3/envs/yygentransformer/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/vulcanscratch/yaser/miniconda3/envs/yygentransformer/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/vulcanscratch/yaser/miniconda3/envs/yygentransformer/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: cudaErrorNoKernelImageForDevice
    [[{{node Gs/_Run/Gs/G_mapping/global/Dense0_0/FusedBiasAct}}]]
    [[Gs/_Run/Gs/maps_out/_3151]]
  (1) Internal: cudaErrorNoKernelImageForDevice
    [[{{node Gs/_Run/Gs/G_mapping/global/Dense0_0/FusedBiasAct}}]]
0 successful operations. 0 derived errors ignored.
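For reference, the first "Could not load dynamic library 'libcudnn.so.7'" line can be probed outside TensorFlow; a minimal check of the dynamic loader (nothing repo-specific):

import ctypes

# If this raises OSError, the environment (e.g. LD_LIBRARY_PATH or the conda env)
# does not expose libcudnn.so.7 to the loader, independent of generate.py.
try:
    ctypes.CDLL("libcudnn.so.7")
    print("libcudnn.so.7 loads")
except OSError as err:
    print("libcudnn.so.7 not found by the loader:", err)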

dorarad commented 3 years ago

It seems that there's some issue in your CUDA configuration. When running the Python script, it begins by compiling custom operations such as FusedBiasAct. From your first error message, I suspect the compilation doesn't complete successfully.
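A quick way to narrow that down (a minimal sketch assuming tensorflow-gpu 1.15, not something from the repo) is to check whether TensorFlow registers a plain GPU device at all; your error lists only [CPU, XLA_CPU, XLA_GPU], which fits cuDNN failing to load:

# Sanity check for TF 1.15: is the build CUDA-enabled, and is a GPU device registered?
import tensorflow as tf
from tensorflow.python.client import device_lib

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Local devices:", [d.device_type for d in device_lib.list_local_devices()])
# If 'GPU' is missing from the list, fix the CUDA/cuDNN setup before worrying
# about the custom-op compilation.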

I recommend looking into the CUDA section of the README and testing whether the following works for you:

nvcc test_nvcc.cu -o test_nvcc -run

Then try to run the following command (defined here):

# Path to your environment
x=/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow;

# Temp directory
t=/tmp/tmpzizydc04

nvcc "$x/python/_pywrap_tensorflow_internal.so" --compiler-options '-fPIC -D_GLIBCXX_USE_CXX11_ABI=0' 
--gpu-architecture=sm_70 --use_fast_math --disable-warnings --include-path "$x/include" 
--include-path "$x/include/external/protobuf_archive/src" --include-path "$x/include/external/com_google_absl" 
--include-path "$x/include/external/eigen_archive" 2>&1 "dnnlib/tflib/ops/fused_bias_act.cu" 
--shared -std=c++11 -DNDEBUG -o "$t/fused_bias_act_tmp.so" --keep --keep-dir "$t"
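If that compile goes through, one more quick check (a minimal sketch; the temp path below is just the example one above, so adjust it to yours) is to load the resulting .so into TensorFlow directly:

import tensorflow as tf

# Load the freshly compiled op library; an error here points at the compiled op
# itself rather than at the model or the pickle.
lib = tf.load_op_library("/tmp/tmpzizydc04/fused_bias_act_tmp.so")
print(lib)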

If these don't work, see: https://github.com/tensorflow/tensorflow/issues/36426 and https://stackoverflow.com/questions/62901027/cuda-error-209-cudalaunchkernel-returned-cudaerrornokernelimagefordevice for potential solutions for the issues you face.

Hope it helps!

yaseryacoob commented 3 years ago
  1. The CUDA paths were fine.
  2. The nvcc test went fine, and the compile command went fine as well.
  3. The last error is the most relevant, but I can't easily debug it; it is some incompatibility in CUDA, and I suspect it is a precompiled-binary issue. The network loads fine (in generate.py, showing all the layers). The failure happens at the "Generate images..." step.

I tried this on a few machines (with new installations in conda and otherwise), so I am bewildered.

dorarad commented 3 years ago

Hmm, I'm not sure what the source of the issue is, but to test that, could you please check whether StyleGAN2 works for you? The code here extends their codebase and the CUDA part is exactly the same, so I'm interested to see whether that works.

yaseryacoob commented 3 years ago

I got it to work, if by accident. My default gcc is 4.8.5, but I tried 6.3.0 and ninja, which produced a different error. So I went back to 4.8.5 but kept ninja, cleaned up the compiled library from dnnlib, and it all worked fine. It is not a CUDA problem; it is a gcc issue.
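For anyone hitting the same thing, the cleanup step was roughly the following (a sketch; the _cudacache location is the StyleGAN2-style default this repo builds on, so adjust the path if your checkout differs):

import glob
import os

# Remove the cached compiled CUDA ops so they get rebuilt with the currently
# selected gcc on the next run of generate.py.
for path in glob.glob("dnnlib/tflib/_cudacache/*.so"):
    print("removing", path)
    os.remove(path)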

When I ran your code on FFHQ, the visual quality of the images was poor, so I wonder how you used your code for FFHQ-type problems?

But at least I got past the bug. Thanks for the help!!

dorarad commented 3 years ago

Awesome, glad to hear it worked well! Did you train it from scratch or use the pre-trained network? The FID score should be 7.42; I verified that locally. The script should report the FID score that it gets.

(If it shows you a different number, let me know and I'll look into it.)

That should be the quality it gives you, while being trained for 5x fewer steps than the models in the StyleGAN2 repository: [image]

yaseryacoob commented 3 years ago

I used your pretrained model with your quickstart generate command and got no FID score; here are a few examples. There is also the issue of your model being 256x256, while I normally use 1Kx1K models (in PyTorch, from StyleGAN2 among others).

[sample images]

dorarad commented 3 years ago

Thanks a lot for letting me know. It looks like I put up the wrong pretrained model; I'll look into that, hopefully today and if not then over the next couple of days, and get back to you!

In terms of resolution, the reason I worked with 256x256 is that it needs less compute than 1K; I've done all the experiments at my university, so I don't have GPUs at the scale of a company like NVIDIA :)

At the same time, the implementation supports training at higher resolutions, and I've done experiments showing the model maintains good FID scores at higher resolutions too.

dorarad commented 3 years ago

Alright, I made some updates; it should be better now:

Changes:

Hope it helps! Let me know if you still face issues!