deepfakes / faceswap-playground

User dedicated repo for the faceswap project

Convolution algorithm (CUDNN) fails in convert.py, Cuda10+RTX1070 #241

Closed nkdsoft closed 5 years ago

nkdsoft commented 5 years ago

I am trying to run faceswap with TF v1.12 compiled with Cuda 10 & CUDNN 7.4.2. I am using a Geforce RTX 2070 GPU.

At first I was not able to train (cuDNN error), but after some investigation I discovered that with the --ag flag (set_tf_allow_growth) training worked fine.

Now the problem is that I cannot run convert.py. I get a similar error, but this script does not have the --ag flag. This is the error I get:

 File "/home/f/faceswap/lib/cli.py", line 90, in execute_script
    process.process()

  File "/home/f/faceswap/scripts/convert.py", line 61, in process
    self.convert(converter, item)

  File "/home/f/faceswap/scripts/convert.py", line 208, in convert
    image = self.convert_one_face(converter, image, face)

  File "/home/f/faceswap/scripts/convert.py", line 224, in convert_one_face
    size)

  File "/home/f/faceswap/plugins/convert/Convert_Masked.py", line 56, in patch_image
    new_face = self.get_new_face(image, mat, size)

  File "/home/f/faceswap/plugins/convert/Convert_Masked.py", line 179, in get_new_face
    new_face = self.encoder(normalized_face)[0]

  File "/home/f/faceswap/plugins/model/Model_Original/Model.py", line 34, in <lambda>
    return lambda img: autoencoder.predict(img)

  File "/home/f/anaconda3/envs/fs/lib/python3.6/site-packages/keras/engine/training.py", line 1169, in predict
    steps=steps)

  File "/home/f/anaconda3/envs/fs/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 294, in predict_loop
    batch_outs = f(ins_batch)

  File "/home/f/anaconda3/envs/fs/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)

  File "/home/f/anaconda3/envs/fs/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)

  File "/home/f/anaconda3/envs/fs/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)

  File "/home/f/anaconda3/envs/fs/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

     [[{{node model_1/conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/conv2d_1/convolution-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]

Looks like a Keras problem, maybe... any ideas?
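In the meantime, since convert.py has no --ag flag, one could wire up an equivalent flag by hand. This is a hypothetical sketch, not faceswap's actual CLI code (its real argument handling lives in lib/cli.py and may name things differently); the parser and function names here are illustrative:

```python
import argparse


def build_parser():
    """Build a parser with an --ag flag like the training script's.

    Hypothetical sketch: faceswap's real parser may differ.
    """
    parser = argparse.ArgumentParser(
        description="convert with optional allow_growth")
    parser.add_argument("-ag", "--allow-growth",
                        dest="allow_growth",
                        action="store_true",
                        help="Enable allow_growth on the TF session "
                             "(helps on RTX cards)")
    return parser


def configure_session(allow_growth):
    """Apply allow_growth to the default Keras/TF session (TF 1.x API)."""
    if not allow_growth:
        return
    # Imported lazily so flag parsing stays TF-free.
    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    set_session(tf.Session(config=config))


if __name__ == "__main__":
    args = build_parser().parse_args()
    configure_session(args.allow_growth)
    # ... then run the normal convert logic
```

The key point is that the session must be configured before the model is loaded, otherwise Keras creates a default session that grabs all GPU memory first.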

nkdsoft commented 5 years ago

Found a temporary solution! I added this code to convert.py:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
config.log_device_placement = True  # to log device placement (on which device the operation ran)
                                    # (nothing gets printed in Jupyter, only if you run it standalone)
sess = tf.Session(config=config)
set_session(sess)  # set this TensorFlow session as the default session for Keras
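For reference, newer TF builds (1.13+, if I remember right) reportedly honor an environment variable that enables the same behavior without any code changes; treat this as an assumption to verify against your TF version:

```shell
# Assumption: TF >= 1.13 reads this variable and enables allow_growth,
# equivalent to the ConfigProto snippet above.
export TF_FORCE_GPU_ALLOW_GROWTH=true
# then run convert as usual, e.g.:
# python faceswap.py convert ...
```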

torzdf commented 5 years ago

tf 1.13 rc1 is out, so if you don't mind doing some testing, you could try with TF 1.13.

FWIW I run tf 1.12 with CUDA 10 and cuDNN 7.4.2 on a GTX 1080 and I have no issues. Do you mean RTX 2070?

nkdsoft commented 5 years ago

> tf 1.13 rc1 is out, so if you don't mind doing some testing, you could try with TF 1.13.
>
> FWIW I run tf 1.12 with CUDA 10 and cuDNN 7.4.2 on a GTX 1080 and I have no issues. Do you mean RTX 2070?

I have the TF 1.13 nightly build in another environment (installed with pip) and I see similar problems when running TF sample programs that use convolutions. I will try 1.13 rc1, but I don't have much hope.

Yes, I meant RTX 2070. I think it's a problem with the new architecture of all the RTX cards (I don't know whether the problem is CUDA 10, cuDNN, or something else). I read about this problem somewhere else, and that's how I found the workaround of setting the allow_growth = True GPU option.