dorarad / gansformer

Generative Adversarial Transformers
MIT License

Issues with docker #7

Closed arsyad-ah closed 3 years ago

arsyad-ah commented 3 years ago

Hi,

I'm trying to dockerize the project using this image: tensorflow/tensorflow:1.14.0-gpu-py3.

FROM tensorflow/tensorflow:1.14.0-gpu-py3

ARG USER="test"
ARG WORK_DIR="/home/$USER"

WORKDIR $WORK_DIR

RUN apt-get update && apt-get install build-essential

RUN apt-get install ffmpeg libsm6 libxext6  -y

RUN pip install --upgrade pip setuptools wheel

COPY . ./

RUN pip install -r requirements.txt

RUN python generate.py --gpus 0 --model gdrive:bedrooms-snapshot.pkl --output-dir images --images-num 4

However, I am getting this error:

Downloading https://drive.google.com/uc?id=1-2L3iCBpP_cf6T2onf3zEQJFAAzxsQne .... done

2021-04-06 08:32:44 UTC -- Setting up TensorFlow plugin 'upfirdn_2d.cu': Preprocessing... Compiling... Loading... bin_file:  /home/test/dnnlib/tflib/_cudacache/upfirdn_2d_1.14_.so

2021-04-06 08:32:44 UTC -- Failed!

2021-04-06 08:32:44 UTC -- Traceback (most recent call last):

2021-04-06 08:32:44 UTC --   File "generate.py", line 49, in <module>

2021-04-06 08:32:44 UTC --     main()

2021-04-06 08:32:44 UTC --   File "generate.py", line 46, in main

2021-04-06 08:32:44 UTC --     run(**vars(args))

2021-04-06 08:32:44 UTC --   File "generate.py", line 22, in run

2021-04-06 08:32:44 UTC --     G, D, Gs = load_networks(model)                             # Load pre-trained network

2021-04-06 08:32:44 UTC --   File "/home/test/pretrained_networks.py", line 30, in load_networks

2021-04-06 08:32:44 UTC --     G, D, Gs = pickle.load(stream, encoding = "latin1")[:3]

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/network.py", line 306, in __setstate__

2021-04-06 08:32:44 UTC --     self._init_graph()

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/network.py", line 159, in _init_graph

2021-04-06 08:32:44 UTC --     out_expr = self._build_func(*self.input_templates, **build_kwargs)

2021-04-06 08:32:44 UTC --   File "<string>", line 2371, in G_synthesis_stylegan2

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/ops/upfirdn_2d.py", line 229, in downsample_2d

2021-04-06 08:32:44 UTC --     return _simple_upfirdn_2d(x, k, down=factor, pad0=(p+1)//2, pad1=p//2, data_format=data_format, impl=impl)

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/ops/upfirdn_2d.py", line 358, in _simple_upfirdn_2d

2021-04-06 08:32:44 UTC --     y = upfirdn_2d(y, k, upx=up, upy=up, downx=down, downy=down, padx0=pad0, padx1=pad1, pady0=pad0, pady1=pad1, impl=impl)

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/ops/upfirdn_2d.py", line 61, in upfirdn_2d

2021-04-06 08:32:44 UTC --     return impl_dict[impl](x=x, k=k, upx=upx, upy=upy, downx=downx, downy=downy, padx0=padx0, padx1=padx1, pady0=pady0, pady1=pady1)

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/ops/upfirdn_2d.py", line 139, in _upfirdn_2d_cuda

2021-04-06 08:32:44 UTC --     return func(x)

2021-04-06 08:32:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/custom_gradient.py", line 162, in decorated

2021-04-06 08:32:44 UTC --     return _graph_mode_decorator(f, *args, **kwargs)

2021-04-06 08:32:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/custom_gradient.py", line 183, in _graph_mode_decorator

2021-04-06 08:32:44 UTC --     result, grad_fn = f(*args)

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/ops/upfirdn_2d.py", line 131, in func

2021-04-06 08:32:44 UTC --     y = _get_plugin().up_fir_dn2d(x=x, k=kc, upx=upx, upy=upy, downx=downx, downy=downy, padx0=padx0, padx1=padx1, pady0=pady0, pady1=pady1)

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/ops/upfirdn_2d.py", line 14, in _get_plugin

2021-04-06 08:32:44 UTC --     return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')

2021-04-06 08:32:44 UTC --   File "/home/test/dnnlib/tflib/custom_ops.py", line 162, in get_plugin

2021-04-06 08:32:44 UTC --     plugin = tf.load_op_library(bin_file)

2021-04-06 08:32:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library

2021-04-06 08:32:44 UTC --     lib_handle = py_tf.TF_LoadLibrary(library_filename)

2021-04-06 08:32:44 UTC -- tensorflow.python.framework.errors_impl.NotFoundError: /home/test/dnnlib/tflib/_cudacache/upfirdn_2d_1.14_.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

2021-04-06 08:32:44 UTC -- error building image: error building stage: failed to execute command: waiting for process to exit: exit status 1

Please help to check and advise. Thanks!

dorarad commented 3 years ago

Hi! Thanks for reaching out. In the following line: https://github.com/dorarad/gansformer/blob/main/dnnlib/tflib/custom_ops.py#L130 try changing int(tf_ver < 1.15) to 0.

Then remove the cached custom-op builds so they are recompiled on the next run: rm -rf /home/test/dnnlib/tflib/_cudacache

After that, try running the code again. Let me know if you keep having issues with that!

Source solution: https://github.com/lmb-freiburg/demon/issues/26
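
A note on why this change can help: the undefined symbol in the traceback contains `St7__cxx11`, the mangled form of libstdc++'s `std::__cxx11` namespace. That suggests the plugin was compiled with `_GLIBCXX_USE_CXX11_ABI=1` while the installed TensorFlow binary does not export symbols built that way. The sketch below is a minimal illustration under that assumption. The helper names `symbol_uses_cxx11_abi` and `abi_from_flags` and the example flag list are hypothetical and are not part of gansformer.

```python
import re

# Marker that libstdc++'s new (C++11) ABI leaves in mangled names:
# std::__cxx11::basic_string mangles with the component "St7__cxx11".
CXX11_MARKER = "St7__cxx11"


def symbol_uses_cxx11_abi(symbol):
    """Return True if a mangled symbol references std::__cxx11 types."""
    return CXX11_MARKER in symbol


def abi_from_flags(flags):
    """Return the _GLIBCXX_USE_CXX11_ABI value found in compiler flags, or None."""
    for flag in flags:
        match = re.fullmatch(r"-D_GLIBCXX_USE_CXX11_ABI=(\d)", flag)
        if match:
            return int(match.group(1))
    return None


# Symbol copied from the error message in this thread.
missing = ("_ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx11"
           "12basic_stringIcSt11char_traitsIcESaIcEEE")

# Illustrative flag list (assumed values, not taken from the thread).
example_flags = ["-I/usr/include/tf", "-D_GLIBCXX_USE_CXX11_ABI=0"]

plugin_abi = 1 if symbol_uses_cxx11_abi(missing) else 0
tf_abi = abi_from_flags(example_flags)
abi_mismatch = tf_abi is not None and plugin_abi != tf_abi
print("plugin ABI:", plugin_abi, "TensorFlow ABI:", tf_abi, "mismatch:", abi_mismatch)
```

With TensorFlow importable, passing `tf.sysconfig.get_compile_flags()` instead of `example_flags` would read the ABI value from the installed build rather than from an assumed list.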

arsyad-ah commented 3 years ago

Thanks for the quick reply!

I managed to solve that issue, but I now get another error when generating images.


2021-04-06 14:15:44 UTC -- Traceback (most recent call last):

2021-04-06 14:15:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call

2021-04-06 14:15:44 UTC --     return fn(*args)

2021-04-06 14:15:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn

2021-04-06 14:15:44 UTC --     options, feed_dict, fetch_list, target_list, run_metadata)

2021-04-06 14:15:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun

2021-04-06 14:15:44 UTC --     run_metadata)

2021-04-06 14:15:44 UTC -- tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'FusedBiasAct' used by {{node Gs/_Run/Gs/G_mapping/AttLayer_0/FusedBiasAct}}with these attrs: [gain=1, T=DT_FLOAT, axis=1, alpha=0, grad=0, act=1]

2021-04-06 14:15:44 UTC -- Registered devices: [CPU, XLA_CPU]

2021-04-06 14:15:44 UTC -- Registered kernels:

2021-04-06 14:15:44 UTC --   device='GPU'; T in [DT_HALF]

2021-04-06 14:15:44 UTC --   device='GPU'; T in [DT_FLOAT]

2021-04-06 14:15:44 UTC -- 

2021-04-06 14:15:44 UTC --   [[Gs/_Run/Gs/G_mapping/AttLayer_0/FusedBiasAct]]

2021-04-06 14:15:44 UTC -- 

2021-04-06 14:15:44 UTC -- During handling of the above exception, another exception occurred:

2021-04-06 14:15:44 UTC -- 

2021-04-06 14:15:44 UTC -- Traceback (most recent call last):

2021-04-06 14:15:44 UTC --   File "generate.py", line 53, in <module>

2021-04-06 14:15:44 UTC --     main()

2021-04-06 14:15:44 UTC --   File "generate.py", line 46, in main

2021-04-06 14:15:44 UTC --     run(**vars(args))

2021-04-06 14:15:44 UTC --   File "generate.py", line 28, in run

2021-04-06 14:15:44 UTC --     minibatch_size = batch_size, verbose = True)[0]

2021-04-06 14:15:44 UTC --   File "/home/test/dnnlib/tflib/network.py", line 488, in run

2021-04-06 14:15:44 UTC --     mb_out = tf.get_default_session().run(out_expr, dict(zip(in_expr, mb_in)))

2021-04-06 14:15:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run

2021-04-06 14:15:44 UTC --     run_metadata_ptr)

2021-04-06 14:15:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run

2021-04-06 14:15:44 UTC --     feed_dict_tensor, options, run_metadata)

2021-04-06 14:15:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run

2021-04-06 14:15:44 UTC --     run_metadata)

2021-04-06 14:15:44 UTC --   File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call

2021-04-06 14:15:44 UTC --     raise type(e)(node_def, op, message)

2021-04-06 14:15:44 UTC -- tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'FusedBiasAct' used by node Gs/_Run/Gs/G_mapping/AttLayer_0/FusedBiasAct (defined at <string>:96) with these attrs: [gain=1, T=DT_FLOAT, axis=1, alpha=0, grad=0, act=1]

2021-04-06 14:15:44 UTC -- Registered devices: [CPU, XLA_CPU]

2021-04-06 14:15:44 UTC -- Registered kernels:

2021-04-06 14:15:44 UTC --   device='GPU'; T in [DT_HALF]

2021-04-06 14:15:44 UTC --   device='GPU'; T in [DT_FLOAT]

2021-04-06 14:15:44 UTC -- 

2021-04-06 14:15:44 UTC --   [[Gs/_Run/Gs/G_mapping/AttLayer_0/FusedBiasAct]]

2021-04-06 14:15:44 UTC -- 

2021-04-06 14:15:44 UTC -- Errors may have originated from an input operation.

2021-04-06 14:15:44 UTC -- Input Source operations connected to node Gs/_Run/Gs/G_mapping/AttLayer_0/FusedBiasAct:

2021-04-06 14:15:44 UTC --  Gs/_Run/Gs/G_mapping/AttLayer_0/mul_1 (defined at <string>:273)

2021-04-06 14:15:44 UTC --  Gs/_Run/Gs/G_mapping/AttLayer_0/Const_1 (defined at /home/test/dnnlib/tflib/ops/fused_bias_act.py:99)

2021-04-06 14:15:44 UTC --  Gs/_Run/Gs/G_mapping/AttLayer_0/MatMul (defined at <string>:247)

2021-04-06 14:15:46 UTC -- error building image: error building stage: failed to execute command: waiting for process to exit: exit status 1

dorarad commented 3 years ago

Note that the codebase builds two custom TensorFlow operations, and that seems to be the source of the issue. It looks like you might have a mismatch between CUDA and the TensorFlow version you use. https://github.com/tensorflow/tensorflow/issues/26600 may be helpful for the issue you mention!
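
One detail in the log worth checking: it reports `Registered devices: [CPU, XLA_CPU]`, while the `FusedBiasAct` kernels are registered only for `device='GPU'`. That suggests TensorFlow saw no GPU at the moment `generate.py` ran. The sketch below extracts both lists from such an error message and reports whether any registered kernel can run on an available device. The helper names `registered_devices` and `kernel_devices` are hypothetical and not part of gansformer or TensorFlow.

```python
import re


def registered_devices(message):
    """Return device names from a 'Registered devices: [...]' line."""
    match = re.search(r"Registered devices: \[([^\]]*)\]", message)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def kernel_devices(message):
    """Return the set of device types that have a kernel registered for the op."""
    return set(re.findall(r"device='(\w+)'", message))


# Excerpt of the error message from this thread (timestamps removed).
error_text = """No OpKernel was registered to support Op 'FusedBiasAct'
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'; T in [DT_HALF]
  device='GPU'; T in [DT_FLOAT]
"""

available = registered_devices(error_text)
needed = kernel_devices(error_text)
missing_devices = sorted(needed - set(available))
runnable = bool(needed & set(available))
print("available:", available, "kernels for:", sorted(needed),
      "missing:", missing_devices, "runnable:", runnable)
```

Inside a running container, `tf.test.is_gpu_available()` (TensorFlow 1.x) answers the same question directly. Because the Dockerfile above calls `generate.py` in a `RUN` step, the script executes at image build time. One possible explanation, an assumption that the thread does not confirm, is that no GPU was exposed to the build.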

arsyad-ah commented 3 years ago

Yup, that is right. It seems to be a mismatch between CUDA and TF, but I solved it when I was using Docker. Thanks for the help!

dorarad commented 3 years ago

That's great, happy to hear that!