dorarad / gansformer

Generative Adversarial Transformers
MIT License

Setting up TensorFlow plugin 'fused_bias_act.cu': Loading... Failed! #32

kwhuang88228 closed this issue 2 years ago

kwhuang88228 commented 2 years ago

Hi Drew, I'm getting the following error both when I train a GANformer model on the CLEVR dataset from scratch and when I fine-tune a pretrained model. I didn't have this issue before the repo was updated with the PyTorch implementation. I've also tried this and this without luck. Do you have any ideas?

Environment: Python 3.6.13, tensorflow-gpu 1.14.0, CUDA 9.1, cuDNN 7

Start model training from scratch
Local submit - run_dir: results/clevr-scratch-000
dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset datasets...
Dataset shape:  [3, 256, 256]
Dynamic range:  [0, 255]
Constructing networks...
Setting up TensorFlow plugin 'fused_bias_act.cu': Loading... Failed!
Traceback (most recent call last):
  File "run_network.py", line 556, in <module>
    main()
  File "run_network.py", line 553, in main
    run(**vars(args))
  File "run_network.py", line 368, in run
    dnnlib.submit_run(**kwargs)
  File "/datadrive/kwhuang/gansformer/dnnlib/submission/submit.py", line 346, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/datadrive/kwhuang/gansformer/dnnlib/submission/internal/local.py", line 16, in submit
    return run_wrapper(submit_config)
  File "/datadrive/kwhuang/gansformer/dnnlib/submission/submit.py", line 254, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/datadrive/kwhuang/gansformer/training/training_loop.py", line 194, in training_loop
    label_size = dataset.label_size, **cG.args)
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/network.py", line 100, in __init__
    self._init_graph()
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/network.py", line 159, in _init_graph
    out_expr = self._build_func(*self.input_templates, **build_kwargs)
  File "/datadrive/kwhuang/gansformer/training/networks.py", line 868, in Generator
    components.synthesis = tflib.Network("G_synthesis", func_name = globals()[synthesis_func], **kwargs)
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/network.py", line 100, in __init__
    self._init_graph()
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/network.py", line 159, in _init_graph
    out_expr = self._build_func(*self.input_templates, **build_kwargs)
  File "/datadrive/kwhuang/gansformer/training/networks.py", line 1423, in G_synthesis
    kernel = 3, att_vars = att_vars)
  File "/datadrive/kwhuang/gansformer/training/networks.py", line 1267, in layer
    resample_kernel = resample_kernel, fused_modconv = _fused_modconv, modulate = style, noconv = noconv)
  File "/datadrive/kwhuang/gansformer/training/networks.py", line 390, in modulated_conv2d_layer
    s = dense_layer(y, dim = get_shape(x)[1], weight_var = mod_weight_var, bias_var = mod_bias_var) + 1 # [BI]
  File "/datadrive/kwhuang/gansformer/training/networks.py", line 77, in dense_layer
    x = apply_bias_act(x, act, lrmul, bias_var, name)
  File "/datadrive/kwhuang/gansformer/training/networks.py", line 85, in apply_bias_act
    return fused_bias_act(x, b = b, act = act)
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/ops/fused_bias_act.py", line 62, in fused_bias_act
    return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain)
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/ops/fused_bias_act.py", line 116, in _fused_bias_act_cuda
    cuda_kernel = _get_plugin().fused_bias_act
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/ops/fused_bias_act.py", line 10, in _get_plugin
    return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
  File "/datadrive/kwhuang/gansformer/dnnlib/tflib/custom_ops.py", line 156, in get_plugin
    plugin = tf.load_op_library(bin_file)
  File "/anaconda/envs/gansformer/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /datadrive/kwhuang/gansformer/dnnlib/tflib/_cudacache/fused_bias_act_1.14_.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
dorarad commented 2 years ago

Hi, thanks for reaching out!

I recommend changing int(tf_ver < 1.15) to 0 on the following line: https://github.com/dorarad/gansformer/blob/main/dnnlib/tflib/custom_ops.py#L130
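One possible reading of the error above, offered as a sketch and not a confirmed diagnosis: the undefined symbol's mangled name contains `__cxx11`, which is how GCC's libstdc++ marks `std::string` under its newer C++11 ABI. That would suggest the plugin was compiled with `_GLIBCXX_USE_CXX11_ABI=1` while the installed TensorFlow library only exports the older-ABI variant of the symbol, which is consistent with forcing the value to 0. The helper name `uses_cxx11_abi` is hypothetical and not part of the repository.

```python
def uses_cxx11_abi(mangled_symbol: str) -> bool:
    """Return True if an Itanium-mangled symbol refers to the libstdc++ C++11 ABI.

    libstdc++ places the new-ABI std::string in the inline namespace
    std::__cxx11, which appears in mangled names as 'St7__cxx11'.
    """
    return "St7__cxx11" in mangled_symbol


# Symbol copied from the NotFoundError in the traceback above.
missing = (
    "_ZN10tensorflow12OpDefBuilder4AttrE"
    "NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"
)

if uses_cxx11_abi(missing):
    print("plugin expects the C++11 ABI (_GLIBCXX_USE_CXX11_ABI=1)")
else:
    print("plugin expects the pre-C++11 ABI (_GLIBCXX_USE_CXX11_ABI=0)")
```

The same check can be applied to any other "undefined symbol" message from a compiled plugin to see which ABI setting the build used.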

Then clear the cached custom-op builds so that the plugin is recompiled on the next run (adjust the path to your own checkout):

rm -rf /external_code/gan/gansformer/dnnlib/tflib/_cudacache/

and then try to run the code again. See the following issues (https://github.com/dorarad/gansformer/issues/7, https://github.com/dorarad/gansformer/issues/8) for further discussion, and let me know if the solution works!
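For anyone who prefers to clear the cache from Python, here is a minimal sketch of the same cleanup step. It assumes the cache lives at dnnlib/tflib/_cudacache under the repository root, as the path in the traceback above indicates. The function name `clear_cuda_cache` and its `repo_root` parameter are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def clear_cuda_cache(repo_root) -> bool:
    """Delete <repo_root>/dnnlib/tflib/_cudacache if it exists.

    Returns True if a cache directory was removed, False if none was found.
    """
    # Assumed location, taken from the .so path shown in the traceback.
    cache_dir = Path(repo_root) / "dnnlib" / "tflib" / "_cudacache"
    if not cache_dir.is_dir():
        return False
    # Removing the directory forces the custom ops to be rebuilt on the next run.
    shutil.rmtree(str(cache_dir))
    return True
```

Calling `clear_cuda_cache("/path/to/gansformer")` with your own checkout path should have the same effect as the `rm -rf` command, and it does nothing if the directory is already gone.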

kwhuang88228 commented 2 years ago

Thanks, Drew! Deleting the CUDA cache did the trick.

dorarad commented 2 years ago

Awesome! :-)