Cannot utilize multiple CPU cores

abstractdonut commented 3 years ago

Hi-

Thank you for making such a fascinating project available here!

I'm trying to run ganformer within a conda environment, but am having problems getting ganformer to utilize multiple CPU cores.

I'm on Ubuntu 20.04. Here is how I set up the conda environment:

conda create --name cuda10 python=3.7
conda activate cuda10
conda install tensorflow-gpu=1.14
conda install pillow h5py requests tqdm termcolor seaborn
pip install opencv-python lmdb gdown easydict

To run it:

python gansformer/run_network.py --train --pretrained-pkl None --gpus 0,1 --ganformer-default --expname myDS_256 --dataset myDS --data-dir /data/myDS_256_tf --keep-samples --metrics none --result-dir training_runs/256_c1/ --num-threads 24 --minibatch-size 16

Everything seems to run correctly, with no errors or crashes. The only problems are slow training initialization and low GPU utilization during training. System Monitor shows that only one CPU core is in use at a time, so I'm guessing this is the cause of both issues. Do you have any idea what might be restricting it to a single CPU core?
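
A quick check for a symptom like this is whether the process is even allowed to run on more than one core. The sketch below is a generic Linux/Python diagnostic, not part of ganformer; the function name `cpu_visibility` is made up for illustration. It compares the machine's core count with the CPU affinity of the current process and lists thread-count environment variables read by OpenMP, MKL, and OpenBLAS.

```python
import os

# Environment variables that OpenMP, Intel MKL, and OpenBLAS read to size
# their thread pools. A value of "1" here would limit those libraries.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def cpu_visibility():
    """Report how many cores exist and how many this process may run on."""
    total = os.cpu_count()
    # os.sched_getaffinity is only available on some platforms (e.g. Linux);
    # report None where it does not exist.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = None
    env = {name: os.environ.get(name) for name in THREAD_ENV_VARS}
    return {"total_cores": total, "usable_cores": usable, "thread_env": env}


if __name__ == "__main__":
    info = cpu_visibility()
    print("total cores :", info["total_cores"])
    print("usable cores:", info["usable_cores"])
    for name, value in info["thread_env"].items():
        print(f"{name} = {value}")
```

If `usable_cores` is 1 while `total_cores` is larger, the restriction comes from outside the program (for example a `taskset` or container limit). If both numbers are large, the cause is more likely in how the data pipeline or the TensorFlow thread pools are configured.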

I always try to avoid raising an issue when something obvious might be wrong on my end, but this is my first time using conda so it might be that I'm simply using it incorrectly, or that I'm using your program incorrectly. I appreciate your patience if that is the case.

Thank you for your attention to this issue!

abstractdonut commented 3 years ago

Whoops, I spoke too soon about things running correctly. After a little while the program crashes with:

2021-07-23 01:18:30.447650: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2021-07-23 01:18:30.447674: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2021-07-23 01:31:00.174385: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2021-07-23 01:31:00.174416: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[16,64,1024], b.shape=[16,16,1024], m=64, n=16, k=1024, batch_size=16
     [[{{node GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/iter_0/MatMul}}]]
     [[GPU1/Mean_1/_14191]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[16,64,1024], b.shape=[16,16,1024], m=64, n=16, k=1024, batch_size=16
     [[{{node GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/iter_0/MatMul}}]]
0 successful operations.
1 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gansformer/run_network.py", line 532, in <module>
    main()
  File "gansformer/run_network.py", line 529, in main
    run(**vars(args))
  File "gansformer/run_network.py", line 334, in run
    dnnlib.submit_run(**kwargs)
  File "/home/abstract/projects/gansformer/dnnlib/submission/submit.py", line 346, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/home/abstract/projects/gansformer/dnnlib/submission/internal/local.py", line 16, in submit
    return run_wrapper(submit_config)
  File "/home/abstract/projects/gansformer/dnnlib/submission/submit.py", line 254, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/home/abstract/projects/gansformer/training/training_loop.py", line 349, in training_loop
    cG.lossvals.update(tflib.run([cG.train_op, cG.ops], feed_dict)[1])
  File "/home/abstract/projects/gansformer/dnnlib/tflib/tfutil.py", line 23, in run
    return tf.get_default_session().run(*args, **kwargs)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[16,64,1024], b.shape=[16,16,1024], m=64, n=16, k=1024, batch_size=16
     [[node GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/iter_0/MatMul (defined at /home/abstract/projects/gansformer/training/network.py:714) ]]
     [[GPU1/Mean_1/_14191]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[16,64,1024], b.shape=[16,16,1024], m=64, n=16, k=1024, batch_size=16
     [[node GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/iter_0/MatMul (defined at /home/abstract/projects/gansformer/training/network.py:714) ]]
0 successful operations.
1 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/iter_0/MatMul:
 GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/Tile (defined at /home/abstract/projects/gansformer/training/network.py:615)

Input Source operations connected to node GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/iter_0/MatMul:
 GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/Tile (defined at /home/abstract/projects/gansformer/training/network.py:615)

Original stack trace for 'GPU0/G_loss/G/G_synthesis/8x8/Conv0_up/AttLayer_l2n/iter_0/MatMul':
  File "gansformer/run_network.py", line 532, in <module>
    main()
  File "gansformer/run_network.py", line 529, in main
    run(**vars(args))
  File "gansformer/run_network.py", line 334, in run
    dnnlib.submit_run(**kwargs)
  File "/home/abstract/projects/gansformer/dnnlib/submission/submit.py", line 346, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/home/abstract/projects/gansformer/dnnlib/submission/internal/local.py", line 16, in submit
    return run_wrapper(submit_config)
  File "/home/abstract/projects/gansformer/dnnlib/submission/submit.py", line 254, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/home/abstract/projects/gansformer/training/training_loop.py", line 272, in training_loop
    reals = reals, minibatch_size = minibatch_gpu_in, **cG.loss_args)
  File "/home/abstract/projects/gansformer/dnnlib/util.py", line 234, in call_func_by_name
    return func_obj(*args, **kwargs)
  File "/home/abstract/projects/gansformer/training/loss.py", line 23, in G_loss
    fake_imgs_out = G.get_output_for(latents, labels, is_training = True)[0]
  File "/home/abstract/projects/gansformer/dnnlib/tflib/network.py", line 231, in get_output_for
    out_expr = self._build_func(*final_inputs, **build_kwargs)
  File "/home/abstract/projects/gansformer/training/network.py", line 948, in G_GANformer
    is_training = is_training, force_clean_graph = is_template_graph, **kwargs)
  File "/home/abstract/projects/gansformer/dnnlib/tflib/network.py", line 231, in get_output_for
    out_expr = self._build_func(*final_inputs, **build_kwargs)
  File "/home/abstract/projects/gansformer/training/network.py", line 1432, in G_synthesis
    x, dlatents, _att_maps, att_vars = block(x, res, dlatents, dim = nf(res-1), att_vars = att_vars)
  File "/home/abstract/projects/gansformer/training/network.py", line 1341, in block
    dim = dim, kernel = 3, up = up, att_vars = att_vars)
  File "/home/abstract/projects/gansformer/training/network.py", line 1300, in layer
    name = "l2n", **kwargs)
  File "/home/abstract/projects/gansformer/training/network.py", line 714, in transformer_layer
    att_scores = tf.matmul(from_elements * w, to_centroids, transpose_b = True)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 2609, in matmul
    return batch_mat_mul_fn(a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1677, in batch_mat_mul_v2
    "BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/abstract/anaconda3/envs/python3.7/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Not sure if this is connected to the original issue though or if it's just a problem with my TF installation.
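
As a rough sanity check on the failing op, the shapes in the log can be turned into memory sizes. This is a back-of-the-envelope sketch that assumes the tensors are float32 (the log does not state the dtype); `tensor_bytes` is a helper name invented here.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # assumption: the log does not state the dtype


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


a_shape = (16, 64, 1024)  # a.shape from the log: [batch, m, k]
b_shape = (16, 16, 1024)  # b.shape from the log: [batch, n, k], transpose_b=True
# Batched matmul with transpose_b: [batch, m, k] x [batch, n, k]^T -> [batch, m, n]
out_shape = (a_shape[0], a_shape[1], b_shape[1])

for name, shape in (("a", a_shape), ("b", b_shape), ("out", out_shape)):
    print(f"{name}: shape={shape}, {tensor_bytes(shape) / 2**20:.4f} MiB")
```

Under that assumption the operands are about 4 MiB and 1 MiB and the result is 64 KiB, so this single matmul is small. That suggests the failure depends on the overall state of the GPU or the cuBLAS setup rather than on the size of this particular op.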

dvschultz commented 3 years ago

I’m having the same issue (Testing on Colab: TF 1.15.2, CUDA 11.0)

dvschultz commented 3 years ago

Fixed with https://github.com/dorarad/gansformer/issues/8#issuecomment-816580781

(Been a minute since I used StyleGAN2-TF, should have remembered this issue!)

dorarad commented 2 years ago

Hi guys! Apologies for the large delay in my response and thanks so much for the kind words!

I do recommend trying the solution referenced in the comment above, to make sure you are working in a clean environment.

I don't have experience running TF across multiple CPU cores. Just to make sure, based on the context: do you mean GPUs? If so, I recommend adding:

"gpu_options.allow_growth" : True,

to this line: https://github.com/dorarad/gansformer/blob/main/run_network.py#L110. Your error (Blas xGEMMBatched launch failed) looks like a GPU memory error (see https://stackoverflow.com/questions/43990046/tensorflow-blas-gemm-launch-failed for discussion of a similar issue). Hope it helps, and do let me know whether that resolved it or if you run into any further issues! :-)
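
For context, the dotted key above appears to follow the StyleGAN-style convention in which a flat dict of dotted names is applied to TensorFlow's nested session config. The sketch below imitates that mechanism with a plain stand-in object so that it runs without TensorFlow; `apply_dotted_options` and `make_standin_config` are hypothetical names, not the repository's code. In TF 1.x the real target would be a `tf.ConfigProto`, whose `gpu_options.allow_growth` field makes TensorFlow allocate GPU memory on demand instead of reserving nearly all of it at startup.

```python
from types import SimpleNamespace


def make_standin_config():
    # Stand-in for tf.ConfigProto (assumption: only the two fields used below).
    return SimpleNamespace(
        allow_soft_placement=False,
        gpu_options=SimpleNamespace(allow_growth=False),
    )


def apply_dotted_options(config, options):
    """Set nested attributes on `config` from keys like 'gpu_options.allow_growth'."""
    for key, value in options.items():
        *parents, leaf = key.split(".")
        target = config
        for name in parents:
            target = getattr(target, name)  # walk down to the parent object
        if not hasattr(target, leaf):
            raise AttributeError(f"unknown config field: {key}")
        setattr(target, leaf, value)
    return config


options = {
    "gpu_options.allow_growth": True,  # the setting suggested in this thread
    "allow_soft_placement": True,
}
config = apply_dotted_options(make_standin_config(), options)
print("allow_growth:", config.gpu_options.allow_growth)
```

Rejecting unknown keys is a choice made for this sketch: it turns a misspelled option name into an immediate error instead of a silently ignored setting.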
