CHTC / templates-GPUs

Template job submissions using GPUs in CHTC
MIT License

TensorFlow Docker example on Ampere GPUs #10

Open agitter opened 3 years ago

agitter commented 3 years ago

Our docker/tensorflow_python/ example fails on the A100 servers in CHTC with the error:

2021-03-24 19:51:27.096512: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
         [[{{node MatMul}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "test_tensorflow.py", line 41, in <module>
    sess.run(productg.op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
         [[node MatMul (defined at test_tensorflow.py:22) ]]
Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
 Variable/read (defined at test_tensorflow.py:20)
 Variable_1/read (defined at test_tensorflow.py:21)

I resolved the error by switching to the latest TensorFlow Docker image (2.4.1-gpu) and adding these two lines of TensorFlow migration code, which keep the 1.x API available under TensorFlow 2:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

We need to consider how to update this example. Should we have a TensorFlow 1.x example and a separate 2.x example? Do we need to constrain each example to the servers it is compatible with?

agitter commented 2 years ago

If https://github.com/CHTC/templates-GPUs/pull/11 works as expected, we can modify this TensorFlow example to add a requirement that CUDACapability < 8.
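
For context, the A100 is an Ampere GPU with CUDA compute capability 8.0, so a CUDACapability < 8 requirement would keep the TensorFlow 1.x example off those servers. The logic of that constraint can be sketched in Python as below. This is only an illustration: the helper name `supports_tf1_image` and the constant `AMPERE_CAPABILITY` are hypothetical and are not part of this repository or of HTCondor, and the sketch assumes the older image fails on every GPU at or above capability 8.

```python
# Hypothetical helper mirroring the proposed `CUDACapability < 8` requirement.
# Assumption: the TensorFlow 1.x image only works on GPUs older than Ampere.
AMPERE_CAPABILITY = 8.0  # A100 GPUs report compute capability 8.0


def supports_tf1_image(cuda_capability: float) -> bool:
    """Return True if a GPU with this compute capability passes the constraint."""
    return cuda_capability < AMPERE_CAPABILITY


if __name__ == "__main__":
    # Example capabilities: 6.1 (e.g. GTX 1080), 7.5 (e.g. RTX 2080 Ti), 8.0 (A100).
    for cap in (6.1, 7.5, 8.0):
        status = "eligible" if supports_tf1_image(cap) else "excluded"
        print(f"CUDACapability {cap}: {status}")
```

In the actual job, the equivalent check would live in the submit file's requirements expression rather than in Python, so that jobs are never matched to an incompatible server in the first place.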