Closed wormwang closed 5 years ago
At the same time, we see test failures in the HIP CTest suite:
The following tests FAILED:
     12 - directed_tests/deviceLib/hipAsynchronousStreams.tst (Child aborted)
     97 - directed_tests/runtimeApi/memory/hipMemset2D.tst (Child aborted)
     98 - directed_tests/runtimeApi/memory/hipMemset3D.tst (Child aborted)
    110 - directed_tests/runtimeApi/stream/hipStreamCreateWithPriority.tst (Failed)
    115 - directed_tests/surface/hipSurfaceObj2D.tst (SEGFAULT)
    116 - directed_tests/texture/hipBindTexRef1DFetch.tst (Child aborted)
    117 - directed_tests/texture/hipGetChanDesc.tst (Child aborted)
    118 - directed_tests/texture/hipTextureObj1DFetch.tst (Child aborted)
    119 - directed_tests/texture/hipTextureObj2D.tst (SEGFAULT)
    120 - directed_tests/texture/hipTextureRef2D.tst (SEGFAULT)
Errors while running CTest
Is this issue linked to the TF error?
Run the failing Python script in gdb:
$ gdb -ex r --args python3 tf-gpu.py gpu 1000
2019-06-13 15:40:27.654097: I tensorflow/core/common_runtime/placer.cc:927] transpose/perm: (Const)/job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-13 15:40:27.654117: I tensorflow/core/common_runtime/placer.cc:927] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[New Thread 0xffff177c61f0 (LWP 8391)]
[Thread 0xffff177c61f0 (LWP 8390) exited]
[New Thread 0xffff177c61f0 (LWP 8392)]
[Thread 0xffff177c61f0 (LWP 8391) exited]
[New Thread 0xffff177c61f0 (LWP 8393)]
[Thread 0xffff177c61f0 (LWP 8392) exited]
[Thread 0xffff177c61f0 (LWP 8393) exited]
terminate called after throwing an instance of 'std::exception'
  what():  std::exception
Thread 196 "python3" received signal SIGABRT, Aborted.
[Switching to Thread 0xffff187c81f0 (LWP 8388)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
+@sunway513
@wormwang your configuration seems pretty outdated (ROCm 2.3 + TF 1.12) and doesn't seem to be something we generally test (UB 18.04). Between ROCm 2.2 and 2.4 there were tremendous changes in the underlying runtime and compiler components, so I can imagine things going wrong if one component is out of alignment with the others.
Could you try the Docker container with ROCm 2.5 + TF 1.13? rocm/tensorflow:rocm2.5-tf1.13-python3 could be a good tag to start with.
Hi @wormwang, as @whchung suggested, please first upgrade your ROCm installation, especially the rock-dkms package. Please also make sure the HIP unit tests pass in the rocm2.5 TF Docker image before trying to run TF scripts.
@wormwang could you please post your question on the ROCm GitHub repository: https://github.com/RadeonOpenCompute/ROCm/issues
I reproduced the same error with ROCm 2.4, while HIP programs work well. Run the failing Python script in gdb:
$ gdb -ex r --args python3 tf-gpu.py gpu 1000
transpose/perm: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-20 13:51:29.562211: I tensorflow/core/common_runtime/placer.cc:927] transpose/perm: (Const)/job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-20 13:51:29.562229: I tensorflow/core/common_runtime/placer.cc:927] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[New Thread 0xffff0bfc71f0 (LWP 17608)]
[Thread 0xffff0bfc71f0 (LWP 17607) exited]
[New Thread 0xffff0bfc71f0 (LWP 17624)]
[Thread 0xffff0bfc71f0 (LWP 17608) exited]
[New Thread 0xffff0bfc71f0 (LWP 17625)]
[Thread 0xffff0bfc71f0 (LWP 17624) exited]
[Thread 0xffff0bfc71f0 (LWP 17625) exited]
terminate called after throwing an instance of 'std::exception'
  what():  std::exception
Thread 197 "python3" received signal SIGABRT, Aborted.
[Switching to Thread 0xffff0c7c81f0 (LWP 17606)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
We reproduced the core dump with another TF script.
Running the hello-world script works:
2019-06-22 23:19:17.799338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Vega 10 XT [Radeon RX Vega 64], pci bus id: 0000:33:00.0)
b'Hello, TensorFlow!'
We get a core dump when running basic_operations.py:
$ gdb -ex r --args python3 basic_operations.py
Addition with constants: 5
[New Thread 0xffff0bfc71f0 (LWP 10720)]
[New Thread 0xffff0bfc71f0 (LWP 10721)]
[Thread 0xffff0bfc71f0 (LWP 10720) exited]
[New Thread 0xffff0bfc71f0 (LWP 10722)]
[Thread 0xffff0bfc71f0 (LWP 10721) exited]
[New Thread 0xffff0bfc71f0 (LWP 10723)]
[Thread 0xffff0bfc71f0 (LWP 10722) exited]
Multiplication with constants: 6
2019-06-22 23:01:23.131450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Adding visible gpu devices: 0
2019-06-22 23:01:23.131508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-22 23:01:23.131523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1057] 0
2019-06-22 23:01:23.131536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] 0: N
2019-06-22 23:01:23.131588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Vega 10 XT [Radeon RX Vega 64], pci bus id: 0000:33:00.0)
[New Thread 0xffff35abd1f0 (LWP 10724)]
[New Thread 0xffff362be1f0 (LWP 10725)]
[Thread 0xffff0bfc71f0 (LWP 10723) exited]
[Thread 0xffff35abd1f0 (LWP 10620) exited]
[Thread 0xffff362be1f0 (LWP 10619) exited]
[New Thread 0xffff0bfc71f0 (LWP 10726)]
[New Thread 0xffff0bfc71f0 (LWP 10727)]
[Thread 0xffff0bfc71f0 (LWP 10726) exited]
[New Thread 0xffff0bfc71f0 (LWP 10728)]
[Thread 0xffff0bfc71f0 (LWP 10727) exited]
Addition with variables: 5
[New Thread 0xffff0bfc71f0 (LWP 10729)]
[Thread 0xffff0bfc71f0 (LWP 10728) exited]
[New Thread 0xffff0bfc71f0 (LWP 10730)]
[New Thread 0xffff0bfc71f0 (LWP 10731)]
[Thread 0xffff0bfc71f0 (LWP 10730) exited]
[Thread 0xffff0bfc71f0 (LWP 10729) exited]
[Thread 0xffff0bfc71f0 (LWP 10731) exited]
terminate called after throwing an instance of 'std::exception'
  what():  std::exception
Thread 195 "python3" received signal SIGABRT, Aborted.
[Switching to Thread 0xffff0d7ca1f0 (LWP 10683)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
@wormwang Looking at the stack trace, I assume something is wrong inside the HIP implementation on your platform with respect to kernel arguments. Are you using the official HIP implementation, or do you keep a downstream fork?
I built HIP from the ROCm 2.4 sources fetched with repo sync.
I did not touch the HIP source code.
On the other hand, some HIP example apps run well.
@wormwang Thanks for the additional information. This ticket now looks less like a TensorFlow issue and more like a HIP issue. I'll see to what extent I can help you here.
Relevant code in HIP is here: https://github.com/ROCm-Developer-Tools/HIP/blob/roc-2.4.x/include/hip/hcc_detail/functional_grid_launch.hpp#L100
You can see there are two places where a C++ exception may be raised: one when a kernel (__global__ function) can't be located by the HIP runtime, and the other when its metadata can't be located.
If you add traces or breakpoints you should be able to identify which exception was really raised. And you can find corresponding implementation details of HIP at: https://github.com/ROCm-Developer-Tools/HIP/blob/roc-2.4.x/include/hip/hcc_detail/program_state.hpp
In this file you can see that it traverses the binary via ELFIO, and I would recommend adding more traces to understand which symbols are missing. Since your platform differs from the usual ROCm deployments, I can imagine some places may have to be tuned.
Also, in your comment you mentioned that "some HIP example apps" run well. I would be wary of that. Are you able to get "all" of them to pass on your platform? If not, I'd surely start from there.
At the very least, "make test" in HIP must pass completely to ensure you have basic HIP functionality on your platform. https://github.com/ROCm-Developer-Tools/HIP/tree/roc-2.4.x/tests
I ran make test on ROCm 2.4; it appeared to hang at test 76. (Update: test 76 does not hang, but it runs for about 700 s.)
75/120 Test #75: directed_tests/runtimeApi/event/hipEventRecord--iterations10.tst ............................... Passed 0.47 sec
Start 76: directed_tests/runtimeApi/event/record_event.tst
@wormwang according to the test results, this ticket is beyond the scope of TensorFlow; it is about getting the HIP runtime to work properly on your platform.
I’d recommend you get a supported system (ex: x86 + vega10/20 + UB 16.04), install the same version of ROCm, and compare the differences of HIP tests with your platform.
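One way to make that comparison concrete is to diff the CTest failure summaries from the two systems. The following is only a minimal sketch: the helper names `parse_failed` and `diff_failures` are hypothetical, the regular expression assumes the `NN - name (status)` entry layout seen in the summaries posted in this thread, and the sample input is an excerpt of two summaries from this thread (both from the same platform), used purely as illustration.

```python
import re

# Matches one entry of a CTest failure summary, e.g.
# "12 - directed_tests/deviceLib/hipAsynchronousStreams.tst (Child aborted)".
# Assumption: entries always follow the "NN - name (status)" layout.
ENTRY = re.compile(r"(\d+) - (\S+) \(([^)]+)\)")


def parse_failed(summary):
    """Return {test_name: status} for every entry found in the summary text."""
    return {name: status for _num, name, status in ENTRY.findall(summary)}


def diff_failures(baseline, candidate):
    """Compare two {name: status} maps.

    Returns three sorted lists: tests failing only in the baseline, tests
    failing only in the candidate, and tests failing in both with a
    different status.
    """
    only_baseline = sorted(set(baseline) - set(candidate))
    only_candidate = sorted(set(candidate) - set(baseline))
    changed = sorted(
        name
        for name in set(baseline) & set(candidate)
        if baseline[name] != candidate[name]
    )
    return only_baseline, only_candidate, changed


# Excerpts of two failure summaries quoted in this thread.
earlier = (
    "12 - directed_tests/deviceLib/hipAsynchronousStreams.tst (Child aborted) "
    "110 - directed_tests/runtimeApi/stream/hipStreamCreateWithPriority.tst (Failed)"
)
later = (
    "12 - directed_tests/deviceLib/hipAsynchronousStreams.tst (Child aborted) "
    "54 - directed_tests/kernel/hipLaunchParm.tst (Not Run)"
)

only_earlier, only_later, changed = diff_failures(
    parse_failed(earlier), parse_failed(later)
)
print("only in earlier run:", only_earlier)
print("only in later run:", only_later)
print("status changed:", changed)
```

Feeding it the summary from a supported system as the baseline and the summary from your platform as the candidate would show which failures are specific to your platform.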
Sorry, I was mistaken about test 76: it does not hang, but it runs for about 700 s.
Latest results on ROCm 2.4 with a 5.0.21 kernel that includes amdkfd, amdgpu, etc.:
92% tests passed, 10 tests failed out of 120
Total Test time (real) = 1054.48 sec
The following tests FAILED:
     12 - directed_tests/deviceLib/hipAsynchronousStreams.tst (Child aborted)
     54 - directed_tests/kernel/hipLaunchParm.tst (Not Run)
     97 - directed_tests/runtimeApi/memory/hipMemset2D.tst (Child aborted)
     98 - directed_tests/runtimeApi/memory/hipMemset3D.tst (Child aborted)
    115 - directed_tests/surface/hipSurfaceObj2D.tst (SEGFAULT)
    116 - directed_tests/texture/hipBindTexRef1DFetch.tst (Child aborted)
    117 - directed_tests/texture/hipGetChanDesc.tst (Child aborted)
    118 - directed_tests/texture/hipTextureObj1DFetch.tst (Child aborted)
    119 - directed_tests/texture/hipTextureObj2D.tst (SEGFAULT)
    120 - directed_tests/texture/hipTextureRef2D.tst (SEGFAULT)
Errors while running CTest
Makefile:108: recipe for target 'test' failed
make: *** [test] Error 8
Those failing tests should be looked into, but in general they shouldn't block TensorFlow from executing on ROCm. I'd like to ask you to add additional traces or use a debugger to find out which C++ exception was raised inside hip_impl::make_kernarg, so we know the next step.
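One possible way to do this with a debugger is gdb's `catch throw` command, which stops the program whenever a C++ exception is thrown, so a backtrace can be taken at the throw site rather than at the later abort. The sketch below only assembles such a gdb command line for the script used earlier in this thread; the helper name `gdb_catch_throw_argv` is hypothetical, and it is an assumption that the first throw gdb stops at is the relevant one (if other exceptions are thrown and caught earlier, `continue` may be needed inside gdb).

```python
import shlex


def gdb_catch_throw_argv(program_argv):
    """Build a gdb command line that stops at a C++ throw and prints a backtrace."""
    return [
        "gdb",
        "-ex", "catch throw",  # stop whenever a C++ exception is thrown
        "-ex", "run",          # start the program under gdb
        "-ex", "bt",           # print the backtrace once gdb has stopped
        "--args", *program_argv,
    ]


# The failing invocation from this thread.
argv = gdb_catch_throw_argv(["python3", "tf-gpu.py", "gpu", "1000"])

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The frame inside functional_grid_launch.hpp in the resulting backtrace should indicate which of the two throw sites was hit.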
Closing this ticket, as ROCm doesn't support AArch64 distros; there's no shortcut for TF-ROCm to be functional on that stack at the moment.
I built TensorFlow and HIP with debug info. I got a more detailed stack trace, but I cannot tell which parameter is wrong.
(gdb) bt
0xffff563093e0 <tensorflow::functor::FillPhiloxRandomKernelLaunch<tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >(tensorflow::random::PhiloxRandom, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>::ResultElementType*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>)>,
actuals=std::tuple containing = {...}) at /opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:114
0xffff563093e0 <tensorflow::functor::FillPhiloxRandomKernelLaunch<tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >(tensorflow::random::PhiloxRandom, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>::ResultElementType*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>)>, numBlocks=...,
dimBlocks=..., sharedMemBytes=0, stream=0x99b540, args=..., args=..., args=..., args=...) at /opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:181
d=..., gen=..., data=0x4102000500, size=1000000, dist=...) at tensorflow/core/kernels/random_op_gpu.cu.cc:225
this=0xffff3bcc5f70, ctx=0xffff03fd64b0) at tensorflow/core/kernels/random_op.cc:204
at tensorflow/core/common_runtime/gpu/gpu_device.cc:548
at tensorflow/core/common_runtime/executor.cc:1782
at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:232
at tensorflow/core/lib/core/threadpool.cc:57
I’d like to ask you to add additional traces or use debugger to understand which C++ exception was raised inside hip_impl::make_kernarg so we know the next step.
System information
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
Running a simple GPU test Python script ends in a core dump:
2019-06-13 14:07:31.880680: I tensorflow/core/common_runtime/placer.cc:927] transpose/perm: (Const)/job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-13 14:07:31.880698: I tensorflow/core/common_runtime/placer.cc:927] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
terminate called after throwing an instance of 'std::exception'
  what():  std::exception
Aborted (core dumped)
Script:
$ cat tf-gpu.py
import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = sys.argv[1]  # Choose device from cmd line. Options: gpu or cpu
shape = (int(sys.argv[2]), int(sys.argv[2]))
if device_name == "gpu":
    device_name = "/gpu:0"
else:
    device_name = "/cpu:0"

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

# It can be hard to see the results on the terminal with lots of output -- add some newlines to improve readability.
print("\n" * 5)
print("Shape:", shape, "Device:", device_name)
print("Time taken:", datetime.now() - startTime)

print("\n" * 5)
Describe the expected behavior
Another Python script runs successfully, and the HIP examples also run successfully.
# cat test_single_gpu.py
from __future__ import print_function
'''
Basic Multi GPU computation example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''

'''
This tutorial requires your machine to have 1 GPU
"/cpu:0": The CPU of your machine.
"/gpu:0": The first GPU of your machine
'''

import numpy as np
import tensorflow as tf
import datetime

# Processing Units logs
log_device_placement = True

# Num of multiplications to perform
n = 10

'''
Example: compute A^n + B^n on 2 GPUs
Results on 8 cores with 2 GTX-980:
'''

# Create a graph to store results
c1 = []
c2 = []

def matpow(M, n):
    if n < 1:  # Abstract cases where n < 1
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))

'''
Single GPU computing
'''
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    # Compute A^n and B^n and store results in c1

with tf.device('/cpu:0'):
    sum = tf.add_n(c1)  # Addition of all elements in c1, i.e. A^n + B^n

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Run the op.

t2_1 = datetime.datetime.now()

print("Single GPU computation time: " + str(t2_1-t1_1))