ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

[aarch64] Running a Python script hits a core dump on TF 1.12.2 with ROCm 2.4 #506

Closed: wormwang closed this issue 5 years ago

wormwang commented 5 years ago


Describe the current behavior: running a simple GPU test Python script aborts with a core dump.

2019-06-13 14:07:31.880680: I tensorflow/core/common_runtime/placer.cc:927] transpose/perm: (Const)/job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-13 14:07:31.880698: I tensorflow/core/common_runtime/placer.cc:927] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
terminate called after throwing an instance of 'std::exception'
  what():  std::exception
Aborted (core dumped)

Script (cat tf-gpu.py):

import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = sys.argv[1]  # Choose device from cmd line. Options: gpu or cpu
shape = (int(sys.argv[2]), int(sys.argv[2]))
if device_name == "gpu":
    device_name = "/gpu:0"
else:
    device_name = "/cpu:0"

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

# It can be hard to see the results on the terminal with lots of output -- add some newlines to improve readability.
print("\n" * 5)
print("Shape:", shape, "Device:", device_name)
print("Time taken:", datetime.now() - startTime)

print("\n" * 5)

Describe the expected behavior: the script should run to completion. Another Python script runs successfully, and the HIP Examples also run successfully.

# cat test_single_gpu.py
from __future__ import print_function
'''
Basic Multi GPU computation example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''

'''
This tutorial requires your machine to have 1 GPU
"/cpu:0": The CPU of your machine.
"/gpu:0": The first GPU of your machine
'''

import numpy as np
import tensorflow as tf
import datetime

# Processing Units logs
log_device_placement = True

# Num of multiplications to perform
n = 10

'''
Example: compute A^n + B^n on 2 GPUs
Results on 8 cores with 2 GTX-980:
'''

# Create random large matrices (restored from the upstream example; A and B
# are referenced below but their definitions were lost in the paste)
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')

# Create a graph to store results
c1 = []
c2 = []

def matpow(M, n):
    if n < 1:  # Abstract cases where n < 1
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))

'''
Single GPU computing
'''
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    # Compute A^n and B^n and store results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

with tf.device('/cpu:0'):
    sum = tf.add_n(c1)  # Addition of all elements in c1, i.e. A^n + B^n

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Run the op.
    sess.run(sum, {a: A, b: B})
t2_1 = datetime.datetime.now()

print("Single GPU computation time: " + str(t2_1-t1_1))

Code to reproduce the issue: the tf-gpu.py script above, run as python3 tf-gpu.py gpu 1000.


wormwang commented 5 years ago

At the same time, we hit test failures in the HIP CTest suite:

The following tests FAILED:
    12 - directed_tests/deviceLib/hipAsynchronousStreams.tst (Child aborted)
    97 - directed_tests/runtimeApi/memory/hipMemset2D.tst (Child aborted)
    98 - directed_tests/runtimeApi/memory/hipMemset3D.tst (Child aborted)
   110 - directed_tests/runtimeApi/stream/hipStreamCreateWithPriority.tst (Failed)
   115 - directed_tests/surface/hipSurfaceObj2D.tst (SEGFAULT)
   116 - directed_tests/texture/hipBindTexRef1DFetch.tst (Child aborted)
   117 - directed_tests/texture/hipGetChanDesc.tst (Child aborted)
   118 - directed_tests/texture/hipTextureObj1DFetch.tst (Child aborted)
   119 - directed_tests/texture/hipTextureObj2D.tst (SEGFAULT)
   120 - directed_tests/texture/hipTextureRef2D.tst (SEGFAULT)
Errors while running CTest

Is this test failure linked to the TF error?

wormwang commented 5 years ago

Running the failing Python script under gdb:

$ gdb -ex r --args python3 tf-gpu.py gpu 1000

2019-06-13 15:40:27.654097: I tensorflow/core/common_runtime/placer.cc:927] transpose/perm: (Const)/job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-13 15:40:27.654117: I tensorflow/core/common_runtime/placer.cc:927] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[New Thread 0xffff177c61f0 (LWP 8391)]
[Thread 0xffff177c61f0 (LWP 8390) exited]
[New Thread 0xffff177c61f0 (LWP 8392)]
[Thread 0xffff177c61f0 (LWP 8391) exited]
[New Thread 0xffff177c61f0 (LWP 8393)]
[Thread 0xffff177c61f0 (LWP 8392) exited]
[Thread 0xffff177c61f0 (LWP 8393) exited]
terminate called after throwing an instance of 'std::exception'
  what():  std::exception

Thread 196 "python3" received signal SIGABRT, Aborted. [Switching to Thread 0xffff187c81f0 (LWP 8388)] __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt

0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

1 0x0000ffffbf40d8b4 in __GI_abort () at abort.c:79

2 0x0000ffff539050c4 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

3 0x0000ffff53902c34 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

4 0x0000ffff53902c80 in std::terminate() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

5 0x0000ffff53902f38 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

6 0x0000ffff57c9264c in hip_impl::hip_throw(std::exception const&) () from /opt/rocm/lib/libhip_hcc.so

7 0x0000ffff5c2803e0 in std::vector<unsigned char, std::allocator > hip_impl::make_kernarg<tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>, tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >(void ()(tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>), std::tuple<tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

8 0x0000ffff5c272334 in tensorflow::functor::FillPhiloxRandom<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::operator()(tensorflow::OpKernelContext, Eigen::GpuDevice const&, tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

9 0x0000ffff5c26dda0 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

whchung commented 5 years ago

+@sunway513

@wormwang your configuration seems to be pretty outdated (ROCm 2.3 + TF 1.12) and doesn't seem to be something we test in general (Ubuntu 18.04). Between ROCm 2.2 and 2.4 there were tremendous changes in the underlying runtime and compiler components, so I can imagine things going strange if one component is out of alignment.

Could you try the docker container with ROCm 2.5 + TF 1.13? rocm/tensorflow:rocm2.5-tf1.13-python3 would be a good tag to start with.
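For reference, a typical way to launch that image looks roughly like the following; the device flags come from the standard ROCm docker instructions and may need adjusting for your host:

    docker pull rocm/tensorflow:rocm2.5-tf1.13-python3
    # expose the ROCm device nodes and the video group to the container
    docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
        rocm/tensorflow:rocm2.5-tf1.13-python3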

sunway513 commented 5 years ago

Hi @wormwang, as @whchung suggested, please first upgrade your ROCm installation, especially the rock-dkms package. Please also make sure the HIP unit tests pass in the rocm2.5 TF docker image before trying to run TF scripts.
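A minimal upgrade sketch, assuming the ROCm apt repository is already configured on the host:

    sudo apt update
    # rock-dkms is the kernel driver package; rocm-dev pulls the core userspace
    sudo apt install --only-upgrade rock-dkms rocm-dev
    sudo reboot   # reload the rebuilt amdgpu/amdkfd kernel modules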

sunway513 commented 5 years ago

@wormwang can you please post your questions on the ROCm GitHub repository: https://github.com/RadeonOpenCompute/ROCm/issues

wormwang commented 5 years ago

Reproduced the same error with ROCm 2.4, while HIP programs work well. Running the failing Python script under gdb:

$ gdb -ex r --args python3 tf-gpu.py gpu 1000

transpose/perm: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-20 13:51:29.562211: I tensorflow/core/common_runtime/placer.cc:927] transpose/perm: (Const)/job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-20 13:51:29.562229: I tensorflow/core/common_runtime/placer.cc:927] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[New Thread 0xffff0bfc71f0 (LWP 17608)]
[Thread 0xffff0bfc71f0 (LWP 17607) exited]
[New Thread 0xffff0bfc71f0 (LWP 17624)]
[Thread 0xffff0bfc71f0 (LWP 17608) exited]
[New Thread 0xffff0bfc71f0 (LWP 17625)]
[Thread 0xffff0bfc71f0 (LWP 17624) exited]
[Thread 0xffff0bfc71f0 (LWP 17625) exited]
terminate called after throwing an instance of 'std::exception'
  what():  std::exception

Thread 197 "python3" received signal SIGABRT, Aborted. [Switching to Thread 0xffff0c7c81f0 (LWP 17606)] __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt

0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

1 0x0000ffffbf40d8b4 in __GI_abort () at abort.c:79

2 0x0000ffff4e0400c4 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

3 0x0000ffff4e03dc34 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

4 0x0000ffff4e03dc80 in std::terminate() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

5 0x0000ffff4e03df38 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

6 0x0000ffff57ba7a74 in hip_impl::hip_throw(std::exception const&) () from /opt/rocm/lib/libhip_hcc.so

7 0x0000ffff5c21d91c in std::vector<unsigned char, std::allocator > hip_impl::make_kernarg<tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>, tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >(void ()(tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>), std::tuple<tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

8 0x0000ffff5c20eaf4 in tensorflow::functor::FillPhiloxRandom<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::operator()(tensorflow::OpKernelContext, Eigen::GpuDevice const&, tensorflow::random::PhiloxRandom, float, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

9 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

10 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

11 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

12 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

13 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

14 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

15 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

16 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

17 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

18 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

19 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

20 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

21 0x0000ffff5c20a560 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

wormwang commented 5 years ago

We reproduced the core dump error with other TF scripts.

https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/1_Introduction/helloworld.py

Running helloworld.py works fine:

2019-06-22 23:19:17.799338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Vega 10 XT [Radeon RX Vega 64], pci bus id: 0000:33:00.0)
b'Hello, TensorFlow!'

We hit the core dump when running basic_operations.py:

https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/1_Introduction/basic_operations.py

$ gdb -ex r --args python3 basic_operations.py

Addition with constants: 5
[New Thread 0xffff0bfc71f0 (LWP 10720)]
[New Thread 0xffff0bfc71f0 (LWP 10721)]
[Thread 0xffff0bfc71f0 (LWP 10720) exited]
[New Thread 0xffff0bfc71f0 (LWP 10722)]
[Thread 0xffff0bfc71f0 (LWP 10721) exited]
[New Thread 0xffff0bfc71f0 (LWP 10723)]
[Thread 0xffff0bfc71f0 (LWP 10722) exited]
Multiplication with constants: 6
2019-06-22 23:01:23.131450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Adding visible gpu devices: 0
2019-06-22 23:01:23.131508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-22 23:01:23.131523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1057] 0
2019-06-22 23:01:23.131536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] 0: N
2019-06-22 23:01:23.131588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Vega 10 XT [Radeon RX Vega 64], pci bus id: 0000:33:00.0)
[New Thread 0xffff35abd1f0 (LWP 10724)]
[New Thread 0xffff362be1f0 (LWP 10725)]
[Thread 0xffff0bfc71f0 (LWP 10723) exited]
[Thread 0xffff35abd1f0 (LWP 10620) exited]
[Thread 0xffff362be1f0 (LWP 10619) exited]
[New Thread 0xffff0bfc71f0 (LWP 10726)]
[New Thread 0xffff0bfc71f0 (LWP 10727)]
[Thread 0xffff0bfc71f0 (LWP 10726) exited]
[New Thread 0xffff0bfc71f0 (LWP 10728)]
[Thread 0xffff0bfc71f0 (LWP 10727) exited]
Addition with variables: 5
[New Thread 0xffff0bfc71f0 (LWP 10729)]
[Thread 0xffff0bfc71f0 (LWP 10728) exited]
[New Thread 0xffff0bfc71f0 (LWP 10730)]
[New Thread 0xffff0bfc71f0 (LWP 10731)]
[Thread 0xffff0bfc71f0 (LWP 10730) exited]
[Thread 0xffff0bfc71f0 (LWP 10729) exited]
[Thread 0xffff0bfc71f0 (LWP 10731) exited]
terminate called after throwing an instance of 'std::exception'
  what():  std::exception

Thread 195 "python3" received signal SIGABRT, Aborted. [Switching to Thread 0xffff0d7ca1f0 (LWP 10683)] __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt

0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

1 0x0000ffffbf40d8b4 in __GI_abort () at abort.c:79

2 0x0000ffff4e04c0c4 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

3 0x0000ffff4e049c34 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

4 0x0000ffff4e049c80 in std::terminate() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

5 0x0000ffff4e049f38 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6

6 0x0000ffff57bb3a74 in hip_impl::hip_throw(std::exception const&) () from /opt/rocm/lib/libhip_hcc.so

7 0x0000ffff5cf6275c in std::vector<unsigned char, std::allocator > hip_impl::make_kernarg<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<short, 1, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<short, short, Eigen::internal::scalar_product_op<short, short> >, Eigen::TensorMap<Eigen::Tensor<short const, 1, 1, int>, 16, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, int, Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<short, 1, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<short, short, Eigen::internal::scalar_product_op<short, short> >, Eigen::TensorMap<Eigen::Tensor<short const, 1, 1, int>, 16, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, int>(void (*)(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<short, 1, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<short, short, Eigen::internal::scalar_product_op<short, short> >, Eigen::TensorMap<Eigen::Tensor<short const, 1, 1, int>, 16, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, int), std::tuple<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<short, 1, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<short, short, Eigen::internal::scalar_product_op<short, short> >, Eigen::TensorMap<Eigen::Tensor<short const, 1, 1, int>, 16, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, int>) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

8 0x0000ffff5cf62270 in Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<short, 1, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<short, short, Eigen::internal::scalar_product_op<short, short> >, Eigen::TensorMap<Eigen::Tensor<short const, 1, 1, int>, 16, Eigen::MakePointer> const> const> const, Eigen::GpuDevice, false>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<short, 1, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<short, short, Eigen::internal::scalar_product_op<short, short> >, Eigen::TensorMap<Eigen::Tensor<short const, 1, 1, int>, 16, Eigen::MakePointer> const> const> const&, Eigen::GpuDevice const&) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

9 0x0000ffff5cf31d9c in tensorflow::functor::BinaryFunctor<Eigen::GpuDevice, tensorflow::functor::mul, 1, false>::Right(Eigen::GpuDevice const&, Eigen::TensorMap<Eigen::Tensor<short, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<short const, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<short const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>, bool*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

10 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

11 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

12 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

13 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

14 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

15 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

16 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

17 0x0000ffff5c87579c in tensorflow::BinaryOp<Eigen::GpuDevice, tensorflow::functor::mul >::Compute(tensorflow::OpKernelContext*) ()

from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

whchung commented 5 years ago

@wormwang Looking at the stack trace, I assume something is wrong inside the HIP implementation on your platform with respect to kernel arguments. May I ask: are you using the official HIP implementation, or do you keep a downstream fork?

wormwang commented 5 years ago

I just built HIP from the ROCm 2.4 sources cloned via repo sync.

I didn't touch the HIP source code.

On the other hand, some HIP example apps run well.
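For context, a ROCm source checkout via repo sync of that era would look roughly like this (manifest URL and branch name are assumptions):

    repo init -u https://github.com/RadeonOpenCompute/ROCm.git -b roc-2.4.0
    repo sync   # fetches HIP, HCC, ROCR and the rest of the stack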

whchung commented 5 years ago

@wormwang Thanks for the additional information. This ticket is now not really about TensorFlow but about HIP. I'll see to what extent I can help you here.

Relevant code in HIP is here: https://github.com/ROCm-Developer-Tools/HIP/blob/roc-2.4.x/include/hip/hcc_detail/functional_grid_launch.hpp#L100

You can observe there are two places where a C++ exception may be raised: when a kernel (a __global__ function) can't be located by the HIP runtime, or when its metadata can't be located.

If you add traces or breakpoints, you should be able to identify which exception was actually raised. You can find the corresponding HIP implementation details at: https://github.com/ROCm-Developer-Tools/HIP/blob/roc-2.4.x/include/hip/hcc_detail/program_state.hpp

In this file you can see that HIP traverses the binary via ELFIO, and I would recommend adding more traces to understand which symbols are missing. Since your platform differs from the usual ROCm deployments, I can imagine some places may need tuning.
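A minimal gdb sketch for telling the two throw sites apart, reusing the reproducer from this thread (exact frame numbers will vary):

    $ gdb --args python3 tf-gpu.py gpu 1000
    (gdb) catch throw    # stop where the C++ exception is thrown
    (gdb) run
    (gdb) bt             # shows which call site inside hip_impl::make_kernarg fired
    (gdb) frame 2        # step up into make_kernarg (pick the right frame from bt)
    (gdb) info locals    # inspect the kernel / metadata lookup state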

whchung commented 5 years ago

Also, in your comment you mentioned that "some HIP example apps" run well. I would be wary of that. Are you able to get all of them to pass on your platform? If not, I'd definitely start there.

At the very least, "make test" in HIP must fully pass to ensure you have basic HIP functionality on your platform: https://github.com/ROCm-Developer-Tools/HIP/tree/roc-2.4.x/tests
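A sketch of running and then narrowing the suite from a HIP build tree (paths assumed; the ctest flags are standard):

    cd HIP/build
    make test                                # runs the full directed_tests suite via CTest
    ctest -R texture --output-on-failure     # re-run only the texture tests with full output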

wormwang commented 5 years ago

Ran make test on ROCm 2.4; it appeared to stall at test 76. (Update: test 76 is not hung, it just runs for about 700 s.)

75/120 Test #75: directed_tests/runtimeApi/event/hipEventRecord--iterations10.tst ............................... Passed 0.47 sec
       Start 76: directed_tests/runtimeApi/event/record_event.tst

whchung commented 5 years ago

@wormwang According to the test results, this ticket is beyond the scope of TensorFlow; it is about getting the HIP runtime working properly on your platform.

I'd recommend getting a supported system (e.g. x86 + Vega 10/20 + Ubuntu 16.04), installing the same version of ROCm, and comparing the HIP test results against your platform.
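One way to make that comparison concrete (log file names are placeholders):

    # On each machine, from the HIP build directory:
    ctest 2>&1 | tee ctest-$(uname -m).log
    # Then, with both logs copied to one machine:
    diff ctest-x86_64.log ctest-aarch64.log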

wormwang commented 5 years ago

Sorry, I was mistaken about test 76: it is not hung, it just runs for about 700 s.

Latest results on ROCm 2.4 and a 5.0.21 kernel that includes amdkfd, amdgpu, etc.:

92% tests passed, 10 tests failed out of 120

Total Test time (real) = 1054.48 sec

The following tests FAILED:
    12 - directed_tests/deviceLib/hipAsynchronousStreams.tst (Child aborted)
    54 - directed_tests/kernel/hipLaunchParm.tst (Not Run)
    97 - directed_tests/runtimeApi/memory/hipMemset2D.tst (Child aborted)
    98 - directed_tests/runtimeApi/memory/hipMemset3D.tst (Child aborted)
   115 - directed_tests/surface/hipSurfaceObj2D.tst (SEGFAULT)
   116 - directed_tests/texture/hipBindTexRef1DFetch.tst (Child aborted)
   117 - directed_tests/texture/hipGetChanDesc.tst (Child aborted)
   118 - directed_tests/texture/hipTextureObj1DFetch.tst (Child aborted)
   119 - directed_tests/texture/hipTextureObj2D.tst (SEGFAULT)
   120 - directed_tests/texture/hipTextureRef2D.tst (SEGFAULT)
Errors while running CTest
Makefile:108: recipe for target 'test' failed
make: *** [test] Error 8

whchung commented 5 years ago

Those failing tests should be looked into, but in general they shouldn't block TensorFlow from executing on ROCm. I'd like to ask you to add additional traces, or use a debugger, to understand which C++ exception was raised inside hip_impl::make_kernarg so we know the next step.

sunway513 commented 5 years ago

Closing this ticket as ROCm doesn't support AArch64 distros; there's no shortcut for TF-ROCm to be functional on that stack at the moment.

wormwang commented 5 years ago

I built TensorFlow and HIP with debug info and got a more detailed stack trace, but I cannot figure out which parameter is wrong.

(gdb) bt

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000ffffbf40d8b4 in __GI_abort () at abort.c:79
#2  0x0000ffff453e10c4 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#3  0x0000ffff453dec34 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#4  0x0000ffff453dec80 in std::terminate() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#5  0x0000ffff453def38 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#6  0x0000ffff4f131a8c in hip_impl::hip_throw(std::exception const&) () from /opt/rocm/lib/libhip_hcc.so
#7  0x0000ffff56308414 in hip_impl::make_kernarg<tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>, tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> > (kernel=0xffff563093e0 <tensorflow::functor::FillPhiloxRandomKernelLaunch<tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >(tensorflow::random::PhiloxRandom, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>::ResultElementType*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>)>, actuals=std::tuple containing = {...}) at /opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:114
#8  0x0000ffff562cc454 in hipLaunchKernelGGL<tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>, void (*)(tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>)> (kernel=0xffff563093e0 <tensorflow::functor::FillPhiloxRandomKernelLaunch<tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >(tensorflow::random::PhiloxRandom, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>::ResultElementType*, long long, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>)>, numBlocks=..., dimBlocks=..., sharedMemBytes=0, stream=0x99b540, args=..., args=..., args=..., args=...) at /opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:181
#9  0x0000ffff562cc330 in tensorflow::functor::FillPhiloxRandom<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::operator() (this=0xffff03fd5af8, d=..., gen=..., data=0x4102000500, size=1000000, dist=...) at tensorflow/core/kernels/random_op_gpu.cu.cc:225
#10 0x0000ffff562c0f78 in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::GpuDevice, tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float> >::Compute (this=0xffff3bcc5f70, ctx=0xffff03fd64b0) at tensorflow/core/kernels/random_op.cc:204
#11 0x0000ffff4e7b3088 in tensorflow::BaseGPUDevice::ComputeHelper (this=this@entry=0xfffe21a53c40, op_kernel=op_kernel@entry=0xffff3bcc5f70, context=context@entry=0xffff03fd64b0) at tensorflow/core/common_runtime/gpu/gpu_device.cc:548
#12 0x0000ffff4e7b3364 in tensorflow::BaseGPUDevice::Compute (this=0xfffe21a53c40, op_kernel=0xffff3bcc5f70, context=0xffff03fd64b0) at tensorflow/core/common_runtime/gpu/gpu_device.cc:486
#13 0x0000ffff4e7fabc8 in tensorflow::(anonymous namespace)::ExecutorState::Process (this=<optimized out>, tagged_node=..., scheduled_nsec=<optimized out>) at tensorflow/core/common_runtime/executor.cc:1782
#14 0x0000ffff4e7fb3cc in tensorflow::(anonymous namespace)::ExecutorState::<lambda()>::operator() (__closure=<optimized out>) at tensorflow/core/common_runtime/executor.cc:2200
#15 std::_Function_handler<void(), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(const TaggedNodeSeq&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
#16 0x0000ffff4e80df84 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706
#17 0x0000ffff4e8ab17c in tensorflow::thread::EigenEnvironment::ExecuteTask (t=..., this=0xffff3bdd88a8) at tensorflow/core/lib/core/threadpool.cc:80
#18 Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop (this=0xffff3bdd88a0, thread_id=<optimized out>) at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:232
#19 0x0000ffff4e8ab384 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}::operator()() const (__closure=<optimized out>) at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:65
#20 std::_Function_handler<void (), Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
#21 0x0000ffff4e80df84 in std::function<void ()>::operator()() const (this=this@entry=0xffff2ce25660) at /usr/include/c++/7/bits/std_function.h:706
#22 0x0000ffff4e8a994c in tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}::operator()() const (__closure=0xffff2ce25660) at tensorflow/core/lib/core/threadpool.cc:57
#23 std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...)

whchung commented 5 years ago

I’d like to ask you to add additional traces or use debugger to understand which C++ exception was raised inside hip_impl::make_kernarg so we know the next step.
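With the debug build above, a minimal follow-up gdb session might look like this; frame 7 refers to the backtrace just posted:

    $ gdb -ex r --args python3 tf-gpu.py gpu 1000
    ... after SIGABRT ...
    (gdb) frame 7       # hip_impl::make_kernarg at functional_grid_launch.hpp:114
    (gdb) list          # show the source around line 114 to see which check threw
    (gdb) info locals   # whether the kernel lookup or its metadata lookup failed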