ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Tensorflow:r1.12-rocm build failed on AMD APU #304

Closed ghostplant closed 5 years ago

ghostplant commented 5 years ago

OS => Ubuntu 18.10 (with Linux Image 4.20)

Python => 2.7

Tensorflow => https://github.com/ROCmSoftwarePlatform/tensorflow-upstream, branch = r1.12-rocm

Bazel => 0.15.0

Build command => ./build_rocm

AMD GPUs => AMD Ryzen

ROCm Version => 2.0.0

GCC Version => GCC-8

Bazel Error Logs =>

...
...
INFO: Analysed target //tensorflow/tools/pip_package:build_pip_package (330 packages loaded).
INFO: Found 1 target...
ERROR: /root/hip_example/tensorflow-upstream/tensorflow/tools/pip_package/BUILD:204:1: Creating runfiles tree bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles failed: Process exited with status 1: Process exited with status 1
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 10.514s, Critical Path: 1.30s
INFO: 7 processes: 7 local.
FAILED: Build did NOT complete successfully

whchung commented 5 years ago

To build r1.12, please upgrade Bazel to 0.19.2.
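A minimal sketch of checking an installed Bazel version against the 0.19.2 minimum mentioned above. The function names are hypothetical, and the "Build label:" text in the usage note is an assumed sample of what a version query might print, not output taken from this thread.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % (text,))
    return tuple(int(part) for part in match.group(1).split("."))


def bazel_is_new_enough(version_text, minimum="0.19.2"):
    """Compare versions numerically, so that 0.19.2 ranks above 0.15.0."""
    return parse_version(version_text) >= parse_version(minimum)
```

For example, `bazel_is_new_enough("0.15.0")` is false, while `bazel_is_new_enough("Build label: 0.19.2")` is true. Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "0.9" as newer than "0.19".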

sunway513 commented 5 years ago

@ghostplant Can you try building r1.12-rocm using the following dev Docker image?

rocm/tensorflow:rocm2.0-tf1.12-python3-dev

The image has all the dependencies needed to build TF from source.

ghostplant commented 5 years ago

@sunway513 @whchung Hi, I updated Bazel to 0.19.2. However, the build still fails for the same reason, but the log is more detailed this time:

INFO: Analysed target //tensorflow/tools/pip_package:build_pip_package (328 packages loaded, 17252 targets configured).
INFO: Found 1 target...
ERROR: /root/hip_example/tensorflow-upstream/tensorflow/tools/pip_package/BUILD:204:1: Creating runfiles tree bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles failed: build-runfiles failed: error executing command
  (cd /root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow && \
  exec env - \
    HIP_PLATFORM=hcc \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64 \
    PYTHON_BIN_PATH=/usr/bin/python \
    PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages \
    TF_DOWNLOAD_CLANG=0 \
    TF_NEED_CUDA=0 \
    TF_NEED_OPENCL_SYCL=0 \
    TF_NEED_ROCM=1 \
  /root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow/_bin/build-runfiles bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles_manifest bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles): Process exited with status 1
/root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow/_bin/build-runfiles (args bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles_manifest bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles): link or target filename contains space on line 2146: 'local_config_rocm/rocm/rocm/include/thrust/system/cuda/detail/cub-hip/eclipse code style profile.xml /root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow/bazel-out/k8-opt/genfiles/external/local_config_rocm/rocm/rocm/include/thrust/system/cuda/detail/cub-hip/eclipse code style profile.xml'

: Process exited with status 1
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 15.601s, Critical Path: 0.38s, Remote (0.00% of the time): [queue: 0.00%, setup: 0.00%, process: 0.00%]
INFO: 2 processes: 2 local.
FAILED: Build did NOT complete successfully

I have to build outside Docker for some reasons. Could you provide the Dockerfile used to generate rocm/tensorflow:rocm2.0-tf1.12-python3-dev, so that I can compare it with my host environment? Thanks!

whchung commented 5 years ago

This is something new... From the look of it, it seems to be related to a new file in the cub-hip repository which unfortunately has a file name containing spaces. Let me check the cub-hip repo real quick.

whchung commented 5 years ago

@ghostplant It doesn't look right to me... This project never really depends on the cub-hip project.

Could you check whether it's because you somehow have a cub-hip installation under /opt/rocm? Could you try removing it?
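One way to look for the kind of file that tripped build-runfiles ("link or target filename contains space") is to walk the install tree and list every entry whose name contains a space. This is only a sketch: the helper name is hypothetical, and /opt/rocm is the location mentioned in this thread, so adjust the root if ROCm lives elsewhere.

```python
import os


def find_names_with_spaces(root):
    """Return sorted paths under root whose file or directory name contains a space."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if " " in name:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # Assumed install location, taken from the discussion above.
    for path in find_names_with_spaces("/opt/rocm"):
        print(path)
```

Any path this prints is a candidate for the runfiles failure; finding which package owns it is then a job for the system package manager.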

ghostplant commented 5 years ago

I saw that the cub-hip headers belong to the Ubuntu package hip-thrust.

I have now purged it and rebuilt TensorFlow. The build gets past that point and everything looks fine. Great suggestion!

However, I encountered another bug when the build was almost finished:

1 warning generated.
ERROR: /root/hip_example/tensorflow-upstream/tensorflow/core/kernels/BUILD:759:1: C++ compilation of rule '//tensorflow/core/kernels:matrix_band_part_op' failed (Exit 1): crosstool_wrapper_driver_rocm failed: error executing command
  (cd /root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow && \
  exec env - \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/o$
t/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64 \
    PWD=/proc/self/cwd \
  external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_rocm -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections '-std=c++11' -MD -MF b$
zel-out/host/bin/tensorflow/core/kernels/_objs/matrix_band_part_op/matrix_band_part_op.pic.d '-frandom-seed=bazel-out/host/bin/tensorflow/core/kernels/_objs/matrix_band_part_op/matrix_band_part_op.pic.o' -fPIC -DEIGEN_MPL2_ONLY -D__CLAN$
_SUPPORT_DYN_ANNOTATION__ -DTF_USE_SNAPPY -DCURL_STATICLIB -DPLATFORM_LINUX -DENABLE_CURL_CLIENT -DENABLE_NO_ENCRYPTION -iquote . -iquote bazel-out/host/genfiles -iquote bazel-out/host/bin -iquote external/nsync -iquote bazel-out/host/g$
nfiles/external/nsync -iquote bazel-out/host/bin/external/nsync -iquote external/bazel_tools -iquote bazel-out/host/genfiles/external/bazel_tools -iquote bazel-out/host/bin/external/bazel_tools -iquote external/eigen_archive -iquote baz$
l-out/host/genfiles/external/eigen_archive -iquote bazel-out/host/bin/external/eigen_archive -iquote external/local_config_sycl -iquote bazel-out/host/genfiles/external/local_config_sycl -iquote bazel-out/host/bin/external/local_config_$
ycl -iquote external/com_google_absl -iquote bazel-out/host/genfiles/external/com_google_absl -iquote bazel-out/host/bin/external/com_google_absl -iquote external/gif_archive -iquote bazel-out/host/genfiles/external/gif_archive -iquote $
azel-out/host/bin/external/gif_archive -iquote external/jpeg -iquote bazel-out/host/genfiles/external/jpeg -iquote bazel-out/host/bin/external/jpeg -iquote external/protobuf_archive -iquote bazel-out/host/genfiles/external/protobuf_arch$
ve -iquote bazel-out/host/bin/external/protobuf_archive -iquote external/com_googlesource_code_re2 -iquote bazel-out/host/genfiles/external/com_googlesource_code_re2 -iquote bazel-out/host/bin/external/com_googlesource_code_re2 -iquote $
xternal/farmhash_archive -iquote bazel-out/host/genfiles/external/farmhash_archive -iquote bazel-out/host/bin/external/farmhash_archive -iquote external/fft2d -iquote bazel-out/host/genfiles/external/fft2d -iquote bazel-out/host/bin/ext$
rnal/fft2d -iquote external/highwayhash -iquote bazel-out/host/genfiles/external/highwayhash -iquote bazel-out/host/bin/external/highwayhash -iquote external/zlib_archive -iquote bazel-out/host/genfiles/external/zlib_archive -iquote baz$
l-out/host/bin/external/zlib_archive -iquote external/local_config_rocm -iquote bazel-out/host/genfiles/external/local_config_rocm -iquote bazel-out/host/bin/external/local_config_rocm -iquote external/local_config_cuda -iquote bazel-ou$
/host/genfiles/external/local_config_cuda -iquote bazel-out/host/bin/external/local_config_cuda -iquote external/double_conversion -iquote bazel-out/host/genfiles/external/double_conversion -iquote bazel-out/host/bin/external/double_con$
ersion -iquote external/curl -iquote bazel-out/host/genfiles/external/curl -iquote bazel-out/host/bin/external/curl -iquote external/boringssl -iquote bazel-out/host/genfiles/external/boringssl -iquote bazel-out/host/bin/external/boring$
sl -iquote external/jsoncpp_git -iquote bazel-out/host/genfiles/external/jsoncpp_git -iquote bazel-out/host/bin/external/jsoncpp_git -iquote external/aws -iquote bazel-out/host/genfiles/external/aws -iquote bazel-out/host/bin/external/a$
s -isystem external/nsync/public -isystem bazel-out/host/genfiles/external/nsync/public -isystem bazel-out/host/bin/external/nsync/public -isystem external/eigen_archive -isystem bazel-out/host/genfiles/external/eigen_archive -isystem b$
zel-out/host/bin/external/eigen_archive -isystem external/gif_archive/lib -isystem bazel-out/host/genfiles/external/gif_archive/lib -isystem bazel-out/host/bin/external/gif_archive/lib -isystem external/protobuf_archive/src -isystem baz$
l-out/host/genfiles/external/protobuf_archive/src -isystem bazel-out/host/bin/external/protobuf_archive/src -isystem external/farmhash_archive/src -isystem bazel-out/host/genfiles/external/farmhash_archive/src -isystem bazel-out/host/bi$
/external/farmhash_archive/src -isystem external/zlib_archive -isystem bazel-out/host/genfiles/external/zlib_archive -isystem bazel-out/host/bin/external/zlib_archive -isystem external/local_config_rocm/rocm -isystem bazel-out/host/genfi
les/external/local_config_rocm/rocm -isystem bazel-out/host/bin/external/local_config_rocm/rocm -isystem external/local_config_rocm/rocm/rocm/include -isystem bazel-out/host/genfiles/external/local_config_rocm/rocm/rocm/include -isystem
bazel-out/host/bin/external/local_config_rocm/rocm/rocm/include -isystem external/local_config_cuda/cuda -isystem bazel-out/host/genfiles/external/local_config_cuda/cuda -isystem bazel-out/host/bin/external/local_config_cuda/cuda -isyste
m external/local_config_cuda/cuda/cuda/include -isystem bazel-out/host/genfiles/external/local_config_cuda/cuda/cuda/include -isystem bazel-out/host/bin/external/local_config_cuda/cuda/cuda/include -isystem external/local_config_cuda/cud
a/cuda/include/crt -isystem bazel-out/host/genfiles/external/local_config_cuda/cuda/cuda/include/crt -isystem bazel-out/host/bin/external/local_config_cuda/cuda/cuda/include/crt -isystem external/local_config_rocm/rocm/rocm/include/rocra
nd -isystem bazel-out/host/genfiles/external/local_config_rocm/rocm/rocm/include/rocrand -isystem bazel-out/host/bin/external/local_config_rocm/rocm/rocm/include/rocrand -isystem external/double_conversion -isystem bazel-out/host/genfile
s/external/double_conversion -isystem bazel-out/host/bin/external/double_conversion -isystem external/curl/include -isystem bazel-out/host/genfiles/external/curl/include -isystem bazel-out/host/bin/external/curl/include -isystem external
/boringssl/src/include -isystem bazel-out/host/genfiles/external/boringssl/src/include -isystem bazel-out/host/bin/external/boringssl/src/include -isystem external/jsoncpp_git/include -isystem bazel-out/host/genfiles/external/jsoncpp_git
/include -isystem bazel-out/host/bin/external/jsoncpp_git/include -isystem external/aws/aws-cpp-sdk-core/include -isystem bazel-out/host/genfiles/external/aws/aws-cpp-sdk-core/include -isystem bazel-out/host/bin/external/aws/aws-cpp-sdk-
core/include -isystem external/aws/aws-cpp-sdk-kinesis/include -isystem bazel-out/host/genfiles/external/aws/aws-cpp-sdk-kinesis/include -isystem bazel-out/host/bin/external/aws/aws-cpp-sdk-kinesis/include -isystem external/aws/aws-cpp-s
dk-s3/include -isystem bazel-out/host/genfiles/external/aws/aws-cpp-sdk-s3/include -isystem bazel-out/host/bin/external/aws/aws-cpp-sdk-s3/include -g0 '-march=haswell' -g0 -DEIGEN_AVOID_STL_ARRAY -Iexternal/gemmlowp -Wno-sign-compare -fn
o-exceptions '-ftemplate-depth=900' '-DTENSORFLOW_USE_ROCM=1' -msse3 -pthread '-DTENSORFLOW_USE_ROCM=1' -DTENSORFLOW_USE_ROCM -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__T
IME__="redacted"' -D__HIP_PLATFORM_HCC__ -DEIGEN_USE_HIP -fno-canonical-system-headers -c tensorflow/core/kernels/matrix_band_part_op.cc -o bazel-out/host/bin/tensorflow/core/kernels/_objs/matrix_band_part_op/matrix_band_part_op.pic.o)
tensorflow/core/kernels/matrix_band_part_op.cc: In instantiation of 'void tensorflow::functor::MatrixBandPartFunctor<Eigen::ThreadPoolDevice, Scalar>::operator()(tensorflow::OpKernelContext*, const CPUDevice&, int, int, typename tensorfl
ow::TTypes<Scalar, 3>::ConstTensor, typename tensorflow::TTypes<Scalar, 3>::Tensor) [with Scalar = long long int; tensorflow::functor::CPUDevice = Eigen::ThreadPoolDevice; typename tensorflow::TTypes<Scalar, 3>::ConstTensor = Eigen::Tens
orMap<Eigen::Tensor<const long long int, 3, 1, long int>, 16, Eigen::MakePointer>; typename tensorflow::TTypes<Scalar, 3>::Tensor = Eigen::TensorMap<Eigen::Tensor<long long int, 3, 1, long int>, 16, Eigen::MakePointer>]':
tensorflow/core/kernels/matrix_band_part_op.cc:193:1:   required from here
tensorflow/core/kernels/matrix_band_part_op.cc:153:41: internal compiler error: in lookup_template_class_1, at cp/pt.c:9459
       const int64 batch_begin = begin / m;
                                         ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-8/README.Bugs> for instructions.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 1030.855s, Critical Path: 127.53s, Remote (0.00% of the time): [queue: 0.00%, setup: 0.00%, process: 0.00%]
INFO: 4988 processes: 4988 local.
FAILED: Build did NOT complete successfully

whchung commented 5 years ago

This is a gcc bug: with complex C++ templates, gcc may sometimes blow up. To overcome that, you might need more RAM on the system.

ghostplant commented 5 years ago

The host has 16 GB of RAM, and I didn't see the system run out of memory.

whchung commented 5 years ago

Could you verify whether gcc always crashes at the same spot? Also, may I ask which CPU you are using? In my experience this may happen occasionally on 1st gen Ryzen, but normally it goes away with a BIOS update, and it shouldn't happen on 2nd gen Ryzen or EPYC.
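To answer the "same spot" question, the logs from several build attempts can be compared mechanically. The sketch below assumes each attempt's output was saved as text and that crashes appear in the `file:line:col: internal compiler error: ...` form seen earlier in this thread; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# tensorflow/core/kernels/matrix_band_part_op.cc:153:41: internal compiler error: in ...
ICE_PATTERN = re.compile(
    r"^(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): "
    r"internal compiler error: (?P<msg>.*)$",
    re.MULTILINE,
)


def ice_locations(log_text):
    """Return (file, line, message) for every internal compiler error in one log."""
    return [
        (m.group("file"), int(m.group("line")), m.group("msg").strip())
        for m in ICE_PATTERN.finditer(log_text)
    ]


def summarize(logs):
    """Count how often each crash location appears across several build logs."""
    counts = Counter()
    for text in logs:
        counts.update(ice_locations(text))
    return counts
```

If every attempt reports the same location, the crash is reproducible at that spot. If the location moves between attempts, that matches the occasional, non-reproducible behaviour described above for 1st gen Ryzen.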

ghostplant commented 5 years ago

I kept resuming the build every time it crashed, and it finally finished and generated the wheel package. However, the generated package does not work well on my host: whenever an Eigen kernel needs to be launched, it fails with an internal error response.

ghostplant commented 5 years ago

I need a healthy tensorflow-rocm package built for gfx902. Is there some way to get access to one?

whchung commented 5 years ago

@ghostplant Unfortunately, there are some warning signs based on your comments:

whchung commented 5 years ago

@ghostplant One thing you can experiment with is the Docker image referred to by @sunway513 earlier in the thread:

rocm/tensorflow:rocm2.0-tf1.12-python3-dev

The image has all dependent packages installed. However, all GPU kernels inside are built for gfx803, gfx900, and gfx906, and I don't really know what happens on a gfx902 system. More than likely, at TensorFlow initialization it would complain that no compatible GPU was found and then execute the model on the CPU.
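As a rough illustration of that mismatch, the sketch below compares the GPU targets found in a piece of device-listing text against the three targets named above. The helper names are hypothetical, and the "Name: gfx902" string in the usage note is an assumed sample rather than real tool output.

```python
import re

# Targets the prebuilt image ships kernels for, per the comment above.
IMAGE_TARGETS = frozenset(["gfx803", "gfx900", "gfx906"])


def detected_targets(listing_text):
    """Return the sorted, de-duplicated gfx target names that appear in the text."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", listing_text)))


def unsupported_targets(listing_text, supported=IMAGE_TARGETS):
    """Return detected targets that have no prebuilt kernels in the image."""
    return [t for t in detected_targets(listing_text) if t not in supported]
```

For example, `unsupported_targets("Name: gfx902")` returns `["gfx902"]`, while a listing that contains only gfx900 and gfx906 returns an empty list.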

ghostplant commented 5 years ago

OK, has the 2nd gen Ryzen APU been released, and does it support ROCm now?

whchung commented 5 years ago

APU support on ROCm is still under internal development at this moment. Quite a few lower-level software components (Linux driver, low-level runtime, high-level runtime, compiler) need to be revised. Some PRs in this project were actually made to lay the groundwork for APU support. Please keep a watchful eye on announcements from AMD this year.

ghostplant commented 5 years ago

OK, thank you for the information!

delijati commented 2 years ago

So what happened to APU support? If there is anything written down, I would love to read it ;)