ghostplant closed this issue 5 years ago
To build r1.12, please upgrade Bazel to 0.19.2.
@ghostplant can you try building r1.12-rocm using the following dev Docker image: rocm/tensorflow:rocm2.0-tf1.12-python3-dev? The image has all the dependencies needed to build TF from source.
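A quick way to confirm that the installed Bazel meets that minimum is to compare version tuples. The following is only a sketch: the helper names `parse_version` and `bazel_is_new_enough` are made up for illustration, and it assumes the string passed in contains a dotted `X.Y.Z` version, such as the `Build label:` line printed by `bazel version`.

```python
import re

# Minimum Bazel version suggested in this thread for building r1.12.
MIN_BAZEL = (0, 19, 2)


def parse_version(text):
    """Return the first dotted X.Y.Z version found in `text` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def bazel_is_new_enough(version_output, minimum=MIN_BAZEL):
    """Compare the parsed version against `minimum` (hypothetical helper)."""
    return parse_version(version_output) >= minimum
```

With the Bazel 0.15.0 listed in the issue description, `bazel_is_new_enough("Build label: 0.15.0")` returns `False`, which matches the advice to upgrade.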
@sunway513 @whchung Hi, I updated bazel to 0.19.2. However, it still failed for the same reason but the log is more detailed this time:
INFO: Analysed target //tensorflow/tools/pip_package:build_pip_package (328 packages loaded, 17252 targets configured).
INFO: Found 1 target...
ERROR: /root/hip_example/tensorflow-upstream/tensorflow/tools/pip_package/BUILD:204:1: Creating runfiles tree bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles failed: build-runfiles failed: error executing command
(cd /root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow && \
exec env - \
HIP_PLATFORM=hcc \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64 \
PYTHON_BIN_PATH=/usr/bin/python \
PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages \
TF_DOWNLOAD_CLANG=0 \
TF_NEED_CUDA=0 \
TF_NEED_OPENCL_SYCL=0 \
TF_NEED_ROCM=1 \
/root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow/_bin/build-runfiles bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles_manifest bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles): Process exited with status 1
/root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow/_bin/build-runfiles (args bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles_manifest bazel-out/k8-opt/bin/tensorflow/tools/pip_package/build_pip_package.runfiles): link or target filename contains space on line 2146: 'local_config_rocm/rocm/rocm/include/thrust/system/cuda/detail/cub-hip/eclipse code style profile.xml /root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow/bazel-out/k8-opt/genfiles/external/local_config_rocm/rocm/rocm/include/thrust/system/cuda/detail/cub-hip/eclipse code style profile.xml'
: Process exited with status 1
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 15.601s, Critical Path: 0.38s, Remote (0.00% of the time): [queue: 0.00%, setup: 0.00%, process: 0.00%]
INFO: 2 processes: 2 local.
FAILED: Build did NOT complete successfully
I have to build outside Docker for various reasons. Can you provide the Dockerfile used to generate rocm/tensorflow:rocm2.0-tf1.12-python3-dev, so that I can look into how it differs from my host environment? Thanks!
This is something new... From the look of it, it seems to be related to a new file in the cub-hip repository which unfortunately has spaces in its file name. Let me check the cub-hip repo real quick.
@ghostplant it doesn't look right to me... In this project we never really depend on the cub-hip project. Could you check whether you somehow have a cub-hip installation under /opt/rocm? If so, could you try removing it?
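Since the build-runfiles error above is triggered by a path containing a space, one way to look for offending files is to walk the suspect directory and list every entry whose name contains one. This is a minimal sketch rather than anything from the ROCm tooling: `find_paths_with_spaces` is a hypothetical helper, and `/opt/rocm/include` in the usage note is only a guess at where to start looking.

```python
import os


def find_paths_with_spaces(root):
    """Return sorted paths (relative to `root`) whose file or directory name contains a space."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if " " in name:
                full_path = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full_path, root))
    return sorted(hits)
```

Calling `find_paths_with_spaces("/opt/rocm/include")` on the affected host should then list entries such as the `eclipse code style profile.xml` file named in the log, if they are still installed.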
I saw that the cub-hip header files belong to the Ubuntu package hip-thrust.
Now I have purged it and rebuilt TensorFlow. The build gets past that point now and everything looks fine. Great suggestion!
However, I encountered another error when the build was almost finished:
1 warning generated.
ERROR: /root/hip_example/tensorflow-upstream/tensorflow/core/kernels/BUILD:759:1: C++ compilation of rule '//tensorflow/core/kernels:matrix_band_part_op' failed (Exit 1): crosstool_wrapper_driver_rocm failed: error executing command
(cd /root/.cache/bazel/_bazel_root/96f62968e811ec4f04f631ea64f4301a/execroot/org_tensorflow && \
exec env - \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/o$
t/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64:/opt/rocm/bin:/opt/rocm/opencl/bin/x86_64 \
PWD=/proc/self/cwd \
external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_rocm -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections '-std=c++11' -MD -MF b$
zel-out/host/bin/tensorflow/core/kernels/_objs/matrix_band_part_op/matrix_band_part_op.pic.d '-frandom-seed=bazel-out/host/bin/tensorflow/core/kernels/_objs/matrix_band_part_op/matrix_band_part_op.pic.o' -fPIC -DEIGEN_MPL2_ONLY -D__CLAN$
_SUPPORT_DYN_ANNOTATION__ -DTF_USE_SNAPPY -DCURL_STATICLIB -DPLATFORM_LINUX -DENABLE_CURL_CLIENT -DENABLE_NO_ENCRYPTION -iquote . -iquote bazel-out/host/genfiles -iquote bazel-out/host/bin -iquote external/nsync -iquote bazel-out/host/g$
nfiles/external/nsync -iquote bazel-out/host/bin/external/nsync -iquote external/bazel_tools -iquote bazel-out/host/genfiles/external/bazel_tools -iquote bazel-out/host/bin/external/bazel_tools -iquote external/eigen_archive -iquote baz$
l-out/host/genfiles/external/eigen_archive -iquote bazel-out/host/bin/external/eigen_archive -iquote external/local_config_sycl -iquote bazel-out/host/genfiles/external/local_config_sycl -iquote bazel-out/host/bin/external/local_config_$
ycl -iquote external/com_google_absl -iquote bazel-out/host/genfiles/external/com_google_absl -iquote bazel-out/host/bin/external/com_google_absl -iquote external/gif_archive -iquote bazel-out/host/genfiles/external/gif_archive -iquote $
azel-out/host/bin/external/gif_archive -iquote external/jpeg -iquote bazel-out/host/genfiles/external/jpeg -iquote bazel-out/host/bin/external/jpeg -iquote external/protobuf_archive -iquote bazel-out/host/genfiles/external/protobuf_arch$
ve -iquote bazel-out/host/bin/external/protobuf_archive -iquote external/com_googlesource_code_re2 -iquote bazel-out/host/genfiles/external/com_googlesource_code_re2 -iquote bazel-out/host/bin/external/com_googlesource_code_re2 -iquote $
xternal/farmhash_archive -iquote bazel-out/host/genfiles/external/farmhash_archive -iquote bazel-out/host/bin/external/farmhash_archive -iquote external/fft2d -iquote bazel-out/host/genfiles/external/fft2d -iquote bazel-out/host/bin/ext$
rnal/fft2d -iquote external/highwayhash -iquote bazel-out/host/genfiles/external/highwayhash -iquote bazel-out/host/bin/external/highwayhash -iquote external/zlib_archive -iquote bazel-out/host/genfiles/external/zlib_archive -iquote baz$
l-out/host/bin/external/zlib_archive -iquote external/local_config_rocm -iquote bazel-out/host/genfiles/external/local_config_rocm -iquote bazel-out/host/bin/external/local_config_rocm -iquote external/local_config_cuda -iquote bazel-ou$
/host/genfiles/external/local_config_cuda -iquote bazel-out/host/bin/external/local_config_cuda -iquote external/double_conversion -iquote bazel-out/host/genfiles/external/double_conversion -iquote bazel-out/host/bin/external/double_con$
ersion -iquote external/curl -iquote bazel-out/host/genfiles/external/curl -iquote bazel-out/host/bin/external/curl -iquote external/boringssl -iquote bazel-out/host/genfiles/external/boringssl -iquote bazel-out/host/bin/external/boring$
sl -iquote external/jsoncpp_git -iquote bazel-out/host/genfiles/external/jsoncpp_git -iquote bazel-out/host/bin/external/jsoncpp_git -iquote external/aws -iquote bazel-out/host/genfiles/external/aws -iquote bazel-out/host/bin/external/a$
s -isystem external/nsync/public -isystem bazel-out/host/genfiles/external/nsync/public -isystem bazel-out/host/bin/external/nsync/public -isystem external/eigen_archive -isystem bazel-out/host/genfiles/external/eigen_archive -isystem b$
zel-out/host/bin/external/eigen_archive -isystem external/gif_archive/lib -isystem bazel-out/host/genfiles/external/gif_archive/lib -isystem bazel-out/host/bin/external/gif_archive/lib -isystem external/protobuf_archive/src -isystem baz$
l-out/host/genfiles/external/protobuf_archive/src -isystem bazel-out/host/bin/external/protobuf_archive/src -isystem external/farmhash_archive/src -isystem bazel-out/host/genfiles/external/farmhash_archive/src -isystem bazel-out/host/bi$
/external/farmhash_archive/src -isystem external/zlib_archive -isystem bazel-out/host/genfiles/external/zlib_archive -isystem bazel-out/host/bin/external/zlib_archive -isystem external/local_config_rocm/rocm -isystem bazel-out/host/genfi
les/external/local_config_rocm/rocm -isystem bazel-out/host/bin/external/local_config_rocm/rocm -isystem external/local_config_rocm/rocm/rocm/include -isystem bazel-out/host/genfiles/external/local_config_rocm/rocm/rocm/include -isystem
bazel-out/host/bin/external/local_config_rocm/rocm/rocm/include -isystem external/local_config_cuda/cuda -isystem bazel-out/host/genfiles/external/local_config_cuda/cuda -isystem bazel-out/host/bin/external/local_config_cuda/cuda -isyste
m external/local_config_cuda/cuda/cuda/include -isystem bazel-out/host/genfiles/external/local_config_cuda/cuda/cuda/include -isystem bazel-out/host/bin/external/local_config_cuda/cuda/cuda/include -isystem external/local_config_cuda/cud
a/cuda/include/crt -isystem bazel-out/host/genfiles/external/local_config_cuda/cuda/cuda/include/crt -isystem bazel-out/host/bin/external/local_config_cuda/cuda/cuda/include/crt -isystem external/local_config_rocm/rocm/rocm/include/rocra
nd -isystem bazel-out/host/genfiles/external/local_config_rocm/rocm/rocm/include/rocrand -isystem bazel-out/host/bin/external/local_config_rocm/rocm/rocm/include/rocrand -isystem external/double_conversion -isystem bazel-out/host/genfile
s/external/double_conversion -isystem bazel-out/host/bin/external/double_conversion -isystem external/curl/include -isystem bazel-out/host/genfiles/external/curl/include -isystem bazel-out/host/bin/external/curl/include -isystem external
/boringssl/src/include -isystem bazel-out/host/genfiles/external/boringssl/src/include -isystem bazel-out/host/bin/external/boringssl/src/include -isystem external/jsoncpp_git/include -isystem bazel-out/host/genfiles/external/jsoncpp_git
/include -isystem bazel-out/host/bin/external/jsoncpp_git/include -isystem external/aws/aws-cpp-sdk-core/include -isystem bazel-out/host/genfiles/external/aws/aws-cpp-sdk-core/include -isystem bazel-out/host/bin/external/aws/aws-cpp-sdk-
core/include -isystem external/aws/aws-cpp-sdk-kinesis/include -isystem bazel-out/host/genfiles/external/aws/aws-cpp-sdk-kinesis/include -isystem bazel-out/host/bin/external/aws/aws-cpp-sdk-kinesis/include -isystem external/aws/aws-cpp-s
dk-s3/include -isystem bazel-out/host/genfiles/external/aws/aws-cpp-sdk-s3/include -isystem bazel-out/host/bin/external/aws/aws-cpp-sdk-s3/include -g0 '-march=haswell' -g0 -DEIGEN_AVOID_STL_ARRAY -Iexternal/gemmlowp -Wno-sign-compare -fn
o-exceptions '-ftemplate-depth=900' '-DTENSORFLOW_USE_ROCM=1' -msse3 -pthread '-DTENSORFLOW_USE_ROCM=1' -DTENSORFLOW_USE_ROCM -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__T
IME__="redacted"' -D__HIP_PLATFORM_HCC__ -DEIGEN_USE_HIP -fno-canonical-system-headers -c tensorflow/core/kernels/matrix_band_part_op.cc -o bazel-out/host/bin/tensorflow/core/kernels/_objs/matrix_band_part_op/matrix_band_part_op.pic.o)
tensorflow/core/kernels/matrix_band_part_op.cc: In instantiation of 'void tensorflow::functor::MatrixBandPartFunctor<Eigen::ThreadPoolDevice, Scalar>::operator()(tensorflow::OpKernelContext*, const CPUDevice&, int, int, typename tensorflow::TTypes<Scalar, 3>::ConstTensor, typename tensorflow::TTypes<Scalar, 3>::Tensor) [with Scalar = long long int; tensorflow::functor::CPUDevice = Eigen::ThreadPoolDevice; typename tensorflow::TTypes<Scalar, 3>::ConstTensor = Eigen::TensorMap<Eigen::Tensor<const long long int, 3, 1, long int>, 16, Eigen::MakePointer>; typename tensorflow::TTypes<Scalar, 3>::Tensor = Eigen::TensorMap<Eigen::Tensor<long long int, 3, 1, long int>, 16, Eigen::MakePointer>]':
tensorflow/core/kernels/matrix_band_part_op.cc:193:1: required from here
tensorflow/core/kernels/matrix_band_part_op.cc:153:41: internal compiler error: in lookup_template_class_1, at cp/pt.c:9459
const int64 batch_begin = begin / m;
^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-8/README.Bugs> for instructions.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 1030.855s, Critical Path: 127.53s, Remote (0.00% of the time): [queue: 0.00%, setup: 0.00%, process: 0.00%]
INFO: 4988 processes: 4988 local.
FAILED: Build did NOT complete successfully
This is a gcc bug: for complex C++ templates gcc may sometimes blow up. To overcome that you might need more RAM on the system.
The host has 16 GB of RAM and I didn't see the system run out of memory.
Could you verify whether gcc always crashes at the same spot? Also, may I ask which CPU you are using? In my experience this may happen occasionally on 1st gen Ryzen, but it normally goes away with a BIOS update, and it shouldn't happen on 2nd gen Ryzen or EPYC.
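To check whether the crash is always at the same spot, the internal compiler error lines from several build attempts can be collected and counted. A crash that moves around between runs would fit the hardware-related explanation, while one that always hits the same source line looks more like a deterministic compiler problem. The sketch below assumes each attempt's Bazel output has been saved as text. The names `ice_locations` and `summarize` are hypothetical, and the pattern only covers the `file:line:col: internal compiler error: ...` form seen in the log above.

```python
import re
from collections import Counter

# Matches lines such as:
#   path/to/file.cc:153:41: internal compiler error: in some_function, at cp/pt.c:9459
ICE_PATTERN = re.compile(
    r"^(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): "
    r"internal compiler error: (?P<msg>.*)$",
    re.MULTILINE,
)


def ice_locations(log_text):
    """Return (file, line, column, message) for every ICE line in one build log."""
    return [
        (m.group("file"), int(m.group("line")), int(m.group("col")), m.group("msg").strip())
        for m in ICE_PATTERN.finditer(log_text)
    ]


def summarize(logs):
    """Count how often each ICE location appears across several build logs."""
    counts = Counter()
    for log_text in logs:
        counts.update(ice_locations(log_text))
    return counts
```

If `summarize` reports a single location with a count equal to the number of attempts, gcc is crashing at the same spot every time. Several different locations would suggest the crash is not tied to one piece of code.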
I kept resuming the build every time it crashed, and it finally finished and generated the wheel package. However, the generated package does not work well on my host: whenever an Eigen kernel needs to be launched, it fails with an internal error response.
I need a healthy tensorflow-rocm package built for gfx902. Is there some way to get access to one?
@ghostplant unfortunately there are some warning signs based on your comments:
@ghostplant one thing you can experiment with is testing the Docker image referred to by @sunway513 earlier in the thread:
rocm/tensorflow:rocm2.0-tf1.12-python3-dev
The image has all dependent packages installed, but all GPU kernels inside are built for gfx803, gfx900, and gfx906, and I don't really know what happens on a gfx902 system. More than likely, TensorFlow would complain at initialization that no compatible GPU was found and then execute the model on the CPU.
OK, has the 2nd gen Ryzen APU been released, and does it support ROCm now?
APU support on ROCm is still under internal development at this moment. There are quite a few lower-level software components (Linux driver, low-level runtime, high-level runtime, compiler) that need to be revised. Some PRs in this project were actually made to lay the groundwork for APU support. Please keep a watchful eye on announcements from AMD this year.
OK, Thank you for your information!
So what happened to APU support? If there is anything written down, I would love to read it ;)
OS => Ubuntu 18.10 (with Linux Image 4.20)
Python => 2.7
Tensorflow => https://github.com/ROCmSoftwarePlatform/tensorflow-upstream, branch = r1.12-rocm
Bazel => 0.15.0
Build command =>
./build_rocm
AMD GPUs => AMD Ryzen
ROCm Version => 2.0.0
GCC Version => GCC-8
Bazel Error Logs =>