Open mojitonoproblem opened 3 years ago
Could you please run the build_rocm_python3 script to start the build (it is located in the root folder of the repository)?
@mojitonoproblem Just curious: can Hawaii (r290/r390) run properly on ROCm-4.3.0? The ROCm team said that only ROCm-1.9.3, which was released in 2018, supports Hawaii. Have you tested any small samples, for example the hip square sample?
I didn't know that. Please let me know how to run those samples and I will post the results. Thank you
@mojitonoproblem I did not notice the GPU model you mentioned in the initial comment. Let me ping a member of the team for further assistance. cc @sunway513
Perfect. I hardcoded the path to the ROCM installation directory and the Python bin, and it is building so far.
@reza-amd I'm pasting the error resulting from running build_rocm_python3:
ERROR: /home/minion/.cache/bazel/_bazel_minion/95be990c2bc0fe49a10affcebca4a754/external/local_config_rocm/rocm/BUILD:129:11: @local_config_rocm//rocm:rocprim: missing input file 'external/local_config_rocm/rocm/rocm/include/hipcub/hipcub_version.hpp', owner: '@local_config_rocm//rocm:rocm/include/hipcub/hipcub_version.hpp'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /home/minion/.cache/bazel/_bazel_minion/95be990c2bc0fe49a10affcebca4a754/external/local_config_rocm/rocm/BUILD:129:11 2 input file(s) do not exist
INFO: Elapsed time: 5816.222s, Critical Path: 115.56s
INFO: 3324 processes: 109 internal, 3215 local.
FAILED: Build did NOT complete successfully
Let me know if I can do anything to help. Thank you
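A quick way to narrow down a "missing input file" error like the one above is to check whether the header exists under the ROCm install at all. The following is a minimal sketch, not part of the TensorFlow build: the function name and the candidate sub-directories (`include` and `hipcub/include`) are assumptions for illustration, and the layout may differ between ROCm releases.

```python
from pathlib import Path

# Header named in the Bazel error message.
HEADER = Path("hipcub") / "hipcub_version.hpp"

# Assumption: sub-directories of the ROCm root that might hold the header.
CANDIDATE_DIRS = ("include", "hipcub/include")


def find_header(rocm_root):
    """Return every existing copy of HEADER under the candidate directories."""
    root = Path(rocm_root)
    hits = []
    for sub_dir in CANDIDATE_DIRS:
        candidate = root / sub_dir / HEADER
        if candidate.is_file():
            hits.append(candidate)
    return hits
```

Calling `find_header("/opt/rocm-4.3.0")` returns an empty list if the header is not in either location, which would suggest the problem is in the ROCm install rather than in the Bazel configuration.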
@mojitonoproblem do you have the ROCM_PATH and TF_NEED_ROCM env vars set when you run the configure command? (For example, see https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/build_rocm_python3#L31 )
If not, please set them and retry.
If you are setting them and still running into the error, please paste the .tf_configure.bazelrc file here.
Your .tf_configure.bazelrc should look something like:
root@ixt-rack-04:/root/tensorflow# cat .tf_configure.bazelrc
build --action_env PYTHON_BIN_PATH="/usr/bin/python3"
build --action_env PYTHON_LIB_PATH="/usr/lib/python3/dist-packages"
build --python_path="/usr/bin/python3"
build --config=rocm
build --action_env ROCM_PATH="/opt/rocm-4.3.1"
build --action_env ROCBLAS_TENSILE_LIBPATH="/opt/rocm-4.3.1/lib/library"
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_env=LD_LIBRARY_PATH
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu,-v1only
Note the --action_env ROCM_PATH=... line.
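The check described above (is there a ROCM_PATH action_env line, and does it point at a real directory?) can be automated. This is a rough sketch under the assumption that the file uses the same `build --action_env NAME="VALUE"` form shown in the listing; the function names are made up for illustration and are not part of TensorFlow's configure script.

```python
import re
from pathlib import Path

# Matches lines such as: build --action_env ROCM_PATH="/opt/rocm-4.3.1"
ACTION_ENV = re.compile(r'^build\s+--action_env\s+(\w+)=(?:"([^"]*)"|(\S+))\s*$')


def read_action_env(text):
    """Collect the `build --action_env NAME=VALUE` settings from bazelrc text."""
    env = {}
    for line in text.splitlines():
        match = ACTION_ENV.match(line.strip())
        if match:
            name, quoted, bare = match.groups()
            env[name] = quoted if quoted is not None else bare
    return env


def check_rocm_config(text):
    """Return a list of problems found; an empty list means none were spotted."""
    problems = []
    env = read_action_env(text)
    if "ROCM_PATH" not in env:
        problems.append("no --action_env ROCM_PATH line")
    elif not Path(env["ROCM_PATH"]).is_dir():
        problems.append("ROCM_PATH does not exist: " + env["ROCM_PATH"])
    if not any(line.strip() == "build --config=rocm" for line in text.splitlines()):
        problems.append("no 'build --config=rocm' line")
    return problems
```

For example, `check_rocm_config(Path(".tf_configure.bazelrc").read_text())` run from the repository root would list anything missing. It only inspects the file, so a clean result does not guarantee that the ROCm install itself is complete.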
@reza-amd Thank you for your kind help. I'm pasting the error resulting from the last attempt (with the env variables and Python path set):
ERROR: /home/minion/tensorflow-upstream/tensorflow/core/data/service/BUILD:544:23: //tensorflow/core/data/service:server_lib_headers_lib: missing input file 'external/local_config_rocm/rocm/rocm/include/hipcub/hipcub_version.hpp', owner: '@local_config_rocm//rocm:rocm/include/hipcub/hipcub_version.hpp'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /home/minion/tensorflow-upstream/tensorflow/core/data/service/BUILD:544:23 2 input file(s) do not exist
INFO: Elapsed time: 5960.307s, Critical Path: 97.00s
INFO: 3406 processes: 56 internal, 3350 local.
FAILED: Build did NOT complete successfully
as well as .tf_configure.bazelrc, as requested:
$ cat tensorflow-upstream/.tf_configure.bazelrc
build --action_env PYTHON_BIN_PATH="/home/minion/anaconda3/envs/ai/bin/python"
build --action_env PYTHON_LIB_PATH="/home/minion/anaconda3/envs/ai/lib/python3.9/site-packages"
build --python_path="/home/minion/anaconda3/envs/ai/bin/python"
build --config=rocm
build --action_env ROCM_PATH="/opt/rocm-4.3.0"
build --action_env ROCBLAS_TENSILE_LIBPATH="/opt/rocm-4.3.0/lib/library"
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_env=LD_LIBRARY_PATH
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu,-v1only
Thank you,
It looks like you are using ROCm 4.3.0. Could you please switch to ROCm 4.3.1 and try again? Thanks.
@deven-amd It seems that there is a missing library, although it is actually installed:
$ /opt/rocm/bin/rocminfo
/opt/rocm/bin/rocminfo: error while loading shared libraries: libhsakmt.so.1: cannot open shared object file: No such file or directory
$ dpkg -L hsakmt-roct
/opt
/opt/rocm-4.3.0
/opt/rocm-4.3.0/lib
/opt/rocm-4.3.0/lib/libhsakmt.so.1.0.40300
/opt/rocm-4.3.0/share
/opt/rocm-4.3.0/share/doc
/opt/rocm-4.3.0/share/doc/hsakmt
/opt/rocm-4.3.0/share/doc/hsakmt/LICENSE.md
/opt/rocm-4.3.0/lib/libhsakmt.so
/opt/rocm-4.3.0/lib/libhsakmt.so.1
I cannot manage to get it to work. Thanks
Please note that even with the repository 4.3.1, the installation directory is named 4.3.0.
For ROCm 4.3.1, the directory should be /opt/rocm-4.3.1
# dpkg -L hsakmt-roct
/opt
/opt/rocm-4.3.1
/opt/rocm-4.3.1/lib
/opt/rocm-4.3.1/lib/libhsakmt.so.1.0.40301
/opt/rocm-4.3.1/share
/opt/rocm-4.3.1/share/doc
/opt/rocm-4.3.1/share/doc/hsakmt
/opt/rocm-4.3.1/share/doc/hsakmt/LICENSE.md
/opt/rocm-4.3.1/lib/libhsakmt.so
/opt/rocm-4.3.1/lib/libhsakmt.so.1
I think maybe your ROCm install is not quite right. Perhaps try removing it altogether and putting on 4.3.1 fresh.
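To see at a glance which versioned ROCm directories are present and which one the unversioned /opt/rocm path resolves to, something like the sketch below may help. It assumes that /opt/rocm is a link to a versioned directory such as /opt/rocm-4.3.1, which is not confirmed in this thread, and the function name is hypothetical.

```python
import os
from pathlib import Path


def describe_rocm_installs(opt_dir="/opt"):
    """Report versioned ROCm directories and where the unversioned `rocm` path points."""
    opt = Path(opt_dir)
    # Versioned installs look like rocm-4.3.0, rocm-4.3.1, ...
    versions = sorted(
        p.name for p in opt.glob("rocm-*") if p.is_dir() and not p.is_symlink()
    )
    link = opt / "rocm"
    # Assumption: `rocm` is a link to one of the versioned directories.
    target = os.path.realpath(link) if link.exists() else None
    # Check each install for the library that rocminfo failed to load.
    has_lib = {
        name: (opt / name / "lib" / "libhsakmt.so.1").exists() for name in versions
    }
    return {"versions": versions, "rocm_link_target": target, "libhsakmt": has_lib}
```

If `rocm_link_target` is `None`, or points at a directory whose `libhsakmt` entry is `False`, that would be consistent with rocminfo failing to load the library even though the package is installed.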
Hi @mojitonoproblem, Were you able to get a fresh rocm 4.3.1 install and try again?
Hi @jayfurmanek, I have not been able to yet. I didn't want to add useless info to this thread. I started with a new install but cannot get it to work. I'm going to try again today.
$ /opt/rocm-4.3.1/bin/rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Cannot allocate memory
$ dmesg | grep kfd
$
Thanks.
I uninstalled, rebooted and reinstalled everything, but cannot get rid of the previous error message. Any hints? Thanks in advance
It seems your ROCm install is still not right. How did you remove/reinstall? Maybe that will shed some light on what is going on here.
Also, note we did just move the top of the develop-upstream branch to be ROCm-4.5 based if you are still interested in building the latest.
@jayfurmanek thank you, I removed the 4.3.0 version using sudo apt purge rocm* comgr rock-dkms, then changed the apt source to point to 4.3.1 and then ran sudo apt install rocm-dkms (after apt update of course).
Now I'm pulling develop-upstream and changing apt source to 4.5. After attempting to build I'll post my results. Thanks
It appears that it cannot find rock-dkms. I just rebooted after removing 4.3.1 and issued the following command:
$ sudo apt install rocm-dkms
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
rocm-dkms : Depends: rock-dkms but it is not installable
E: Unable to correct problems, you have held broken packages.
@mojitonoproblem I was able to find a workaround for this issue (I am not sure if it is the right way).
I added "-I/opt/rocm/include/" to copts in the bazel file. After adding this, it compiled successfully!
Hope it helps you as well!
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
TensorFlow installed from (source or binary): None
TensorFlow version: develop-upstream
$ python --version
Python 3.9.7
$ bazel --version
bazel 3.7.2
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Trying to build, bazel gives the following error message:
ERROR: /home/minion/tensorflow-upstream/tensorflow/stream_executor/rocm/BUILD:393:11: undeclared inclusion(s) in rule '//tensorflow/stream_executor/rocm:rocm_helpers':
this rule is missing dependency declarations for the following files included by 'tensorflow/stream_executor/rocm/rocm_helpers.cu.cc':
'/opt/rocm-4.3.0/hip/include/hip/hip_runtime.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_version.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_common.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_runtime.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_common.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_runtime_api.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_runtime_api.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/host_defines.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/driver_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_texture_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/channel_descriptor.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_vector_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/texture_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_surface_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_ldg.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_atomic.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/device_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/math_fwd.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_vector_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/device_library_decls.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/llvm_intrinsics.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/surface_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/texture_fetch_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_texture_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/ockl_image.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/texture_indirect_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/math_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_fp16_math_fwd.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_memory.h'
'/opt/rocm-4.3.0/hip/include/hip/library_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/library_types.h'
clang-13: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
Target //tensorflow/tools/pip_package:build_pip_package failed to build
$ ./configure
$ bazel build --verbose_failures //tensorflow/tools/pip_package:build_pip_package
cc_library(
    name = "rocm_helpers",
    srcs = ["rocm_helpers.cu.cc"],
    hdrs = ["include/hip/hip_runtime.h"],
    deps = [
        "@local_config_rocm//rocm:rocm_headers",
    ],
    copts = rocm_copts(),
    alwayslink = True,
)
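The "undeclared inclusion(s)" message lists many headers, and it can be easier to reason about once the paths are reduced to the one directory they share. The sketch below is a small helper for reading such a log; the function name is made up, and it only handles absolute paths ending in .h or .hpp.

```python
import os
import re

# Absolute header paths quoted in the Bazel message, e.g. '/opt/.../hip_runtime.h'
QUOTED_PATH = re.compile(r"'(/[^']+\.h(?:pp)?)'")


def undeclared_header_root(error_text):
    """Return (common directory, header count) for the headers quoted in the error."""
    headers = QUOTED_PATH.findall(error_text)
    if not headers:
        return None, 0
    return os.path.commonpath(headers), len(headers)
```

Applied to the error above, every listed header sits under /opt/rocm-4.3.0/hip/include/hip, so that is the directory the rule's declared dependencies apparently do not cover.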