ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Undeclared inclusions building from source #1467

Open mojitonoproblem opened 3 years ago

mojitonoproblem commented 3 years ago

System information

- Python version:

$ python --version
Python 3.9.7

- Installed using virtualenv? pip? conda?:
conda
- Bazel version (if compiling from source):

$ bazel --version
bazel 3.7.2


- GCC/Compiler version (if compiling from source):

$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

- ROCm/MIOpen version:
4.3.0

- GPU model and memory:
  Marketing Name: Hawaii PRO [Radeon R9 290/390]

**Describe the problem**

Trying to build, bazel gives the following error message:

ERROR: /home/minion/tensorflow-upstream/tensorflow/stream_executor/rocm/BUILD:393:11: undeclared inclusion(s) in rule '//tensorflow/stream_executor/rocm:rocm_helpers':
this rule is missing dependency declarations for the following files included by 'tensorflow/stream_executor/rocm/rocm_helpers.cu.cc':
'/opt/rocm-4.3.0/hip/include/hip/hip_runtime.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_version.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_common.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_runtime.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_common.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_runtime_api.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_runtime_api.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/host_defines.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/driver_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_texture_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/channel_descriptor.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_vector_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/texture_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_surface_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_ldg.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_atomic.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/device_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/math_fwd.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_vector_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/device_library_decls.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/llvm_intrinsics.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/surface_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/texture_fetch_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/hip_texture_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/ockl_image.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/texture_indirect_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/math_functions.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_fp16_math_fwd.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_memory.h'
'/opt/rocm-4.3.0/hip/include/hip/library_types.h'
'/opt/rocm-4.3.0/hip/include/hip/amd_detail/library_types.h'
clang-13: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
Target //tensorflow/tools/pip_package:build_pip_package failed to build
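
Every header named in the error sits under the same versioned ROCm tree. The following is a minimal Python sketch, not part of the TensorFlow build, for pulling those paths out of a saved Bazel log and finding their common root. The helper names `undeclared_headers` and `common_include_root` are hypothetical, and the regular expression assumes the single-quoted format shown in this log.

```python
import os
import re


def undeclared_headers(log_text):
    """Return the absolute .h/.hpp paths that appear in single quotes."""
    # Only quoted strings that start with "/" and end in .h or .hpp match,
    # so rule labels and source file names in the same message are skipped.
    return re.findall(r"'(/[^']+\.h(?:pp)?)'", log_text)


def common_include_root(headers):
    """Return the deepest directory shared by all headers, or None."""
    return os.path.commonpath(headers) if headers else None


if __name__ == "__main__":
    sample = (
        "included by 'tensorflow/stream_executor/rocm/rocm_helpers.cu.cc':\n"
        "'/opt/rocm-4.3.0/hip/include/hip/hip_runtime.h'\n"
        "'/opt/rocm-4.3.0/hip/include/hip/amd_detail/hip_common.h'\n"
    )
    print(common_include_root(undeclared_headers(sample)))
```

If the reported root differs from the ROCm path given to `./configure` (for example a versioned directory versus an `/opt/rocm` symlink), that difference may be worth mentioning in the report.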

**Provide the exact sequence of commands / steps that you executed before running into the problem**

$ ./configure
$ bazel build --verbose_failures //tensorflow/tools/pip_package:build_pip_package


**Any other info / logs**
I tried declaring the dependency by symlinking the ROCm include directory into the package:
`sudo ln -s /opt/rocm/include/ tensorflow/stream_executor/rocm/include`

and listing the header in the rule:

cc_library(
    name = "rocm_helpers",
    srcs = ["rocm_helpers.cu.cc"],
    hdrs = ["include/hip/hip_runtime.h"],
    deps = [
        "@local_config_rocm//rocm:rocm_headers",
    ],
    copts = rocm_copts(),
    alwayslink = True,
)



but it only leads to another error (duplicate declaration).

reza-amd commented 3 years ago

Could you please run the build_rocm_python3 script to start the build (it is located in the root folder of the repository)?

xuhuisheng commented 3 years ago

@mojitonoproblem Just curious: can Hawaii (R9 290/390) run properly on ROCm 4.3.0? The ROCm team said that only ROCm 1.9.3, which was released in 2018, supports Hawaii. Have you tested any small samples, for example the HIP square sample?

reza-amd commented 3 years ago

@mojitonoproblem I did not notice the GPU model you mentioned in the initial comment. Let me ping a member of the team for further assistance. cc @sunway513

mojitonoproblem commented 3 years ago

> @mojitonoproblem Just curious: can Hawaii (R9 290/390) run properly on ROCm 4.3.0? The ROCm team said that only ROCm 1.9.3, which was released in 2018, supports Hawaii. Have you tested any small samples, for example the HIP square sample?

I didn't know that. Please let me know how to run those samples and I will post the results. Thank you

mojitonoproblem commented 3 years ago

> @mojitonoproblem I did not notice the GPU model you mentioned in the initial comment. Let me ping a member of the team for further assistance. cc @sunway513

Perfect. I hardcoded the path to the ROCm installation directory and the Python binary, and it is building so far.

mojitonoproblem commented 3 years ago

@reza-amd I'm pasting the error resulting from running build_rocm_python3:

ERROR: /home/minion/.cache/bazel/_bazel_minion/95be990c2bc0fe49a10affcebca4a754/external/local_config_rocm/rocm/BUILD:129:11: @local_config_rocm//rocm:rocprim: missing input file 'external/local_config_rocm/rocm/rocm/include/hipcub/hipcub_version.hpp', owner: '@local_config_rocm//rocm:rocm/include/hipcub/hipcub_version.hpp'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /home/minion/.cache/bazel/_bazel_minion/95be990c2bc0fe49a10affcebca4a754/external/local_config_rocm/rocm/BUILD:129:11 2 input file(s) do not exist
INFO: Elapsed time: 5816.222s, Critical Path: 115.56s
INFO: 3324 processes: 109 internal, 3215 local.
FAILED: Build did NOT complete successfully

Let me know if I can do anything to help. Thank you
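
The error above names `hipcub/hipcub_version.hpp` as a missing input. One quick check is whether that header exists anywhere under the ROCm root at all. The following is a minimal Python sketch; the function name `find_header` is hypothetical, and the candidate subdirectories `include` and `hipcub/include` are assumptions about how a ROCm tree may be laid out, not something the build confirms.

```python
import os


def find_header(rocm_root, header, subdirs=("include", "hipcub/include")):
    """Return every existing candidate path for header under rocm_root."""
    found = []
    for sub in subdirs:  # assumed layouts; extend as needed
        candidate = os.path.join(rocm_root, sub, header)
        if os.path.isfile(candidate):
            found.append(candidate)
    return found


if __name__ == "__main__":
    # /opt/rocm-4.3.0 is the ROCM_PATH used in this report.
    print(find_header("/opt/rocm-4.3.0", "hipcub/hipcub_version.hpp"))
```

An empty list would suggest that the hipCUB headers are not installed under that root.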

deven-amd commented 3 years ago

@mojitonoproblem do you have the ROCM_PATH and TF_NEED_ROCM env vars set when you run the configure command? (for example, https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/build_rocm_python3#L31 )

If not, please set them and retry. If you are setting them and still running into the error, please paste the .tf_configure.bazelrc file here.

Your .tf_configure.bazelrc should look something like

root@ixt-rack-04:/root/tensorflow# cat .tf_configure.bazelrc 
build --action_env PYTHON_BIN_PATH="/usr/bin/python3"
build --action_env PYTHON_LIB_PATH="/usr/lib/python3/dist-packages"
build --python_path="/usr/bin/python3"
build --config=rocm
build --action_env ROCM_PATH="/opt/rocm-4.3.1"
build --action_env ROCBLAS_TENSILE_LIBPATH="/opt/rocm-4.3.1/lib/library"
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_env=LD_LIBRARY_PATH
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu,-v1only

Note the --action_env ROCM_PATH=... line.
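
To confirm that the generated file carries the expected ROCM_PATH, the `--action_env` entries can be read back programmatically. The following is a minimal Python sketch; `read_action_env` is a hypothetical helper, and it only handles the two-token form `build --action_env NAME="value"` shown in the listing above.

```python
import shlex


def read_action_env(bazelrc_text):
    """Return {NAME: value} for each `build --action_env NAME=value` line."""
    env = {}
    for line in bazelrc_text.splitlines():
        # shlex strips the double quotes around the value and drops comments.
        tokens = shlex.split(line, comments=True)
        if len(tokens) >= 3 and tokens[:2] == ["build", "--action_env"]:
            name, sep, value = tokens[2].partition("=")
            if sep:
                env[name] = value
    return env


if __name__ == "__main__":
    sample = (
        'build --action_env PYTHON_BIN_PATH="/usr/bin/python3"\n'
        "build --config=rocm\n"
        'build --action_env ROCM_PATH="/opt/rocm-4.3.1"\n'
    )
    print(read_action_env(sample).get("ROCM_PATH"))
```

To run it against a real checkout, pass in the contents of `.tf_configure.bazelrc` and compare the `ROCM_PATH` value with the directory that actually exists on disk.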

mojitonoproblem commented 3 years ago

@reza-amd Thank you for your kind help. I'm pasting the error from the last attempt (with the env variables and Python path set):

ERROR: /home/minion/tensorflow-upstream/tensorflow/core/data/service/BUILD:544:23: //tensorflow/core/data/service:server_lib_headers_lib: missing input file 'external/local_config_rocm/rocm/rocm/include/hipcub/hipcub_version.hpp', owner: '@local_config_rocm//rocm:rocm/include/hipcub/hipcub_version.hpp'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /home/minion/tensorflow-upstream/tensorflow/core/data/service/BUILD:544:23 2 input file(s) do not exist
INFO: Elapsed time: 5960.307s, Critical Path: 97.00s
INFO: 3406 processes: 56 internal, 3350 local.
FAILED: Build did NOT complete successfully

as well as .tf_configure.bazelrc, as requested:

$ cat tensorflow-upstream/.tf_configure.bazelrc 
build --action_env PYTHON_BIN_PATH="/home/minion/anaconda3/envs/ai/bin/python"
build --action_env PYTHON_LIB_PATH="/home/minion/anaconda3/envs/ai/lib/python3.9/site-packages"
build --python_path="/home/minion/anaconda3/envs/ai/bin/python"
build --config=rocm
build --action_env ROCM_PATH="/opt/rocm-4.3.0"
build --action_env ROCBLAS_TENSILE_LIBPATH="/opt/rocm-4.3.0/lib/library"
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_env=LD_LIBRARY_PATH
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu,-v1only

Thank you,

deven-amd commented 3 years ago

Looks like you are using ROCm 4.3.0. Could you switch to ROCm 4.3.1 and try it out? Thanks.

mojitonoproblem commented 3 years ago

@deven-amd It seems that there is a missing library, although it is actually installed:

$ /opt/rocm/bin/rocminfo
/opt/rocm/bin/rocminfo: error while loading shared libraries: libhsakmt.so.1: cannot open shared object file: No such file or directory
$ dpkg -L hsakmt-roct
/opt
/opt/rocm-4.3.0
/opt/rocm-4.3.0/lib
/opt/rocm-4.3.0/lib/libhsakmt.so.1.0.40300
/opt/rocm-4.3.0/share
/opt/rocm-4.3.0/share/doc
/opt/rocm-4.3.0/share/doc/hsakmt
/opt/rocm-4.3.0/share/doc/hsakmt/LICENSE.md
/opt/rocm-4.3.0/lib/libhsakmt.so
/opt/rocm-4.3.0/lib/libhsakmt.so.1

I cannot get it to work. Thanks
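
Note that `dpkg -L` prints the file list recorded for the package; it does not verify that each file is present on disk or that a symlink such as `/opt/rocm` resolves. The following is a minimal Python sketch for classifying the paths directly. The function name `describe_path` is hypothetical, and the paths in the usage lines are simply the ones mentioned in this comment.

```python
import os


def describe_path(path):
    """Classify path as a symlink (working or broken), directory, file, or missing."""
    if os.path.islink(path):
        target = os.path.realpath(path)
        if os.path.exists(target):
            return "symlink -> " + target
        return "broken symlink -> " + os.readlink(path)
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


if __name__ == "__main__":
    for p in ("/opt/rocm", "/opt/rocm-4.3.0/lib/libhsakmt.so.1"):
        print(p, "=>", describe_path(p))
```

If the files exist but the loader still cannot find `libhsakmt.so.1`, the library directory may simply not be on the dynamic loader's search path; that is a guess, not something this output proves.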

mojitonoproblem commented 3 years ago

Please note that even with the 4.3.1 repository, the installation directory is named 4.3.0.

jayfurmanek commented 3 years ago

For ROCm 4.3.1, the directory should be /opt/rocm-4.3.1

# dpkg -L hsakmt-roct
/opt
/opt/rocm-4.3.1
/opt/rocm-4.3.1/lib
/opt/rocm-4.3.1/lib/libhsakmt.so.1.0.40301
/opt/rocm-4.3.1/share
/opt/rocm-4.3.1/share/doc
/opt/rocm-4.3.1/share/doc/hsakmt
/opt/rocm-4.3.1/share/doc/hsakmt/LICENSE.md
/opt/rocm-4.3.1/lib/libhsakmt.so
/opt/rocm-4.3.1/lib/libhsakmt.so.1

I think maybe your ROCm install is not quite right. Perhaps try removing it altogether and installing 4.3.1 fresh.

jayfurmanek commented 3 years ago

Hi @mojitonoproblem, were you able to get a fresh ROCm 4.3.1 install and try again?

mojitonoproblem commented 3 years ago

Hi @jayfurmanek, not yet. I didn't want to add useless info to this thread. I started with a new install but cannot get it to work. I'm going to try again today.

$ /opt/rocm-4.3.1/bin/rocminfo 
ROCk module is loaded
Unable to open /dev/kfd read-write: Cannot allocate memory
$ dmesg | grep kfd
$ 

Thanks.

mojitonoproblem commented 3 years ago

I uninstalled, rebooted and reinstalled everything, but cannot get rid of the previous error message. Any hints? Thanks in advance

jayfurmanek commented 3 years ago

It seems your ROCm install is still not right. How did you remove/reinstall? Maybe that will shed some light on what is going on here.

Also, note we did just move the top of the develop-upstream branch to be ROCm-4.5 based if you are still interested in building the latest.

mojitonoproblem commented 3 years ago

@jayfurmanek thank you. I removed the 4.3.0 version using `sudo apt purge rocm* comgr rock-dkms`, then changed the apt source to point to 4.3.1, and then ran `sudo apt install rocm-dkms` (after `apt update`, of course).

Now I'm pulling develop-upstream and changing apt source to 4.5. After attempting to build I'll post my results. Thanks

mojitonoproblem commented 3 years ago

It appears that it cannot find rock-dkms. I just rebooted after removing 4.3.1 and issued the following command:

$ sudo apt install rocm-dkms
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 rocm-dkms : Depends: rock-dkms but it is not installable
E: Unable to correct problems, you have held broken packages.

AliJahan commented 1 year ago

@mojitonoproblem I was able to find a workaround for this issue (I am not sure if it is the right way). I added "-I/opt/rocm/include/" to copts in the Bazel BUILD file. After adding this, it compiled successfully! Hope it helps you as well!