Closed dipietrantonio closed 3 months ago
Can you try to update and rebuild your LLVM and device-libs:
https://github.com/RadeonOpenCompute/llvm-project/tree/rocm-5.6.x https://github.com/RadeonOpenCompute/ROCm-Device-Libs/tree/rocm-5.6.x
Hi @dayatsin-amd, LLVM and device-libs are (or should be) those of the 5.6.1 release, because I downloaded all projects using the repo tool.
As you can also see here:
-- The C compiler identification is Clang 16.0.0
-- The CXX compiler identification is Clang 16.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /software/projects/pawsey0001/cdipietrantonio/rocm5.6/rocm-5.6.1rev0/llvm/bin/clang - skipped
Anyway, I will give it another try.
The device libs CMake file contains a check for the -Xclang -mcode-object-version=none option (https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/rocm-5.6.x/cmake/OCL.cmake#L36). An old clang does not support this option, so the check fails and the option is disabled. Device libs are then compiled with a fixed code object version, which can produce the issue you encountered when they are linked with bitcode generated with a non-default code object version.
Did you use the clang from the rocm-5.6.x branch when building device libs?
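To see which side of that check a given clang falls on, the compiler can be probed directly. The following is only a sketch, not the logic used in OCL.cmake: the function name accepts_code_object_none and the CC override are hypothetical, and the flag combination is assembled from the options quoted in this thread, so it may need adjusting for a particular toolchain.

```shell
# Compiler to probe; point CC at the clang that will build device libs.
CC=${CC:-clang}

# Succeeds (exit status 0) if the compiler accepts
# -Xclang -mcode-object-version=none for an empty OpenCL input on stdin.
accepts_code_object_none() {
    "$CC" -x cl -Xclang -mcode-object-version=none \
        -target amdgcn-amd-amdhsa -fsyntax-only - \
        </dev/null >/dev/null 2>&1
}

# Example use (commented out so that sourcing this file runs nothing):
# if accepts_code_object_none; then
#     echo "option accepted"
# else
#     echo "option rejected: device libs would get a fixed code object version"
# fi
```

If the probe fails with the clang found first on PATH but succeeds with the clang from the ROCm LLVM build, the device libs were probably configured against the wrong compiler.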
We have fixed https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/cmake/OCL.cmake#L43 so that -mcode-object-version=none is used unconditionally. The option was added a long time ago, so there is no need to check its availability; a compiler that does not support it would not be able to compile device libs anyway. Another reason to skip the check is that, when device libs are built as an external project of LLVM, there is no guarantee that clang has already been built at the point of this check.
Hopefully, this fix will get into the next ROCm release.
Thanks for the update, we will try the new version as soon as possible!
Hi @yxsamliu, I have tried patching the cmake file with the appropriate line, and I am no longer getting the original issue. However, I am now encountering another one: when building the ROCR-Runtime, I get errors like undefined symbol __llvm_amdgcn_image_load_* (see below).
BUILDING bitcode for ocl_blit_object_gfx700...
cd /scratch/pawsey0001/spack/rocm-build-gcc-nolibc/rocm-5.7.0-gcc/ROCR-Runtime/src/build/image/blit_src && /software/setonix/2023.08/pawsey/software/rocm/gcc/12.2.0/rocm-5.7.0rev1/llvm/bin/clang-17 -O2 -x cl -Xclang -finclude-default-header -cl-denorms-are-zero -cl-std=CL2.0 -target amdgcn-amd-amdhsa -mcpu=gfx700 -mcode-object-version=4 -o ocl_blit_object_gfx700 /scratch/pawsey0001/spack/rocm-build-gcc-nolibc/rocm-5.7.0-gcc/ROCR-Runtime/src/image/blit_src/imageblit_kernels.cl
ld.lld: error: undefined symbol: __llvm_amdgcn_image_load_2darray_v4f32_i32
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image_float)
>>> referenced 9 more times
ld.lld: error: undefined symbol: __llvm_amdgcn_image_load_1darray_v4f32_i32
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image_float)
>>> referenced 9 more times
ld.lld: error: undefined symbol: __llvm_amdgcn_image_load_3d_v4f32_i32
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image_float)
>>> referenced 9 more times
ld.lld: error: undefined symbol: __llvm_amdgcn_image_load_2d_v4f32_i32
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image_float)
>>> referenced 9 more times
ld.lld: error: undefined symbol: __llvm_amdgcn_image_load_1d_v4f32_i32
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image)
>>> referenced by /tmp/imageblit_kernels-55e2a1.o:(read_image_float)
>>> referenced 11 more times
Do you have some idea what the underlying issue might be?
For completeness, the configuration was:
cmake -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<local_path> -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCMAKE_Fortran_COMPILER=ftn -DCMAKE_CXX_FLAGS=-fuse-ld=lld -DBITCODE_DIR=<pathtollvsm>/llvm/amdgcn/bitcode ..
-- The C compiler identification is Clang 17.0.0
-- The CXX compiler identification is Clang 17.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /software/setonix/2023.08/pawsey/software/rocm/gcc/12.2.0/rocm-5.7.0rev1/llvm/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /software/setonix/2023.08/pawsey/software/rocm/gcc/12.2.0/rocm-5.7.0rev1/llvm/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
fatal: Not a valid object name origin/HEAD
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.2")
-- Found LibElf: /usr/lib64/libelf.so
-- Performing Test ELF_GETSHDRSTRNDX
-- Performing Test ELF_GETSHDRSTRNDX - Success
-- Checking for module 'libdrm'
-- Found libdrm, version 2.4.111
-- Looking for __NR_memfd_create
-- Looking for __NR_memfd_create - found
-- Performing Test Terminfo_LINKABLE
-- Performing Test Terminfo_LINKABLE - Success
-- Found Terminfo: /usr/lib64/libtinfo.so
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.11")
-- Found zstd: /usr/lib64/libzstd.so
-- Found LibXml2: /usr/lib64/libxml2.so (found version "2.9.14")
Using CPACK_DEBIAN_PACKAGE_RELEASE local
RESULT_VARIABLE 0 OUTPUT_VARIABLE:
CPACK_RPM_PACKAGE_RELEASE: local%{?dist}
-- Configuring done
-- Generating done
@b-sumner Did device libs use to have __llvm_amdgcn_image_load_2d_v4f32_i32? Was it removed in ROCm 5.7?
Those functions were not removed, but their definitions were moved. It still appears something is not consistent in this build.
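One way to check whether the bitcode directory in use actually defines the missing symbols is to scan it with llvm-nm, which can read LLVM bitcode files. This is a rough sketch under assumptions: the function name find_defining_bitcode and the NM override are hypothetical, and the directory argument stands for whatever was passed as BITCODE_DIR.

```shell
# Symbol lister to use; point NM at the llvm-nm from the ROCm LLVM build.
NM=${NM:-llvm-nm}

# Print every *.bc file in a directory that defines the given symbol.
# Returns 0 if at least one file defines it, 1 otherwise.
find_defining_bitcode() {
    symbol=$1
    dir=$2
    status=1
    for bc in "$dir"/*.bc; do
        [ -e "$bc" ] || continue   # directory contains no .bc files
        # The symbol name is the last field of each llvm-nm output line.
        if "$NM" --defined-only "$bc" 2>/dev/null |
            awk -v s="$symbol" '$NF == s { f = 1 } END { exit !f }'; then
            printf '%s\n' "$bc"
            status=0
        fi
    done
    return $status
}

# Example use (commented out; substitute the real bitcode directory):
# find_defining_bitcode __llvm_amdgcn_image_load_2d_v4f32_i32 "$BITCODE_DIR"
```

If no file is reported, the device libs in that directory and the ROCR-Runtime sources being compiled most likely come from different releases.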
ROCm 5.7.1 does not seem to use these functions: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/docs/5.7.1/src/image/blit_src/imageblit_kernels.cl
Is the right ROCm branch being used?
This error is from builds of 5.7.0.
I did try 5.7.1 and got
Running command git clone -b rocm-5.7.1 https://github.com/ROCmSoftwarePlatform/hipRAND.git
Cloning into 'hipRAND'...
fatal: Remote branch rocm-5.7.1 not found in upstream origin
Error running a command: git clone -b rocm-5.7.1 https://github.com/ROCmSoftwarePlatform/hipRAND.git
from the commands
git clone -b rocm-${ROCM_VERSION} https://github.com/ROCmSoftwarePlatform/hipRAND.git
So I assume that from 5.7.1 and onwards there is no hipRAND?
Also, I note that (ignoring hipRAND) I get the same errors in 5.7.1 as in 5.7.0. So how can I get a consistent build with 5.7.x?
Note that since the errors have changed drastically, feel free to close this issue as I will raise another one.
It would be better to raise the missing hipRAND branch as an issue on the hipRAND GitHub repository. Since there is a ROCm 6.0 branch, I tend to think hipRAND will continue.
There are missing/incorrect tags for hipRAND. See https://github.com/ROCmSoftwarePlatform/hipRAND/issues/88 and https://github.com/ROCmSoftwarePlatform/hipRAND/issues/85.
I think the correct branch is this one: https://github.com/ROCmSoftwarePlatform/hipRAND/tree/release/rocm-rel-5.7. However, there's no documentation to be sure of this.
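Until the tags are fixed, a build script could check which branch names exist before cloning instead of assuming rocm-${ROCM_VERSION}. The sketch below is an assumption-laden workaround, not an official procedure: the function name first_existing_branch is hypothetical, and release/rocm-rel-5.7 is only the unconfirmed guess mentioned above. It reads the output of git ls-remote --heads on stdin and prints the first candidate that is present.

```shell
# Read `git ls-remote --heads <url>` output on stdin and print the first
# candidate branch name (given as arguments) that exists on the remote.
# Returns 1 if none of the candidates exist.
first_existing_branch() {
    refs=$(cat)
    for candidate in "$@"; do
        # Each ls-remote line is "<sha><TAB>refs/heads/<branch>".
        if printf '%s\n' "$refs" |
            awk -v want="refs/heads/$candidate" \
                '$2 == want { found = 1 } END { exit !found }'; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (commented out because it needs network access):
# url=https://github.com/ROCmSoftwarePlatform/hipRAND.git
# branch=$(git ls-remote --heads "$url" |
#     first_existing_branch "rocm-${ROCM_VERSION}" release/rocm-rel-5.7) &&
#     git clone -b "$branch" "$url"
```

Listing the candidates in order of preference keeps the original rocm-${ROCM_VERSION} convention wherever that branch does exist.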
Is it possible there are incorrectly-tagged releases for some components?
This appears to be the case for hipRAND (https://github.com/ROCmSoftwarePlatform/hipRAND/issues/85). Because of this, I am concerned this may be the case for other components. If so, that could explain an inconsistent build even if everything was checked out using the manifest file.
@dipietrantonio Has your issue been resolved? If so, please close the ticket. Thanks!
I am trying to build ROCm 5.6.1 from source and I hit the following error:
Here are my CMake command-line options.