Open aagit opened 1 week ago
You are running into the issue that ld can only link objects whose sections are at most a signed 32-bit offset away from each other. As you enable more targets, rocBLAS gets larger and eventually exceeds this limit. Yes, this is a huge problem with how ROCm is architected and it desperately needs some kind of resolution, but for now the only solution is to build for fewer targets.
If you want to remove an architecture, I would recommend gfx803, as that architecture is currently broken anyhow unless you disable the asm kernels provided by Tensile.
Thanks for the quick feedback.
Yes, if I built for fewer targets it would succeed, but I already removed gfx1103 because I've been building an older codebase where gfx1103 could not be enabled. Removing gfx803 would hide the problem and kick the can down the road, but it doesn't look like a satisfactory long-term solution.
If we don't work on a solution for this now, the end result is that every ROCm-accelerated app binary has to be built multiple times against independent, incompatible ROCm builds, just as if they were separate GPU compute stacks with nothing in common. That also multiplies the build time and disk space requirements of every app, maybe not by N, but close.
It would also provide a sub-par experience to the end user, who then has to figure out the right binary to install and invoke, instead of ROCm handling that GPU detail transparently.
Yup, this is the major reason why ROCm supports so few GPUs, and if they don't address it soon it has the potential to sink ROCm, since it forces them to drop support for old GPUs extremely fast (accelerating in pace as ROCm gets larger, even), which ultimately destroys customer confidence.
@cgmb I think you had some other suggestions about using generic targets, but I can't remember how much progress has happened there.
Thank you for bringing this issue to our attention. We appreciate your feedback and suggestions.
We recommend building with the suggested targets in relation to the ROCm stack. The default target list for 6.0 includes:
The team is aware of the issue and is exploring possible solutions.
Thank you for your understanding and cooperation.
> @cgmb I think you had some other suggestions by using generic targets, but I can't remember how much progress has happened there.
Sure, https://llvm.org/docs/AMDGPUUsage.html#amdgpu-generic-processor-table could be used, at the cost of some performance for the targets other than gfx10-3-generic. Ultimately this just kicks the can further down the road, but for now, yes, it would be sufficient.
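As a rough sketch, the generic-target route might look like the following at configure time. The variable name follows rocBLAS's usual CMake convention, but the exact target list is an assumption for illustration, not a tested configuration:

```shell
# Hypothetical sketch: collapse the per-chip target list into LLVM's
# generic AMDGPU processors (see the generic processor table linked above),
# trading some per-chip performance for a much smaller fat binary.
cmake -DAMDGPU_TARGETS="gfx9-generic;gfx10-1-generic;gfx10-3-generic;gfx11-generic" ..
```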
Right now there is also no support for ELFABIVERSION_AMDGPU_HSA_V6, so those targets don't work yet, but soon, I presume.
Would it be possible to split librocblas.so.4.0 into librocblas-gfx900.so.4.0, librocblas-gfx90a.so.4.0, librocblas-gfxXYZ.so.4.0, etc., so each individual gfx target lands in a different shared library, and then have the main librocblas.so.4.0 dynamically load only the gfx targets present in hardware, either during initialization of the main library or, even better, lazily on demand?
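To make the idea concrete, here is a minimal sketch of what such a split could look like. The per-target library names and the helper function are invented for illustration; a real design would key off the arch string reported by the HIP runtime:

```c
/* Hypothetical sketch of the per-gfx split proposed above: the main
 * librocblas.so would detect the device's gfx arch at init (or lazily,
 * on first use) and dlopen() only the matching per-target library.
 * The naming scheme and this helper are assumptions, not rocBLAS API. */
#include <dlfcn.h>
#include <stdio.h>

void *rocblas_load_target_lib(const char *gfx_arch) {
    char path[128];
    snprintf(path, sizeof path, "librocblas-%s.so.4.0", gfx_arch);
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        fprintf(stderr, "rocBLAS: no kernel library for %s: %s\n",
                gfx_arch, dlerror());
    return handle;  /* NULL if this gfx target was not packaged */
}
```

Each per-target library would stay far below the ±2 GiB relocation reach on its own, and unsupported targets would simply never be mapped into the process.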
@aagit That separate per-gfx .so design has been evaluated as one possible solution, but we are also looking at other strategies. For now, until the full list of gfx targets that lands in a specific release gets a new build and packaging pattern, we suggest you build and package the release-specific set of gfx targets listed in the top-level CMakeLists.txt. This corresponds to our build scripts' default option.
I appreciate your suggestion above. I agree it's the least bad solution for the time being, and I already gave it. If there are other ways to fix this, would you share them so they can be discussed here? Overall I would recommend picking the simplest fix and shipping it ASAP, because while working on a ROCm-accelerated app, I noticed that ROCm has already been packaged in the open by building it N times and installing it in incompatible paths. The technical justification is to work around this issue (so it's as if there were /opt/rocm1, /opt/rocm2, /opt/rocm3, ..., /opt/rocmN installed, each one supporting a small subset of gfx targets so that the link does not fail and gfx8 and gfx1103 can be enabled too). If the duplication were just on the ROCm side it would (perhaps) be a lesser concern, but this forces all apps to be rebuilt N times, multiplying the build time by N. Last but not least, the end user then has to pick the right binary (among the N available) for their GPU or it won't work, possibly just because of minor path differences.

For example: I built an app linked against ROCm that way, and the total size of the N builds against N ROCms was 96GB. Then I ran `hardlink .` and it dropped the size to 92GB. Then I ran `hardlink . -t -p` and it dropped the size to 32GB.

What I described in https://github.com/ROCm/rocBLAS/issues/1448#issuecomment-2186999993 is already happening. My view is that such a way of packaging ROCm is not sustainable even if the extra energy requirements of the build system could be met, because it provides a sub-par experience to the end user compared to the competing GPU compute stacks, where building an app once is enough. I already gave your suggestion above, of course, but it is now a matter of opinion whether the workaround is worse than the disease. So I don't see a clear path to unwind the ROCm build loop until this issue is fixed... Thanks!
Another temporary option, if you don't want to drop any GPUs from your builds, might be to build "gfx90a" or just "gfx90a:xnack-": the xnack+ configuration is very rare, and omitting it doesn't leave any user totally in the cold (just with possibly reduced performance, depending on workload), while "gfx90a" should emit code that works in both xnack+ and xnack- modes.
All gfx9 GPUs support xnack+; the fact that only gfx90a is built both ways is a clear hint as to how common this is.
We have changed to build our source kernels only with xnack "any" for gfx90a after commit 6a267fdd2bfa9c64c4f7b08bd36025c00da605b2. We expect to adjust our gfx list before release, and as always we ensure there are no linking issues on all supported OSes with any final target list. Other subdivisions of the library along functional lines are also possible, but none are trivial changes. Clang compiler and linker mcmodel flag changes are also possible with the current library design, along with the target variations mentioned in earlier comments.
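For reference, the mcmodel route mentioned above might look like the following. The `-mcmodel=` values are real Clang/GCC options (the medium and large x86-64 code models use relocations that are not limited to a ±2 GiB reach, at some cost in code size and speed), but whether rocBLAS builds and performs acceptably this way is an assumption to verify, not a tested configuration:

```shell
# Hypothetical sketch: relax the small code model's 32-bit relocation
# limit by compiling with a larger code model.
cmake -DCMAKE_CXX_FLAGS="-mcmodel=large" ..
```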
This bug should likely be considered fixed and the issue closed, as when you built rocBLAS with our supported gfx list you did not get the error. A new issue could be created, as your use case with N different ROCms is unclear to me: why is the app rebuilt and linked against all of them rather than built against the latest? If your application is open source, please refer to it in your new issue and detail why it is built separately for each gfx. Or, if this is really just a request to support more gfx targets, word it as such, along with your use case and gfx list. If you rebuilt ROCm or rocBLAS with one gfx in each version, please also clarify that in your new feature request. It could be that your new issue belongs in ROCm, if it is not particular to rocBLAS.
Describe the bug
Build fails during final shared lib linking.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Build should not fail.
Log-files
Environment
Should not matter, it is not a runtime issue.
Additional context
Although I don't see this reported among the GitHub issues, it should be a very well-known issue. So I wonder whether it is planned never to be fixed?
If the above assumption is correct, I would like to know whether upstream is willing to take in a fix for it, assuming a fix is possible.