Open aagit opened 1 week ago
You are running into the issue that ld can only link objects whose sections are at most a signed 32-bit offset away from each other. As you enable more targets, rocBLAS gets larger and eventually exceeds this limit. Yes, this is a huge problem with how ROCm is architected and it desperately needs some kind of resolution, but for now the only solution is to build for fewer targets.
If you want to remove an architecture, I would recommend gfx803, as that architecture is currently broken anyhow unless you disable the asm kernels provided by Tensile.
Thanks for the quick feedback.
Yes, if I built for fewer targets it would succeed, but I already removed gfx1103 because I've been building an older codebase where gfx1103 could not be enabled. Removing gfx803 would hide the problem and kick the can down the road, but it doesn't look like a satisfactory long-term solution.
If we don't work on a solution for this now, the end result is that every ROCm-accelerated app binary has to be built multiple times against independent, incompatible ROCm builds, just as if they were separate GPU compute stacks with nothing in common. That also multiplies the build time and disk space requirements of every app, maybe not by N, but close.
It would also provide a sub-par experience to the end user, who then has to figure out the right binary to install and invoke, instead of ROCm handling that GPU detail transparently.
Yup, this is the major reason why ROCm supports so few GPUs, and if they don't address it soon it has the potential to sink ROCm, since it forces them to drop support for old GPUs extremely fast (accelerating in pace as ROCm gets larger, even), which ultimately destroys customer confidence.
@cgmb I think you had some other suggestions about using generic targets, but I can't remember how much progress has happened there.
Thank you for bringing this issue to our attention. We appreciate your feedback and suggestions.
We recommend building with the suggested targets in relation to the ROCm stack. The default target list for 6.0 includes:
The team is aware of the issue and is exploring possible solutions.
Thank you for your understanding and cooperation.
> @cgmb I think you had some other suggestions by using generic targets, but I can't remember how much progress has happened there.
Sure, https://llvm.org/docs/AMDGPUUsage.html#amdgpu-generic-processor-table could be used, at the cost of some performance for the targets other than gfx10-3-generic. Ultimately this just kicks the can further down the road, but for now, yes, it would be sufficient.
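As a rough sketch, the generic-target route might look like the following at configure time. The variable name follows rocBLAS's usual CMake convention, but the exact target list is an assumption for illustration, not a tested configuration:

```shell
# Hypothetical sketch: collapse the per-chip target list into LLVM's
# generic AMDGPU processors (see the generic processor table linked above),
# trading some per-chip performance for a much smaller fat binary.
cmake -DAMDGPU_TARGETS="gfx9-generic;gfx10-1-generic;gfx10-3-generic;gfx11-generic" ..
```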
Right now there is also no support for ELFABIVERSION_AMDGPU_HSA_V6, so those targets don't work yet, but soon, I presume.
Would it be possible to split librocblas.so.4.0 into librocblas-gfx900.so.4.0, librocblas-gfx90a.so.4.0, librocblas-gfxXYZ.so.4.0, etc., so each individual gfx target lands in a different shared library, and then have the main librocblas.so.4.0 dynamically load only the gfx targets present in hardware, either during initialization of the main library or, even better, lazily on demand?
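To make the idea concrete, here is a minimal sketch of what such a split could look like. The per-target library names and the helper function are invented for illustration; a real design would key off the arch string reported by the HIP runtime:

```c
/* Hypothetical sketch of the per-gfx split proposed above: the main
 * librocblas.so would detect the device's gfx arch at init (or lazily,
 * on first use) and dlopen() only the matching per-target library.
 * The naming scheme and this helper are assumptions, not rocBLAS API. */
#include <dlfcn.h>
#include <stdio.h>

void *rocblas_load_target_lib(const char *gfx_arch) {
    char path[128];
    snprintf(path, sizeof path, "librocblas-%s.so.4.0", gfx_arch);
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        fprintf(stderr, "rocBLAS: no kernel library for %s: %s\n",
                gfx_arch, dlerror());
    return handle;  /* NULL if this gfx target was not packaged */
}
```

Each per-target library would stay far below the ±2 GiB relocation reach on its own, and unsupported targets would simply never be mapped into the process.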
@aagit That separate per-gfx .so design has been evaluated as one possible solution, but we are also looking at other strategies. For now, until the full list of gfx targets that lands in a specific release gets a new build and packaging pattern, we suggest you build and package the release-specific set of gfx targets listed in the top-level CMakeLists.txt. This corresponds to our build scripts' default option.
I appreciate your suggestion above. I agree it's the least bad solution for the time being, and I already gave it. If there are other ways to fix this, would you share them so they can be discussed here? Overall I would recommend picking the simplest fix and shipping it ASAP, because while working on a ROCm-accelerated app, I noticed that ROCm has already been packaged in the open by building it N times and installing it in incompatible paths. The technical justification is to work around this issue (so it's as if there were /opt/rocm1, /opt/rocm2, /opt/rocm3, ..., /opt/rocmN installed, each one supporting a small subset of gfx targets so that the link does not fail and gfx8 and gfx1103 can be enabled too). If the duplication were just on the ROCm side it would (perhaps) be a lesser concern, but this forces all apps to be rebuilt N times, multiplying the build time by N. Last but not least, the end user then has to pick the right binary (among the N available) for their GPU or it won't work, possibly just because of minor path differences.

For example: I built an app linked against ROCm that way, and the total size of the N builds against N ROCms was 96GB. Then I ran `hardlink .` and it dropped the size to 92GB. Then I ran `hardlink . -t -p` and it dropped the size to 32GB.

What I described in https://github.com/ROCm/rocBLAS/issues/1448#issuecomment-2186999993 is already happening. My view is that such a way of packaging ROCm is not sustainable even if the extra energy requirements of the build system could be met, because it provides a sub-par experience to the end user compared to the competing GPU compute stacks, where building an app once is enough. I already gave your suggestion above, of course, but it is now a matter of opinion whether the workaround is worse than the disease. So I don't see a clear path to unwind the ROCm build loop until this issue is fixed... Thanks!
Another temporary option, if you don't want to drop any GPUs from your builds, might be to build "gfx90a" or just "gfx90a:xnack-": the xnack+ configuration is very rare, and omitting it doesn't leave any user totally in the cold (just with possibly reduced performance, depending on workload), while "gfx90a" should emit code that works in both xnack+ and xnack- modes.
All gfx9 GPUs support xnack+; the fact that only gfx90a is built both ways is a clear hint as to how common this is.
We have changed to build our source kernels only with xnack "any" for gfx90a after commit 6a267fdd2bfa9c64c4f7b08bd36025c00da605b2. We expect to adjust our gfx list before release, and as always we ensure there are no linking issues on all supported OSes with any final target list. Other subdivisions of the library along functional lines are also possible, but none are trivial changes. Clang compiler and linker mcmodel flag changes are also possible with the current library design, along with the target variations mentioned in earlier comments.
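For reference, the mcmodel route mentioned above might look like the following. The `-mcmodel=` values are real Clang/GCC options (the medium and large x86-64 code models use relocations that are not limited to a ±2 GiB reach, at some cost in code size and speed), but whether rocBLAS builds and performs acceptably this way is an assumption to verify, not a tested configuration:

```shell
# Hypothetical sketch: relax the small code model's 32-bit relocation
# limit by compiling with a larger code model.
cmake -DCMAKE_CXX_FLAGS="-mcmodel=large" ..
```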
This bug should likely be considered fixed and the issue closed, as when you built rocBLAS with our supported gfx list you did not get the error. A new issue could be created, as your use case with N different ROCms is unclear to me: why is the app rebuilt and linked against all of them rather than built against the latest? If your application is open source, please refer to it in your new issue and detail why it is built separately for each gfx. Or, if this is really just a request to support more gfx targets, word it as such, along with your use case and gfx list. If you rebuilt ROCm or rocBLAS with one gfx in each version, please also clarify that in your new feature request. It could be that your new issue belongs in ROCm, if it is not particular to rocBLAS.
Describe the bug
Build fails during final shared lib linking.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Build should not fail.
Log-files
Environment
Should not matter, it is not a runtime issue.
Additional context
Although I don't see this reported among the GitHub issues, it should be a very well-known issue. So I wonder whether it is planned never to be fixed?
If the above assumption is correct, I would like to know whether upstream is willing to take in a fix for it, assuming a fix is possible.