Closed opcod3 closed 6 months ago
I just did some more tests and the issue can be reproduced in the following docker images:
rocm/dev-ubuntu-22.04:5.5.1-complete
rocm/dev-ubuntu-22.04:5.4.2-complete
rocm/dev-ubuntu-22.04:5.3-complete
I did not test any other versions, but I assume the bug is present in all versions since at least ROCm 5.3.
I can confirm that this issue exists when the example above is executed with any combination of MI25, MI50 and rx6800xt, but does not exist (as expected) when only two MI50s are present.
Building rocBLAS without Tensile (BUILD_WITH_TENSILE=OFF) appears to fix the issue.
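For anyone wanting to try this build-time workaround, here is a minimal sketch of the configure step. It assumes a standard out-of-source CMake build and the usual /opt/rocm install location; the paths and source-tree layout are illustrative, not the official build flow (rocBLAS also ships an install.sh wrapper).

```shell
# Sketch: configure rocBLAS with the Tensile backend disabled.
# This falls back to the slow reference kernels, but sidesteps the
# code-object mix-up described in this thread. Paths are illustrative.
cmake -S rocBLAS -B rocBLAS/build \
      -DBUILD_WITH_TENSILE=OFF \
      -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc
cmake --build rocBLAS/build -- -j"$(nproc)"
```

Note the trade-off: without Tensile there are no architecture-specific GEMM code objects at all, which is presumably why the multi-architecture bug cannot trigger.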
I also have this issue with a 6800xt and a Vega64
And I also experience similar issues when using multi-GPU Torch with ROCm. I have a collection of my errors and debugging notes for the Torch experience here: https://rentry.org/tcahd
Doing some more troubleshooting: apparently calling rocblas_initialize() before using any other functions fixes the issue.
This is true for AMD, but I've had people report that it will break hipBLAS usage on NVIDIA and Intel GPUs, since it's a call into rocBLAS rather than a HIP function.
It's also an optional call; this is still a serious bug.
Thanks for reporting the issue. We are currently investigating the issue and will provide an update as soon as possible.
rocblas_initialize() does load all the Tensile code objects (for all supported GFX ISA targets), hence the results are as expected when rocblas_initialize() is used.
I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup with heterogeneous architectures in one system.
@opcod3 , Thanks for bringing this to our notice. A fix has been merged and should be available in a future release. rocBLAS Commit ID: https://github.com/ROCmSoftwarePlatform/rocBLAS/commit/bc4d8f57ec6b3b2c91c4eaa5351bcc35ced66d52 Tensile Commit ID: https://github.com/ROCmSoftwarePlatform/Tensile/commit/24d54d7644bd20e6855aa94a1262aae1d8269767
@IMbackK , Can you please use the workaround above for ROCm 5.7? A fix has been implemented and it should be in the next ROCm release.
Please note that this page explains the ROCm roadmap and current versions:
https://github.com/RadeonOpenCompute/ROCm/releases
Note that 5.7 is the last release in the 5.x series according to their roadmap, and 6.0 may not be compatible with the 5.x versions. Since this is a simple fix to an existing bug, and there may not be a 5.7.1 release given the roadmap, I'd suggest that it be added to the 5.7 version to minimize the wait.
@rkamd I can recompile rocBLAS/Tensile with the patch; that is not the issue. I am merely worried about ROCm stability policies, as it appears from the outside that there is no internal mechanism to block a release when a serious issue is found. I don't see what issue besides "silently returns incorrect results for every operation on a supported platform 100% of the time" could possibly be more serious in the world of scientific compute.
I also concur with @opcod3 that it is worrying that the ROCm runtime does not throw an error when a kernel launch fails due to the architecture being wrong, but instead silently continues with garbage data and only logs this as a warning. In my opinion, a failed kernel launch of this kind should trigger an assert. Please confirm whether or not you have raised this problem as a bug internally, as otherwise I would like to file a bug against the runtime.
I would also respectfully request that a system with heterogeneous architecture is included in internal conformance testing, if such a system is not available already.
That said, thank you for fixing this issue and for including the unsupported legacy platforms in the fix; your efforts (and AMD's in general) in providing an open source compute platform are much appreciated. Indeed, great progress has been made in this direction in recent years.
I'd like to report that this issue appears resolved for me at this time! Here's the guide I wrote with the instructions I used to get it working: https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md
First of all, this is the wrong report for this, but here is a clear and concise description of what the problem is. While building rocBLAS I hit the following:

OS detected is ubuntu
/usr/bin/python3.8 -m venv /root/workspace/rocBLAS/build/virtualenv --system-site-packages --clear
The virtual environment was not created successfully because ensurepip is not available. On Debian/Ubuntu systems, you need to install the python3-venv package using the following command.
    apt install python3.8-venv
You may need to use sudo with that command. After installing the python3-venv package, recreate your virtual environment.
Failing command: ['/root/workspace/rocBLAS/build/virtualenv/bin/python3.8', '-Im', 'ensurepip', '--upgrade', '--default-pip']
CMake Error at cmake/virtualenv.cmake:23 (message): 1
Call Stack (most recent call first):
    cmake/virtualenv.cmake:49 (virtualenv_create)
    CMakeLists.txt:139 (virtualenv_install)
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
CMake Error at cmake/virtualenv.cmake:68 (message): 1
Call Stack (most recent call first):
    CMakeLists.txt:139 (virtualenv_install)

Then I ran pip install --upgrade setuptools:

Installing collected packages: setuptools
Attempting uninstall: setuptools
    Found existing installation: setuptools 69.0.2
    Uninstalling setuptools-69.0.2:
        Successfully uninstalled setuptools-69.0.2
Successfully installed setuptools-69.0.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

But the same error occurs again:

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
CMake Error at cmake/virtualenv.cmake:68 (message): 1
Call Stack (most recent call first):
    CMakeLists.txt:139 (virtualenv_install)

Could you help me to solve this problem? Thank you very much!
@xiaobo1025 please don't spam this bug with unrelated issues.
@rkamd I can confirm this seems to be fixed in 6.0.
@IMbackK , Thanks for verifying.
Closing this issue.
Describe the bug
rocBLAS returns incorrect results when used on two GPUs with different architectures.
This issue was first encountered in turboderp/exllama#173, while the provided reproduction code is based on rocBLAS-Examples.
When using rocBLAS and performing computations on two GPUs with different architectures, the first computation on each card will be correct, while any subsequent ones performed on the first card will be incorrect.
To Reproduce
Steps to reproduce the behavior:
1. Ensure the current system has at least two GPUs and that the architecture of GPU0 is different from GPU1.
2. Install ROCm and rocBLAS v5.6.0 (the bug is also present on 5.5.1, possibly earlier as well).
3. Run make to compile the example code (bug-report.zip).
4. Run ./gemm
5. Observe how the first two calculations pass while all the subsequent ones that execute on GPU0 fail.
Expected behavior
It is expected that all calculations complete correctly.
Log-files
Running AMD_LOG_LEVEL=2 ./gemm produces the following log. I believe the key log entries are the following:
Environment
environment.txt
This has also been reproduced in the rocm/dev-ubuntu-22.04:5.5.1-complete docker container.
Additional context
According to other users in turboderp/exllama#173, the issue also occurs between Mi25 and Mi50 cards. I can also report that it occurs between any combination of the two cards I listed above and a 7900XTX.
Inverting the order of the computations (running a calculation on GPU1 first and then on GPU0) results in the same exact behavior, but with the failing card being GPU1 instead of GPU0 as before.
From looking at more logs and rocBLAS internals, I believe the error is related to the Tensile library. The behavior encountered seems to indicate that when a second .hsaco file is loaded, it somehow overrides the original one that has the correct architecture for the first card. I am unsure if this is an issue in Tensile itself or in the way rocBLAS uses it.
In my opinion, attempting to execute a kernel with an incorrect architecture should produce a crash or an error, instead of carrying on as normal and returning incorrect results.