ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

[Bug]: rocBLAS fails tests badly in FP16 for distro packages #1350

Closed littlewu2508 closed 2 months ago

littlewu2508 commented 1 year ago

Describe the bug

The distro build of rocBLAS-5.6.0 (compiled with upstream llvm-16) fails many FP16-related tests. This is seen on both MI210 and Radeon VII. Details are in the gzipped test logs:

MI210-test.log.gz RadeonVII-test.log.gz

The build logs are also attached: MI210-build.log.gz RadeonVII-build.log.gz

rkamd commented 1 year ago

@littlewu2508, could you provide the missing information, such as the build log and environment.txt, so we can investigate the issue further? Please refer to the bug template here.

littlewu2508 commented 1 year ago

To Reproduce

This result comes from running src_test in the Gentoo sci-libs/rocBLAS-5.6.0 package. Currently the package is in this test branch.

On a Gentoo system, you can replace the default repo with this experimental branch, then build and test rocBLAS:

cd /var/db/repos
mv gentoo{,.bak}
git clone -b rocm-5.6 https://github.com/littlewu2508/gentoo.git 
echo 'ACCEPT_KEYWORDS="~amd64"' > /etc/portage/make.conf
mkdir -p /etc/portage/env /etc/portage/package.use
echo 'FEATURES=test' > /etc/portage/env/test.conf
echo 'sci-libs/rocBLAS test.conf' >> /etc/portage/package.env
emerge "=sci-libs/rocBLAS-5.6.0"

Expected behavior

All tests pass.

Log-files

The complete build and test logs are: MI210-test.log.gz MI210-build.log.gz RadeonVII-build.log.gz RadeonVII-test.log.gz

Environment

There are two environments

MI210

Hardware      Description
CPU           AMD EPYC 7763
GPU           AMD Instinct MI210

Software      Version
kernel        Debian 6.1.27-1 (2023-05-08) x86_64
llvm/clang    Gentoo 16.0.6
rocm-core     Gentoo rocm-5.6.0
rocblas       Gentoo rocm-5.6.0

MI210-environment.txt

Radeon VII

Hardware      Description
CPU           AMD Ryzen 7 5800X
GPU           AMD Radeon VII

Software      Version
kernel        Linux 6.3.2
llvm/clang    Gentoo 16.0.6
rocm-core     Gentoo rocm-5.6.0
rocblas       Gentoo rocm-5.6.0

RadeonVII-environment.txt

rkamd commented 1 year ago

@littlewu2508, I tried to follow the steps you provided to reproduce the issue in a Gentoo environment, but I was unable to compile rocBLAS because of the following error: (masked by: ~amd64 keyword)

I tried some steps to unmask it, but with no luck. I am not very familiar with the Gentoo environment. Any pointers on how to proceed?

I was not able to reproduce this issue using ROCm 5.6 on Ubuntu.

littlewu2508 commented 1 year ago

> @littlewu2508, I tried to follow the steps you provided to reproduce the issue in a Gentoo environment, but I was unable to compile rocBLAS because of the following error: (masked by: ~amd64 keyword)

Sorry, I made a mistake in the reproduction steps. Try adding ACCEPT_KEYWORDS="amd64" to the echo 'ACCEPT_KEYWORDS="~amd64"' > /etc/portage/make.conf step.

> I tried some steps to unmask it, but with no luck. I am not very familiar with the Gentoo environment. Any pointers on how to proceed?
>
> I was not able to reproduce this issue using ROCm 5.6 on Ubuntu.

If you're using the official ROCm stack shipped by repo.radeon.com with an upstream kernel installed, then you shouldn't encounter this issue. I cannot reproduce it either on Debian 12 with the .deb packages from repo.radeon.com installed. So I guess it is Gentoo's use of upstream LLVM that causes the discrepancies.

rkamd commented 1 year ago

@littlewu2508, thanks for the updated steps; I will try to reproduce. I had a discussion internally with the ROCm team, and we are guessing it could be an ABI mismatch causing the half-precision tests to fail.

Would you be able to try some of the suggestions from the ROCm team in rocFFT issue #439?

To reproduce the error, you could use the sample program provided here in the Gentoo environment.

And maybe you could try this suggestion to verify whether it resolves the issue.

littlewu2508 commented 1 year ago

> @littlewu2508, thanks for the updated steps; I will try to reproduce. I had a discussion internally with the ROCm team, and we are guessing it could be an ABI mismatch causing the half-precision tests to fail.
>
> Would you be able to try some of the suggestions from the ROCm team in rocFFT issue #439?
>
> To reproduce the error, you could use the sample program provided here in the Gentoo environment.
>
> And maybe you could try this suggestion to verify whether it resolves the issue.

Thank you very much for these suggestions. I have also reproduced the float16.cpp issue; only -O3 generates sensible output. I will keep tracking https://github.com/ROCmSoftwarePlatform/rocFFT/issues/439
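
When judging whether a half-precision test program prints sensible values, it can help to have reference roundings computed independently of the compiler under test. The following is a minimal sketch in Python, using the `e` (IEEE 754 binary16) format of the standard `struct` module. The helper names are hypothetical, and this is not the float16.cpp sample from the rocFFT issue; it only shows what a correct float -> half -> float round trip should produce.

```python
import struct


def float_to_half_bits(x: float) -> int:
    """Round x to IEEE 754 binary16 and return the raw 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_bits_to_float(bits: int) -> float:
    """Interpret a raw 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def round_trip(x: float) -> float:
    """Value obtained after converting x to half precision and back."""
    return half_bits_to_float(float_to_half_bits(x))


if __name__ == "__main__":
    # Print reference values to compare against a compiled test program.
    for value in (1.0, 1.5, -2.0, 0.1):
        bits = float_to_half_bits(value)
        print(f"{value!r}: bits=0x{bits:04x} round_trip={round_trip(value)!r}")
```

Values that are exactly representable in half precision (such as 1.0 or 1.5) should come back unchanged, while a value like 0.1 comes back slightly rounded. Output from a compiled program that differs wildly from these references at some optimization levels would be consistent with the conversion problem discussed above, though it does not by itself identify the cause.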

rkamd commented 1 year ago

@littlewu2508, the Fedora fix for half precision is below: https://src.fedoraproject.org/fork/tstellar/rpms/compiler-rt/blob/0459cbc5d9eb15f1ad51d74707b4988049183708/f/0001-compiler-rt-Fix-FLOAT16-feature-detection.patch

littlewu2508 commented 1 year ago

> @littlewu2508, the Fedora fix for half precision is below: https://src.fedoraproject.org/fork/tstellar/rpms/compiler-rt/blob/0459cbc5d9eb15f1ad51d74707b4988049183708/f/0001-compiler-rt-Fix-FLOAT16-feature-detection.patch

Thank you! Is this patch submitted to llvm-project upstream?

rkamd commented 2 months ago

@littlewu2508, do you still need any assistance from rocBLAS? If not, please feel free to close this ticket.