ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

pytorch train FC hang on ROCm-4.5.0 with gfx803 #1218

Closed xuhuisheng closed 2 years ago

xuhuisheng commented 3 years ago

@cgmb Please help take a look at this bug. Thank you very much. If you don't have enough time, please give me some clues. I am really not familiar with GCN assembly. (T_T)

What is the expected behavior

What actually happens

How to reproduce

Environment

Hardware description

* GPU: RX580
* GPU chip ID: 0x67df
* CPU: E5-2620 v3

Software version

* ROCK: v4.5.0
* ROCR: v4.5.0
* HCC: v4.5.0
* Library: v4.5.0

workaround

If I delete library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml and rebuild rocBLAS, the problem is resolved. Here is my patched rocBLAS: https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm450/rocblas_2.41.0-337552f0.dirty_amd64.deb

etc

Here is my personal issue tracking this problem on ROCm-4.0.1: https://github.com/xuhuisheng/rocm-build/issues/4

Here is my build documentation for gfx803. Glad to see there is only one issue on ROCm-4.5.0 with gfx803: https://github.com/xuhuisheng/rocm-build/tree/master/gfx803

I tried some other scripts:

The MNIST example on TensorFlow with gfx803 shows NaN loss. The TensorFlow text-classification example shows Invalid argument: indices[5,284] = 997212422 is not in [0, 5001). https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/1172

I guess this problem is caused by the new Tensile assembly kernels: after I delete the r9nano-related Tensile yaml files, the fallback source GEMM implementations run properly. But I don't know how to reproduce it with plain C code, even though it affects PyTorch/TensorFlow.

cgmb commented 3 years ago

This appears to be caused by out-of-bounds writes in the gemm batched routines for gfx803. By deleting the Tensile yaml files for that architecture, you fall back to source kernels rather than using the assembly kernels. In my tests, the source kernels appear to have fewer failures, but I'm not sure that this entirely solves the problem. I still see failures when running rocblas-test --gtest_filter='*checkin*-*known_bug*' on my RX 570.

That's about all the investigation I have time for, but I'll organize and post my logs here later.

cgmb commented 3 years ago

You can see which kernels are run using the TENSILE_DB environment variable. I used it to determine exactly which Tensile kernels your example calls.

Using rocBLAS built with the r9nano yaml files:

Running kernel: Cijk_Ailk_Bljk_SB_MT16x16x16_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW2_GSU8_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WS64_WG8_8_4_WGM1
Running kernel: Cijk_Ailk_Bljk_SB_MT32x8x64_SN_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS0_FL0_GRVW2_GSU16_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT4_2_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WS64_WG8_4_8_WGM1

Using rocBLAS built without the r9nano yaml files:

Running kernel: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_AMAS0_BL0_GRVW1_GSU1_K1_LRVW1_NLCA1_NLCB1_PGR0_PLR0_SU32_SUS256_SVW4_TT8_4_USFGRO0_VAW1_VS1_VW1_WG16_16_1_WGM8
cgmb commented 3 years ago

These are my logs running rocblas-test on the RX 570 with and without the r9nano yaml files: gemm_test_logs.zip

Unfortunately, in both cases you see failures like these:

/root/rocBLAS/clients/gtest/../include/d_vector.hpp:118: Failure
Expected equality of these values:
  memcmp(host, guard, sizeof(guard))
    Which is: -112
  0
[  FAILED  ] pre_checkin/gemm.blas3_tensile/gemm_medium_f16_r_NN_4_4_4_5_4_4_0_4, where GetParam() = { M: 4, N: 4, K: 4, KL: 128, KU: 128, lda: 4, ldb: 4, ldc: 4, ldd: 4, a_type: f16_r, b_type: f16_r, c_type: f16_r, d_type: f16_r, compute_type: f16_r, incx: 0, incy: 0, incd: 0, incb: 0, alpha: 5.0, alphai: 0.0, beta: 0.0, betai: 0.0, transA: 'N', transB: 'N', side: '*', uplo: '*', diag: '*', batch_count: 1, HMM: false, threads: 0, streams: 0, devices: 0, stride_a: 16, stride_b: 16, stride_c: 16, stride_d: 16, stride_x: 0, stride_y: 0, fortran: false, norm_check: 0, unit_check: 1, timing: 0, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: 0, function: "gemm", name: "gemm_medium", category: "pre_checkin", initialization: "rand_int", known_bug_platforms: "", c_noalias_d: false, atomics_mode: atomics_allowed, user_allocated_workspace: 0 }

That failed check in d_vector.hpp is looking at the contents of the padding which is added to device allocations in the test code. If the library code changes the padding by writing beyond the boundaries of its requested allocations, that's the check where it will be caught.
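
For illustration, a minimal sketch of that guard-padding technique in plain HIP (the names here are illustrative, not the actual d_vector.hpp API):

```cpp
#include <hip/hip_runtime.h>
#include <cstring>

constexpr size_t kGuard = 4096;          // padding on each side of the payload
constexpr unsigned char kPattern = 0xAB; // known guard byte

// Allocate [guard | n floats | guard] and fill both guards with the pattern.
float* alloc_guarded(size_t n, void** base_out)
{
    void* base = nullptr;
    if (hipMalloc(&base, kGuard + n * sizeof(float) + kGuard) != hipSuccess)
        return nullptr;
    hipMemset(base, kPattern, kGuard);
    hipMemset(static_cast<char*>(base) + kGuard + n * sizeof(float), kPattern, kGuard);
    *base_out = base;
    return reinterpret_cast<float*>(static_cast<char*>(base) + kGuard);
}

// After the library call, copy the trailing guard back and compare it with
// the expected pattern; an out-of-bounds write past the payload is caught
// here, just like the failing memcmp(host, guard, sizeof(guard)) above.
bool guards_intact(void* base, size_t n)
{
    unsigned char host[kGuard], ref[kGuard];
    std::memset(ref, kPattern, sizeof(ref));
    hipMemcpy(host, static_cast<char*>(base) + kGuard + n * sizeof(float),
              sizeof(host), hipMemcpyDeviceToHost);
    return std::memcmp(host, ref, sizeof(host)) == 0;
}
```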

bragadeesh commented 2 years ago

Please note that gfx803 is not officially supported on ROCm. The best we can do at this point is to provide you with information (such as the above) to help you resolve these issues yourself.

xuhuisheng commented 2 years ago

Thanks for the help, @cgmb. I will test in this direction.

Thanks for the notice, @bragadeesh. I really appreciate your help.

xuhuisheng commented 2 years ago

@cgmb Sorry for the delayed response. I ran rocblas-test as you suggested, on ROCm-4.5.2 and ROCm-5.1.3. There are FAILED tests with the asm kernels, but when I deleted the r9nano yaml files, there were no FAILED tests. As mentioned before, my environment is an RX580 8G on ubuntu-20.04.4.

So I think deleting the r9nano-related yaml files may resolve the rocBLAS gfx803 issues. I will find time to dig into the asm problems. Thank you.

cgmb commented 2 years ago

I'm surprised. That's different from the results I got. I'd honestly given up after trying ROCm 5.0 and discovering that my gfx803 card didn't seem to work at all anymore. It was just today that I learned you now need to set ROC_ENABLE_PRE_VEGA=true to enable gfx803 in ROCclr.

In any case, my conclusion from this issue is that the complexity of Tensile limits the ability of the community to understand and debug rocBLAS. I've been asking about potentially having fallback implementations for rocBLAS functions when building without Tensile. That's something that has been desired by a number of different people for different reasons (e.g., https://github.com/spack/spack/pull/28846), and I think it would be helpful for cases like this.

xuhuisheng commented 2 years ago

@cgmb I did a few more tests using the rocBLAS gtest. Here is what I found.

My environment:

* ubuntu-20.04.4
* ROCm-5.1.3
* RX580

Although rocblas-test reported the rocBLAS version as 2.43.0.f0273f26-dirty, it actually used the system rocBLAS installed in /opt/rocm. Even after rebuilding the rocBLAS clients, if I don't install the patched rocBLAS with sudo dpkg -i rocblas*, the test always throws errors.

I think this is why our test results are different. Maybe you didn't reinstall rocBLAS to /opt/rocm, so you were running the asm kernels on gfx803 all along.

The next point is that the error is not stable. I ran rocblas-test --gtest_filter='*checkin*gemm_medium_f*_r_NN_*' five times; sometimes 8_8_8_5_8_8_0_8 reports errors, sometimes 10_10_10_5_10_10_0_10 does. The error message is not the same either: sometimes it comes from unit.hpp, sometimes from d_vector.hpp.

And if I run one test at a time, it runs properly: rocblas-test --gtest_filter='_/gemm.blas3_tensile/pre_checkin_gemm_medium_f16_r_NN_8_8_8_5_8_8_0_8'

It looks like some context is not cleared between tests and one test may affect the others with the asm kernels. Do you have any clue about this? Thank you very much.

The good news is that I can reproduce this issue with plain C code, without having to install TensorFlow. \^_^/

rocBLAS version: 2.43.0.f0273f26-dirty

Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 580 Series gfx803
with 8.6 GB memory, max. SCLK 1340 MHz, max. MCLK 2000 MHz, compute capability 8.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------
info: parsing of test data may take a couple minutes before any test output appears...

Note: Google Test filter = _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from _/gemm
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3 (13923 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3 (7 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 79
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 72
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 79
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 72
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:  (0 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3 (0 ms)
[----------] 4 tests from _/gemm (14045 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (14201 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:

 1 FAILED TEST
rocBLAS version: 2.43.0.f0273f26-dirty

command line: /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*
work@6d56e6a8963e:~/ROCm/rocBLAS$
LamEnder commented 2 years ago

> @cgmb Sorry for the delayed response. I ran rocblas-test as you suggested, on ROCm-4.5.2 and ROCm-5.1.3. There are FAILED tests with the asm kernels, but when I deleted the r9nano yaml files, there were no FAILED tests. As mentioned before, my environment is an RX580 8G on ubuntu-20.04.4.
>
> So I think deleting the r9nano-related yaml files may resolve the rocBLAS gfx803 issues. I will find time to dig into the asm problems. Thank you.

Cool! But is it possible to get a working TensorFlow or PyTorch installation from the current ROCm 5.1.3 toolchain? I once spent a whole day just getting TensorFlow compiled, and it failed spectacularly; that was quite some time ago, so I cannot provide the log.

I actually have a gfx803 in my gaming laptop, which further complicates the process of getting ROCm to run. If I remember correctly, the iGPU and dGPU use different memory management models, causing a segmentation fault if ROCm tries to access one GPU with the other's memory management model (e.g., accessing the dGPU in the iGPU memory management mode).

I would be grateful if @xuhuisheng could provide me some clues to get this working.

Currently I have a Manjaro installation, so I have to build all the ROCm components from source if I want to install them.

LamEnder commented 2 years ago

> @cgmb I did a few more tests using the rocBLAS gtest. Here is what I found. […]


Can you provide me with a guide to building TensorFlow from source for gfx803?

xuhuisheng commented 2 years ago

@LamEnder Building tensorflow-rocm is out of scope for this issue. I created a discussion on my repository; we can talk about TensorFlow there: https://github.com/xuhuisheng/rocm-build/discussions/29

xuhuisheng commented 2 years ago

@cgmb Sorry to bother you again, but I found an interesting thing.

Here is the simplest way I have found to reproduce this issue:

/home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_*

It matches 4 tests, and the third test always fails. If I run the third test individually, it passes.

work@6d56e6a8963e:~/ROCm/rocBLAS$ /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter='_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*'
rocBLAS version: 2.43.0.f0273f26-dirty

Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 580 Series gfx803
with 8.6 GB memory, max. SCLK 1340 MHz, max. MCLK 2000 MHz, compute capability 8.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------
info: parsing of test data may take a couple minutes before any test output appears...

Note: Google Test filter = _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from _/gemm
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 5
lda     : 3
ldb     : 3
h_beta  : 0
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 4 -3 9 -5 9 -1
10 2 9 10 4 7 6 3 7
-350 -480 -200 350 380 290 105 -10 75
-350 -480 -200 350 380 290 105 -10 75
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_5_3_3_0_3 (14142 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 0
lda     : 3
ldb     : 3
h_beta  : 3
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 9 -7 10 -4 10 -5
10 2 9 10 4 7 6 3 7
30 6 27 30 12 21 18 9 21
30 6 27 30 12 21 18 9 21
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_0_3_3_3_3 (3 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 1
lda     : 3
ldb     : 3
h_beta  : 3
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 9 -10 1 -8 4 -10
10 2 9 10 4 7 6 3 7
-40 -90 -13 100 88 79 39 7 36
-40 -90 -13 30 48 7 -80 -107 -45
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 30
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 100
/home/work/ROCm/rocBLAS/clients/gtest/../include/unit.hpp:130: Failure
Expected equality of these values:
  hCPU[i + j * size_t(lda) + k * 0]
    Which is: 30
  hGPU[i + j * size_t(lda) + k * 0]
    Which is: 100
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:  (1 ms)
[ RUN      ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3
transA  : N
transB  : N
M       : 3
N       : 3
K       : 3
h_alpha : 1
lda     : 3
ldb     : 3
h_beta  : 1
ldc     : 3
7 10 1 7 6 3 7 6 7
-9 6 -7 6 -7 3 -2 2 -5
10 2 9 10 4 7 6 3 7
-60 -94 -31 24 40 13 -29 -35 -24
-60 -94 -31 24 40 13 -29 -35 -24
[       OK ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_1_3 (1 ms)
[----------] 4 tests from _/gemm (14351 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (14494 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] _/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3, where GetParam() = { function: "gemm", name: "gemm_medium", category: "pre_checkin", known_bug_platforms: "", alpha: 1.0, alphai: 0.0, beta: 3.0, betai: 0.0, stride_a: 9, stride_b: 9, stride_c: 9, stride_d: 9, stride_x: 0, stride_y: 0, user_allocated_workspace: 0, M: 3, N: 3, K: 3, KL: 128, KU: 128, lda: 3, ldb: 3, ldc: 3, ldd: 3, incx: 0, incy: 0, incd: 0, incb: 0, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: 0, flags: none, a_type: f32_r, b_type: f32_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, initialization: "rand_int", atomics_mode: atomics_allowed, pad: 4096, threads: 0, streams: 0, devices:

 1 FAILED TEST
rocBLAS version: 2.43.0.f0273f26-dirty

command line: /home/work/ROCm/rocBLAS/build/release/clients/staging/rocblas-test --gtest_filter=_/gemm.blas3_tensile/pre_checkin_gemm_medium_f32_r_NN_3_3_3*

It looks stable: gtest always fails at the third test, pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3. I tried to calculate it by hand, and I don't know where the 100 comes from.

1 * (7 * 9 + 7 * -10 + 7 * 1) + 3 * 10 = 30
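
Spelling that out (assuming the printed matrices A, B, C are column-major with lda = 3): the reported mismatch is at element (i, j) = (0, 1), where

$$
d_{0,1} = \alpha \sum_{k=0}^{2} a_{0,k}\, b_{k,1} + \beta\, c_{0,1}
        = 1\cdot\bigl(7\cdot 9 + 7\cdot(-10) + 7\cdot 1\bigr) + 3\cdot 10 = 30,
$$

matching the CPU reference; the GPU's 100 does not correspond to any obvious combination of the printed inputs.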

I also tried copying the gtest code into a single .cpp source file (gtest needs to parse its test data first, which takes time), and the problem disappeared. I just don't know what is different between gtest and my cpp. Since the log shows the tests running one by one, I guess threads are not involved.

Here is my cpp test code: https://github.com/xuhuisheng/rocm-build/blob/master/check/src/test_rocblas3.cpp
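
For reference, a condensed sketch of that kind of standalone repro, assuming the public rocBLAS C API and using the failing f32 NN 3x3x3 alpha=1/beta=3 case with the matrices printed above (as noted, a standalone run may well pass, since the failure seems to require earlier GEMMs in the same process):

```cpp
#include <hip/hip_runtime.h>
#include <rocblas.h> // <rocblas/rocblas.h> on newer ROCm
#include <cstdio>
#include <vector>

int main()
{
    const int n = 3;
    const float alpha = 1.0f, beta = 3.0f;
    // Column-major 3x3 inputs, as printed by the failing test above.
    std::vector<float> hA = {7, 10, 1, 7, 6, 3, 7, 6, 7};
    std::vector<float> hB = {-9, 6, -7, 9, -10, 1, -8, 4, -10};
    std::vector<float> hC = {10, 2, 9, 10, 4, 7, 6, 3, 7};

    // CPU reference: C' = alpha*A*B + beta*C (column-major).
    std::vector<float> ref(9);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            float acc = 0;
            for (int k = 0; k < n; ++k)
                acc += hA[i + k * n] * hB[k + j * n];
            ref[i + j * n] = alpha * acc + beta * hC[i + j * n];
        }

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    float *dA, *dB, *dC;
    hipMalloc(&dA, 9 * sizeof(float));
    hipMalloc(&dB, 9 * sizeof(float));
    hipMalloc(&dC, 9 * sizeof(float));
    hipMemcpy(dA, hA.data(), 9 * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), 9 * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dC, hC.data(), 9 * sizeof(float), hipMemcpyHostToDevice);

    // A preceding beta=0 call stands in for the earlier tests in the suite,
    // since the failure only shows up when other GEMMs ran first.
    const float beta0 = 0.0f;
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta0, dC, n);
    hipMemcpy(dC, hC.data(), 9 * sizeof(float), hipMemcpyHostToDevice);

    // The call that produces the bad result in the gtest run.
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    std::vector<float> out(9);
    hipMemcpy(out.data(), dC, 9 * sizeof(float), hipMemcpyDeviceToHost);
    for (int i = 0; i < 9; ++i)
        if (out[i] != ref[i])
            std::printf("mismatch at %d: gpu=%g cpu=%g\n", i, out[i], ref[i]);

    rocblas_destroy_handle(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```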

cgmb commented 2 years ago

Thanks for the information. It will be a while before I can get to it, but I will take another look.

xuhuisheng commented 2 years ago

@cgmb Sorry to bother you again. It is interesting that if I use device_vector to manage db, as in device_vector<float> db(9, 1, false), I can reproduce the issue. If I use a raw float* db, the issue disappears.

with device_vector: https://github.com/xuhuisheng/rocm-build/blob/master/check/run-rocblas3.sh

without device_vector: https://github.com/xuhuisheng/rocm-build/blob/master/check/run-rocblas4.sh
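
A minimal sketch of the observable difference between the two paths, assuming (from the pad: 4096 field in GetParam and the d_vector.hpp guard check earlier) that the clients' device_vector places its payload between guard regions while a raw hipMalloc'd pointer does not:

```cpp
#include <hip/hip_runtime.h>

int main()
{
    // Raw path: the payload starts exactly at the hipMalloc'd address.
    float* db_raw = nullptr;
    hipMalloc(&db_raw, 9 * sizeof(float));

    // Guarded path (assumed behaviour of the clients' device_vector):
    // the payload sits 4096 bytes into a larger allocation, so its
    // address and its neighbouring memory differ from the raw case.
    void* base = nullptr;
    hipMalloc(&base, 4096 + 9 * sizeof(float) + 4096);
    float* db_guarded = reinterpret_cast<float*>(static_cast<char*>(base) + 4096);

    (void)db_guarded; // the kernels see two differently situated buffers
    hipFree(db_raw);
    hipFree(base);
    return 0;
}
```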

Keep digging.

LamEnder commented 1 year ago

@xuhuisheng @cgmb Since rocBLAS uses Tensile as its backend, I suggest we generate new kernels by running the Tensile benchmarks on gfx803.

Because the newly generated kernels would be built against the new Tensile codebase, I think that would fix some of the rocBLAS test failures.

What do you think?

This wiki link would be useful: https://github.com/ROCmSoftwarePlatform/Tensile/wiki/

P/S: This is just my speculation based on my own digging through both the rocBLAS and Tensile codebases; please correct me if I am wrong.

cgmb commented 1 year ago

That might help or it might not. I'm not sure.

To address this issue (and for a wide variety of other reasons), I've asked the rocBLAS team to provide fallback implementations that can work without Tensile. If nothing else, that would mean rocBLAS will at least be functional on all platforms that have HIP support in the driver, compiler, and runtime.

As a rocSOLVER developer, I would also appreciate the ability to build and test rocSOLVER without having to build the (incredibly slow to compile) Tensile kernels, to build and test rocSOLVER on platforms that Tensile does not yet support, and to debug by comparing results with and without Tensile.

So, it will be a useful feature for a number of reasons. I'm not sure when it fits into the schedule, but I put in the feature request at the start of June. It's in the backlog.

LamEnder commented 1 year ago

Yes, and with a generic, architecture-agnostic implementation, we could also eliminate the pain of dealing with rocBLAS's large binary size.

The Tensile kernels could be distributed separately and loaded dynamically at runtime when available, with the fallback implementation used when they are not.
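
A rough sketch of that dispatch pattern, using POSIX dlopen as a stand-in; every name below is hypothetical, not an actual rocBLAS API:

```cpp
#include <dlfcn.h>

// Hypothetical signature for a Tensile-backed SGEMM entry point.
using sgemm_fn = int (*)(int m, int n, int k,
                         const float* a, const float* b, float* c);

// Generic fallback, always compiled into the library (declaration only here).
int fallback_sgemm(int, int, int, const float*, const float*, float*);

sgemm_fn resolve_sgemm()
{
    // Try the separately shipped Tensile kernel package first...
    if (void* lib = dlopen("librocblas-tensile-kernels.so", RTLD_NOW | RTLD_LOCAL))
        if (auto fn = reinterpret_cast<sgemm_fn>(dlsym(lib, "tensile_sgemm")))
            return fn;
    // ...and fall back to the generic implementation when it's absent.
    return &fallback_sgemm;
}
```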

I'm looking forward to it btw.

Anyway, I don't see your feature request anywhere in the backlog; is it just me or is it hidden elsewhere?

cgmb commented 1 year ago

> The Tensile kernels could be distributed separately and loaded dynamically at runtime when available, with the fallback implementation used when they are not.

That would be nice. I had been considering a similar strategy for rocSOLVER's size-specialized kernels, but it hadn't occurred to me that it would be useful for rocBLAS too.

> I don't see your feature request anywhere in the backlog; is it just me or is it hidden elsewhere?

The development backlog is not publicly accessible. Unfortunately, it contains a mix of information that could reasonably be made public and information that must remain private.

wfjsw commented 7 months ago

> Since rocBLAS uses Tensile as its backend, I suggest we generate new kernels by running the Tensile benchmarks on gfx803.

I benchmarked new kernels on gfx906 (which exhibits a similar issue to this one), with no luck on the result.