ROCm / hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library
https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/index.html
MIT License

enable LSU + Int8 #875

Closed nakajee closed 2 months ago

nakajee commented 3 months ago

I could not reproduce the CI test failure (lsu.yaml) in my local environment. The log showed that the error was related to the GSU post code for I8, but I did not enable GSU for I8 in lsu.yaml. I moved the LSU+I8 test cases into a new file (lsu_i8.yaml) and added GSU cases to see whether it fails again.

hcman2 commented 3 months ago

https://ontrack-internal.amd.com/browse/SWDEV-454948 It looks like there is a compiler bug for the I8II cases; the I8I8I case is OK. You could move the I8II tests into a separate yaml file and mark it "xfail" for now. The compiler fix is targeted for the rocm6.2 rc4 version.
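One way the "xfail" suggestion could look in a pytest-style harness is sketched below. This is not Tensile's actual test_config.py: the file name `lsu_i8ii.yaml`, the helper `make_params`, and the stand-in test body are all hypothetical, and only `pytest.param` and `pytest.mark.xfail` are real pytest APIs.

```python
import pytest

# Hypothetical config names; "lsu_i8ii.yaml" is an assumed name for the
# split-out I8II cases and does not come from the repository.
CONFIGS = ["lsu.yaml", "lsu_i8.yaml", "lsu_i8ii.yaml"]

# Configs expected to fail, mapped to the reason reported in this thread.
KNOWN_FAILURES = {
    "lsu_i8ii.yaml": "compiler bug for I8II cases (SWDEV-454948)",
}


def make_params(configs, known_failures):
    """Wrap each config in pytest.param, adding an xfail mark where needed."""
    params = []
    for name in configs:
        reason = known_failures.get(name)
        # strict=False: an unexpected pass is reported as XPASS, not a failure.
        marks = [pytest.mark.xfail(reason=reason, strict=False)] if reason else []
        params.append(pytest.param(name, marks=marks, id=name))
    return params


PARAMS = make_params(CONFIGS, KNOWN_FAILURES)


@pytest.mark.parametrize("config", PARAMS)
def test_config_sketch(config):
    # Stand-in body: a real harness would build and run the config here.
    assert config.endswith(".yaml")
```

With this layout the I8II cases stay in the suite and are reported as expected failures, so removing the entry from `KNOWN_FAILURES` is the only change needed once the compiler fix lands.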

nakajee commented 3 months ago

I added back the I8 LSU change and commented out the I8II test case. Hopefully everything is OK with this change.

nakajee commented 3 months ago

2 tests failed.
FAILED Tensile/Tests/common/test_config.py::test_config[Tensile/Tests/common/groupedgemm/grouped_gemm_userargs.yaml] - SystemExit: 1
FAILED Tensile/Tests/common/test_config.py::test_config[Tensile/Tests/common/sparse/spmm_i8_sb.yaml] - SystemExit: 1

I could not reproduce the failure locally. I checked the details of the failure: the device result is 0. I have seen similar random failures in Tensile CI before. I suspect a memory copy issue that is not completely solved in Tensile yet. It should not occur again when we re-run the CI test, so I will re-run it.
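For re-running only the failed cases locally, the node IDs can be pulled out of the pytest summary lines quoted above. The sketch below is a minimal illustration: the helper names are hypothetical, and the plain `python -m pytest <node id>` invocation is an assumption, since the CI may drive the tests through a different wrapper or with extra options.

```python
import re

# The two summary lines reported by CI, copied from this thread.
SUMMARY = (
    "FAILED Tensile/Tests/common/test_config.py::test_config"
    "[Tensile/Tests/common/groupedgemm/grouped_gemm_userargs.yaml] - SystemExit: 1 "
    "FAILED Tensile/Tests/common/test_config.py::test_config"
    "[Tensile/Tests/common/sparse/spmm_i8_sb.yaml] - SystemExit: 1"
)

# A node ID is the whitespace-free token between "FAILED " and " - ".
FAILED_RE = re.compile(r"FAILED (\S+) - ")


def failed_node_ids(summary):
    """Return the pytest node IDs of all FAILED entries in a summary string."""
    return FAILED_RE.findall(summary)


def rerun_command(node_ids):
    """Build an argument list that re-runs only the given node IDs."""
    return ["python", "-m", "pytest", *node_ids]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch has no side effects.
    print(" ".join(rerun_command(failed_node_ids(SUMMARY))))
```

Because the node IDs contain square brackets, they need quoting when pasted into a shell; passing the list to `subprocess.run` avoids that problem.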

nakajee commented 3 months ago

OK. The gfx942 precheckin test finally passed, including lsu_i8 (ignoring the known issue with rhel9).