Optimize ds read waitcnt for 1st iteration.

ROCm / hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond that of a traditional BLAS library.

https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/index.html

MIT License


Optimize ds read waitcnt for 1st iteration. #1132

Closed: hcman2 closed this 1 week ago

hcman2 commented 1 week ago

Previously, the 1st iteration always waited for all of the PLR (prefetch local read) ds reads to complete before continuing. This optimization splits that single wait into 2 waitcnt instructions, which relieves the pressure when the kernel is blocked on the 1st waitcnt.
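To illustrate the idea, here is a minimal Python sketch of how the two wait thresholds could be derived. It is not the code from this PR. It assumes that only ds reads are outstanding, that they complete in issue order, and that `s_waitcnt lgkmcnt(N)` blocks until at most N of them are still in flight. The names `Wait`, `threshold`, `single_wait`, `split_wait`, and `ready_cycle`, as well as the latency numbers, are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wait:
    lgkmcnt: int   # threshold N in `s_waitcnt lgkmcnt(N)`
    unblocks: str  # hypothetical label for the work allowed to run after the wait


def threshold(issued: int, needed: int) -> int:
    """Return N such that the first `needed` of `issued` in-order reads are done.

    Assumption: reads complete in issue order, so once at most
    `issued - needed` reads remain outstanding, the first `needed` have landed.
    """
    if not 0 <= needed <= issued:
        raise ValueError("needed must be between 0 and issued")
    return issued - needed


def single_wait(issued: int) -> list[Wait]:
    # Old behavior as described in the PR: one wait for every prefetch read.
    return [Wait(threshold(issued, issued), "all compute")]


def split_wait(issued: int, first_needed: int) -> list[Wait]:
    # New behavior as described in the PR: a partial wait, then a full wait.
    return [
        Wait(threshold(issued, first_needed), "first compute group"),
        Wait(threshold(issued, issued), "remaining compute"),
    ]


def ready_cycle(wait: Wait, issued: int, latency: int, interval: int) -> int:
    """Cycle at which `wait` is satisfied in a toy timing model.

    Toy model (an assumption, not hardware data): read i (0-based)
    completes at cycle `latency + i * interval`.
    """
    completed = issued - wait.lgkmcnt
    if completed == 0:
        return 0
    return latency + (completed - 1) * interval


if __name__ == "__main__":
    issued, first_needed = 8, 2  # made-up counts for illustration
    for w in single_wait(issued) + split_wait(issued, first_needed):
        cycle = ready_cycle(w, issued, latency=64, interval=4)
        print(f"s_waitcnt lgkmcnt({w.lgkmcnt}) -> {w.unblocks} at cycle {cycle}")
```

With these made-up numbers, the single wait is satisfied at cycle 92, while the first of the two split waits is satisfied at cycle 68, so the first compute group can start earlier. The actual benefit depends on the kernel and on how the reads are scheduled.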

hcman2 commented 1 week ago

This is just an example.

hcman2 commented 1 week ago

CI passed on gfx94x Ubuntu.

hcman2 commented 1 week ago

---- generated xml file: /meng/hcman/hipBLASLt/tensilelite/python_tests.xml ----
========== 23 passed, 83 skipped, 372 warnings in 2019.89s (0:33:39) ===========

Local gfx90a passed.