ROCm / hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library
https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/index.html
MIT License
63 stars, 89 forks

Enable test option for stream-k full tile or remainder tile #1249

Closed. AlexBrownAMD closed this 3 weeks ago

AlexBrownAMD commented 1 month ago

Re-enable the environment variable that can be used to test the number of stream-k tiles that get run in a stream-k kernel.

The default (in the 2-tile algorithm) is to run all remainder tiles plus 1 extra tile per workgroup as stream-k tiles. This ensures that cases like remainder=1 are not split across a large number of workgroups. But when the number of remainder tiles is large (e.g., remainder = WGs - 1), it is generally faster to run only the remainder tiles as stream-k tiles rather than remainder + #WGs. This way more tiles run in data-parallel and memory alignment improves.
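To make the tile arithmetic above concrete, here is a minimal sketch in Python. It is not code from hipBLASLt or Tensile: the function name `split_tiles`, the `full_tiles` parameter, and the formula (remainder = total tiles modulo #WGs, stream-k tiles = remainder + full_tiles * #WGs, capped at the total) are assumptions inferred from the description. The description also does not say what happens when the remainder is 0, so the sketch should not be trusted for that case.

```python
from typing import Tuple


def split_tiles(total_tiles: int, num_wgs: int, full_tiles: int = 1) -> Tuple[int, int]:
    """Return (stream_k_tiles, data_parallel_tiles) for a 2-tile style split.

    Hypothetical model, assumed from the PR description:
      remainder      = total_tiles % num_wgs
      stream_k_tiles = remainder + full_tiles * num_wgs, capped at total_tiles
    """
    if total_tiles < 0 or num_wgs <= 0 or full_tiles < 0:
        raise ValueError("total_tiles and full_tiles must be >= 0, num_wgs must be > 0")
    remainder = total_tiles % num_wgs
    stream_k_tiles = min(total_tiles, remainder + full_tiles * num_wgs)
    return stream_k_tiles, total_tiles - stream_k_tiles


if __name__ == "__main__":
    # remainder = WGs - 1: 31 tiles on 8 workgroups leaves 7 remainder tiles.
    print(split_tiles(31, 8, full_tiles=1))  # default: remainder + #WGs
    print(split_tiles(31, 8, full_tiles=0))  # remainder only
    # remainder = 1: 25 tiles on 8 workgroups.
    print(split_tiles(25, 8, full_tiles=1))
    print(split_tiles(25, 8, full_tiles=0))
```

Under these assumptions, the remainder-only setting moves 8 more tiles into the data-parallel portion in the 31-tile case, while in the 25-tile case it would leave a single stream-k tile to be split across all 8 workgroups, which is the situation the default is meant to avoid.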

This change adds an environment variable that allows the user to override the number of stream-k tiles selected. I developed a prediction model to select the best setting automatically, but that still needs to be ported from Tensile in a future change (this PR is just a first step to allow some experiments to run).

TENSILE_STREAMK_FULL_TILES=0 (runs remainder only)
TENSILE_STREAMK_FULL_TILES=1 (default, runs remainder + #WGs)
TENSILE_STREAMK_FULL_TILES=[large number] (can be used to make a 2-tile kernel double as the basic algorithm, i.e., StreamK=1)
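For experiment scripts, the settings above could be handled with a small helper like the following Python sketch. Only the variable name `TENSILE_STREAMK_FULL_TILES` and its default of 1 come from this PR; the helper names `full_tiles_from_env` and `sweep_envs` are hypothetical, and how the library itself treats empty, negative, or non-numeric values is not described here, so the validation below is an assumption.

```python
import os
from typing import Dict, Iterable, List, Mapping, Optional

ENV_VAR = "TENSILE_STREAMK_FULL_TILES"
DEFAULT_FULL_TILES = 1  # default stated in the PR description


def full_tiles_from_env(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the override from an environment mapping, falling back to the default.

    Assumption: unset or empty means "use the default"; negative values are rejected.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_FULL_TILES
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{ENV_VAR} must be >= 0, got {value}")
    return value


def sweep_envs(values: Iterable[int],
               base: Optional[Mapping[str, str]] = None) -> List[Dict[str, str]]:
    """Build one environment dict per setting, e.g. to pass as subprocess env=..."""
    base_env = dict(os.environ if base is None else base)
    return [{**base_env, ENV_VAR: str(v)} for v in values]


if __name__ == "__main__":
    # Compare remainder-only (0), the default (1), and a large value.
    for env in sweep_envs([0, 1, 1_000_000], base={}):
        print(env[ENV_VAR], "->", full_tiles_from_env(env))
```

Each dict returned by `sweep_envs` can be passed as the `env` argument of `subprocess.run` when launching whichever benchmark is being compared.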

AlexBrownAMD commented 1 month ago

> Should we modify the description of TENSILE_STREAMK_FULL_TILES in Common.py?

Good call, just posted an update with how it currently works in the hipBLASLt repo.