hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond that of a traditional BLAS library.
Re-enable the environment variable used to test how many stream-k tiles are run in a stream-k kernel.
By default, the 2-tile algorithm runs all remainder tiles plus 1 extra tile per workgroup. This ensures that cases like remainder=1 are not split across a large number of workgroups. But when the number of remainder tiles is large (i.e. remainder = WGs - 1), it is generally faster to run only the remainder tiles rather than remainder + #WGs. That way more tiles run data-parallel and memory alignment improves.
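As a rough illustration of the trade-off above, here is a minimal Python sketch of how the tile split could be modelled. It is not the Tensile or hipBLASLt implementation: the function name `split_tiles`, its parameters, and the clamping rule are assumptions made for illustration, and the sketch does not special-case remainder = 0.

```python
def split_tiles(total_tiles: int, num_wgs: int, full_tiles: int = 1):
    """Return (stream_k_tiles, data_parallel_tiles) for a 2-tile style split.

    Hypothetical model, not the library's code:
      remainder      = total_tiles % num_wgs
      stream_k_tiles = remainder + full_tiles * num_wgs, clamped to total_tiles
    """
    if total_tiles < 0 or num_wgs <= 0 or full_tiles < 0:
        raise ValueError("expected total_tiles >= 0, num_wgs > 0, full_tiles >= 0")
    remainder = total_tiles % num_wgs
    # Assumption: each unit of full_tiles adds one extra tile per workgroup.
    sk_tiles = min(total_tiles, remainder + full_tiles * num_wgs)
    dp_tiles = total_tiles - sk_tiles
    return sk_tiles, dp_tiles


if __name__ == "__main__":
    # 11 tiles on 4 workgroups: remainder = 3 = WGs - 1 (the "large remainder" case).
    for full in (0, 1, 1000):
        sk, dp = split_tiles(11, 4, full)
        print(f"full_tiles={full}: stream-k tiles={sk}, data-parallel tiles={dp}")
```

In this model, a setting of 0 leaves 8 of the 11 tiles data-parallel, the default of 1 leaves only 4, and a very large value turns every tile into a stream-k tile.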
This change adds an environment variable that lets the user override the number of stream-k tiles selected. I developed a prediction model to select the best setting automatically, but it still needs to be ported from Tensile in a future change (this PR is just a first step to allow some experiments to run).
TENSILE_STREAMK_FULL_TILES=0 (runs remainder only)
TENSILE_STREAMK_FULL_TILES=1 (default, runs remainder + #WGs)
TENSILE_STREAMK_FULL_TILES=[large number] (can be used to make a 2-tile kernel double as the basic algorithm, i.e. StreamK=1)
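For experiments, the settings above could be swept with a small driver script. The sketch below is only an assumption about how one might do that: `sweep_envs` and `run_sweep` are hypothetical helpers, and the benchmark command is supplied by the caller (no particular hipBLASLt tool or flag is implied). Only the variable name TENSILE_STREAMK_FULL_TILES comes from this PR.

```python
import os
import subprocess

ENV_VAR = "TENSILE_STREAMK_FULL_TILES"  # variable name taken from this PR


def sweep_envs(values, base_env=None):
    """Build one environment dict per setting, leaving the base env untouched."""
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for value in values:
        if int(value) < 0:
            raise ValueError(f"{ENV_VAR} must be non-negative, got {value}")
        env = dict(base)
        env[ENV_VAR] = str(int(value))
        envs.append(env)
    return envs


def run_sweep(cmd, values):
    """Run a caller-supplied benchmark command once per setting.

    `cmd` is a hypothetical argument list, e.g. ["./my_benchmark", "--size", "4096"].
    Returns a list of (value, return_code) pairs.
    """
    results = []
    for value, env in zip(values, sweep_envs(values)):
        proc = subprocess.run(cmd, env=env, check=False)
        results.append((value, proc.returncode))
    return results
```

A call such as `run_sweep(cmd, [0, 1, 1000000])` would cover the remainder-only, default, and basic-algorithm settings with the same command.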