hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond that of a traditional BLAS library.
WaveSplitK is a feature that enables multiple threads within a wave to compute the same elements (in matrix D). Within a group of WaveSplitK threads, each thread computes a local partial sum over a different part of the summation index, and the partial sums are then reduced using shuffle operations.
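The idea can be illustrated with a small CPU simulation. This is only a sketch and not the hipBLASLt kernel: the function name `wave_split_k_dot`, the `group_size` parameter, the strided assignment of summation indices to lanes, and the shuffle-down reduction pattern are all assumptions made for illustration. It models one element of D as a dot product computed by a group of cooperating lanes.

```python
def wave_split_k_dot(a, b, group_size=4):
    """Simulate one output element computed by `group_size` cooperating lanes.

    Hypothetical illustration only; names and index layout are assumptions.
    """
    # The tree reduction below assumes a power-of-two group size.
    assert group_size > 0 and group_size & (group_size - 1) == 0
    assert len(a) == len(b)

    # Step 1: each lane accumulates a private partial sum over its own
    # strided slice of the summation index k (assumed layout).
    partial = [0.0] * group_size
    for lane in range(group_size):
        for k in range(lane, len(a), group_size):
            partial[lane] += a[k] * b[k]

    # Step 2: shuffle-down style reduction. In each round, lane i adds the
    # value held by lane i + offset; lanes past the end contribute 0.
    offset = group_size // 2
    while offset > 0:
        shuffled = [
            partial[i + offset] if i + offset < group_size else 0.0
            for i in range(group_size)
        ]
        partial = [p + s for p, s in zip(partial, shuffled)]
        offset //= 2

    # After log2(group_size) rounds, lane 0 holds the full sum.
    return partial[0]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [1.0] * 8
    print(wave_split_k_dot(a, b, group_size=4))
```

With a group of 4 lanes the reduction takes two shuffle rounds (offsets 2 and 1), after which lane 0 holds the complete sum for that element.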
InnerUnroll is currently used to enable wider local reads. For example, LocalReadVectorWidth = 2 with InnerUnroll = 2 results in reading 4 elements in one instruction.
Support dot2 fp16 (HPA) mac kernel for gfx942.
Tox tests passed on gfx942.