Optimize temp vgpr allocation for ClusterLocalRead

ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.

Closed nakajee closed 9 months ago

nakajee commented 9 months ago

This doesn't apply to f16 or bf16?

vgprPackTemp is not used for f16/bf16. This extra VGPR is needed only for 8-bit data packing, so there is no need to allocate it for f16/bf16.

nakajee commented 9 months ago

Previously, it was also allocated in the f16 case, where it just wasted a VGPR.
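
To illustrate the idea, here is a minimal sketch of conditional temp allocation. It is not the actual Tensile code: `VgprPool`, `check_out`, `needs_pack_temp`, and `alloc_local_read_temps` are hypothetical names made up for this example, and the 1-byte check is an assumption standing in for "8-bit data packing". Only the name `vgprPackTemp` comes from the discussion above.

```python
class VgprPool:
    """Toy register pool (hypothetical; not Tensile's real register pool)."""

    def __init__(self):
        self.next_free = 0   # index of the next unallocated VGPR
        self.in_use = {}     # tag -> (start index, count)

    def check_out(self, count, tag):
        """Reserve `count` consecutive VGPRs and return the first index."""
        start = self.next_free
        self.next_free += count
        self.in_use[tag] = (start, count)
        return start


def needs_pack_temp(bytes_per_element):
    """Assumption: only 8-bit (1-byte) element types need the packing temp."""
    return bytes_per_element == 1


def alloc_local_read_temps(pool, bytes_per_element):
    """Allocate vgprPackTemp only when the data type needs it.

    Returns the VGPR index, or None when no temp register is required
    (for example f16/bf16, which are 2 bytes per element).
    """
    if needs_pack_temp(bytes_per_element):
        return pool.check_out(1, "vgprPackTemp")
    return None


if __name__ == "__main__":
    f16_pool = VgprPool()
    print("f16 temp:", alloc_local_read_temps(f16_pool, 2),
          "VGPRs used:", f16_pool.next_free)

    i8_pool = VgprPool()
    print("8-bit temp:", alloc_local_read_temps(i8_pool, 1),
          "VGPRs used:", i8_pool.next_free)
```

In this sketch the f16 path leaves the pool untouched, while the 8-bit path reserves exactly one register, which is the saving described above.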