NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
8.59k stars 975 forks source link

wgmma.mma_async instructions are serialized due to non wgmma instructions defining accumulator registers of a wgmma between start and end of the pipeline stage #2116

Open LuoYuanke opened 2 months ago

LuoYuanke commented 2 months ago

Hi I got the warning when I compiling ptx code with command ptxas --gpu-name sm_90a -O0 wgmma_rs.ptx. Could someone help to explain what does it means and how to improve the ptx code to eliminate the warning?

Duplication steps:

Run ptxas --gpu-name sm_90a -O0 wgmma_rs.ptx and got below warning

ptxas info : (C7515) Potential Performance Loss: wgmma.mma_async instructions are serialized due to non wgmma instructions defining accumulator registers of a wgmma between start and end of the pipeline stage in the function 'selp_b16'

The source code is as below.

.version 8.4
.target sm_90a
.address_size 64

.visible .entry selp_b16 (
  .param .b16 param_0,
  .param .b16 param_1,
  .param .u32 param_2
)
{
.reg .f16x2 %f16a<40>, %f16d<40>;
.reg .f32   %f32d<40>;
.reg .b64   %descA, %descB;
.reg .pred  %scaleD;

wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16
  {%f32d0, %f32d1, %f32d2, %f32d3},
  {%f16a0, %f16a1, %f16a2, %f16a3},
  %descB,
  1, -1, -1, 1;

// mov.f32 %f32d0, 0f00000000;
// mov.b32 %f16a0, 128;
// mov.b32 %f16a1, 64;
// mov.b32 %f16a2, 32;
// mov.b32 %f16a3, 16;

wgmma.mma_async.sync.aligned.m64n72k16.f16.f16.f16
  {%f16d0, %f16d1,  %f16d2,  %f16d3,  %f16d4,  %f16d5,  %f16d6,  %f16d7,  %f16d8,
   %f16d9, %f16d10, %f16d11, %f16d12, %f16d13, %f16d14, %f16d15, %f16d16, %f16d17},
  %descA,
  %descB,
  %scaleD, -1, 1, 1, 0;
}
yuhengxnv commented 2 months ago

@LuoYuanke Not familiar with ptx, but I found that the warnings are gone with '-O2' instead of '-O0'.

LuoYuanke commented 2 months ago

Yes, it seems when compiling with -O0, compiler would generate "WARPGROUP.DEPBAR.LE gsb0, 0x0" to wait for wgmma finish. I think that's why compiler emit warning that the wgmma instructions are serialized.