flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Question about `use_tensor_cores = True or False` #520

Open · sleepwalker2017 opened 3 days ago

sleepwalker2017 commented 3 days ago

Hi, I'm benchmarking flashinfer on H100, running attention for the decoding stage.

I use q_head = kv_head = 40, which is the standard multi-head attention configuration for llama 13B.

I tried both use_tensor_cores = True and False and got nearly the same performance. In the nsys summaries below, the first run is use_tensor_cores = False (launching BatchDecodeWithPagedKVCacheKernel) and the second is use_tensor_cores = True (launching BatchPrefillWithPagedKVCacheKernel).
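For context, the setup looks roughly like the sketch below (not the exact benchmark script; the plan()/run() names follow recent flashinfer releases, older ones spell them begin_forward()/forward(), and all shapes are illustrative):

  import torch
  import flashinfer

  num_qo_heads = num_kv_heads = 40        # llama-13B-style MHA
  head_dim, page_size = 128, 16
  batch_size, pages_per_seq = 16, 64      # illustrative: kv length 1024 per request
  num_pages = batch_size * pages_per_seq

  q = torch.randn(batch_size, num_qo_heads, head_dim,
                  dtype=torch.half, device="cuda")
  # Paged KV cache, NHD layout: [num_pages, 2 (K/V), page_size, num_kv_heads, head_dim]
  kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                         dtype=torch.half, device="cuda")
  kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
  kv_indices = torch.arange(num_pages, dtype=torch.int32, device="cuda")
  kv_last_page_len = torch.full((batch_size,), page_size,
                                dtype=torch.int32, device="cuda")

  workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
  wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
      workspace, "NHD", use_tensor_cores=True)  # flip to False for the decode kernel
  wrapper.plan(kv_indptr, kv_indices, kv_last_page_len,
               num_qo_heads, num_kv_heads, head_dim, page_size,
               data_type=torch.float16)
  out = wrapper.run(q, kv_cache)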

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49025722         40  1225643.1  1222881.5   1219938   1231234       4342.4  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           374625          1   374625.0   374625.0    374625    374625          0.0  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosE…
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49189726         40  1229743.2  1229681.5   1228833   1231042        495.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           379936          1   379936.0   379936.0    379936    379936          0.0  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::Mas…
      0.0             1600          1     1600.0     1600.0      1600      1600          0.0  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…

My questions are:

  1. Is this result reliable? Does use_tensor_cores=True invoke the prefill kernel?

  2. I measured tensor core usage for both kernels with ncu, but found that they both use tensor cores. Why is that?

  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosEncodingMode)0, (unsigned int)2, (unsigned int)4, (unsigned int)8, (unsigned int)16, (unsigned int)1, (unsigned int)8, (flashinfer::PageStorage)0, __half, __half, __half, int>(T10 *, T13 *, flashinfer::paged_kv_t<T9, T11, T13>, flashinfer::kv_partition_info_t<T13>, T12 *, float *, bool *, bool, int, float, float, float, float) (74, 40, 1)x(16, 1, 8), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         0.11
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst       166688
    ---------------------------------------------------------------------- ----------- ------------
  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::MaskMode)0, (flashinfer::PosEncodingMode)0, (unsigned int)1, (unsigned int)8, (unsigned int)2, (unsigned int)1, (unsigned int)4, (flashinfer::PageStorage)0, __half, __half, float, __half, int>(T14 *, T14 *, T14 *, T10 *, flashinfer::paged_kv_t<T9, T11, T14>, T14 *, unsigned char *, T14 *, T14 *, T14 *, T13 *, float *, bool *, T14 *, bool, flashinfer::uint_fastdiv, int, float, float, float, float) (74, 1, 40)x(32, 1, 4), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         3.60
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst      4899840
    ---------------------------------------------------------------------- ----------- ------------
yzh119 commented 3 days ago

Hi @sleepwalker2017 , thanks for doing the benchmark!

  1. Yes, use_tensor_cores=True invokes the prefill kernel, and it's reasonable that you get nearly the same performance, because decode attention is IO-bound (see the back-of-the-envelope sketch after this list).
  2. That's interesting! I don't manually use tensor cores in flashinfer's decode kernels, but I wasn't sure whether nvcc can do some clever optimization that turns some reductions into tensor core instructions. So I checked the SASS of the decode kernels (you can dump SASS with cuobjdump -sass *.o), and I can confirm there are NO HMMA instructions in the decode kernels' generated SASS (whereas the prefill kernels' SASS contains many). Perhaps ncu counts some other operations in this pipe.
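
To make the IO-bound point in 1 concrete, here is a back-of-the-envelope sketch (my numbers, not measurements from this thread): with fp16 MHA (num_qo_heads == num_kv_heads), each decode step reads the entire K/V cache once while doing only two thin matrix-vector products per head, so the arithmetic intensity is about 1 FLOP per byte:

  # fp16 batch decode attention with MHA, per request and per decode step
  seq_len, num_heads, head_dim = 1024, 40, 128          # example sizes

  flops = 4 * seq_len * num_heads * head_dim            # q@k^T and softmax(...)@v, 2 FLOPs/MAC
  bytes_moved = 2 * seq_len * num_heads * head_dim * 2  # K and V read once, 2 bytes/element

  print(flops / bytes_moved)   # 1.0 FLOP/byte, independent of the sizes above

  # H100 SXM: ~989e12 dense fp16 tensor-core FLOP/s vs ~3.35e12 B/s of HBM3 bandwidth,
  # so a kernel only becomes compute-bound above roughly 295 FLOP/byte.
  print(989e12 / 3.35e12)      # ~295

At ~1 FLOP/byte the kernel is deep in the memory-bound regime either way, which is why the two code paths take essentially the same time. For GQA models (num_qo_heads > num_kv_heads) the intensity grows with the group size, which is where use_tensor_cores=True is expected to pay off.

And for the SASS check in 2, a small hypothetical convenience wrapper around the cuobjdump command mentioned above:

  # Hypothetical helper: count HMMA (fp16 tensor-core) instructions in the SASS
  # that `cuobjdump -sass` emits for a compiled object file.
  import subprocess

  def count_hmma(obj_path: str) -> int:
      sass = subprocess.run(["cuobjdump", "-sass", obj_path],
                            capture_output=True, text=True, check=True).stdout
      return sum(line.count("HMMA") for line in sass.splitlines())

  # e.g. count_hmma("batch_decode.o")   -> 0 on the decode kernels
  #      count_hmma("batch_prefill.o")  -> many on the prefill kernels
  # (object file names are placeholders)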