How was the data in the blog measured?

cloudhan commented 5 months ago

In the 2024-02-02 blog post, for example

I tried to reproduce it with ncu for numseq 1 and seqlen 16384 on an RTX 4090:

  void vllm::paged_attention_v2_kernel<unsigned short, (int)128, (int)16, (int)128, (int)512>(float *, float *, T1 *, const T1 *, const T1 *, const T1 *, int, float, const int *, const int *, int, const float *, int, int, int) (32, 1, 32)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond        10.24
    SM Frequency            cycle/nsecond         2.23
    Elapsed Cycles                  cycle       608178
    Memory Throughput                   %        94.59
    DRAM Throughput                     %        94.59
    Duration                      usecond       272.16
...

  void vllm::paged_attention_v2_reduce_kernel<unsigned short, (int)128, (int)128, (int)512>(T1 *, const float *, const float *, const T1 *, const int *, int) (32, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond        10.11
    SM Frequency            cycle/nsecond         2.20
    Elapsed Cycles                  cycle        10792
    Memory Throughput                   %         5.89
    DRAM Throughput                     %         5.89
    Duration                      usecond         4.90
...

The DRAM throughput ncu reports here is definitely not as low as 70%-ish. Could you please share more details about the measurement or, better, the benchmark code? Are you measuring the timing with events?

cloudhan commented 5 months ago

For further reference, here is seqlen 32768:

  void vllm::paged_attention_v2_kernel<unsigned short, (int)128, (int)16, (int)128, (int)512>(float *, float *, T1 *, const T1 *, const T1 *, const T1 *, int, float, const int *, const int *, int, const float *, int, int, int) (32, 1, 64)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond        10.24
    SM Frequency            cycle/nsecond         2.23
    Elapsed Cycles                  cycle      1196860
    Memory Throughput                   %        95.66
    DRAM Throughput                     %        95.66
    Duration                      usecond       535.52
...

  void vllm::paged_attention_v2_reduce_kernel<unsigned short, (int)128, (int)128, (int)512>(T1 *, const float *, const float *, const T1 *, const int *, int) (32, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond        10.15
    SM Frequency            cycle/nsecond         2.21
    Elapsed Cycles                  cycle        12643
    Memory Throughput                   %        10.78
    DRAM Throughput                     %        10.78
    Duration                      usecond         5.73
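
As a quick sanity check on the two profiles above, here is a minimal sketch that recomputes the kernel duration from Elapsed Cycles and SM Frequency and compares the two sequence lengths. It only uses numbers copied from the ncu output for paged_attention_v2_kernel; the helper name `duration_from_cycles_us` is hypothetical, and treating `cycle/nsecond` as GHz is the only unit assumption.

```python
# Values copied from the ncu output above (paged_attention_v2_kernel only).
runs = {
    16384: {"cycles": 608178, "sm_freq_cycles_per_ns": 2.23, "duration_us": 272.16},
    32768: {"cycles": 1196860, "sm_freq_cycles_per_ns": 2.23, "duration_us": 535.52},
}


def duration_from_cycles_us(cycles, sm_freq_cycles_per_ns):
    """Hypothetical helper: cycles / (cycles per ns) gives ns, then convert to us."""
    return cycles / sm_freq_cycles_per_ns / 1000.0


# Relative gap between the recomputed and the reported duration, per seqlen.
relative_gap = {}
for seqlen, run in runs.items():
    estimate = duration_from_cycles_us(run["cycles"], run["sm_freq_cycles_per_ns"])
    relative_gap[seqlen] = abs(estimate - run["duration_us"]) / run["duration_us"]
    print(f"seqlen={seqlen}: recomputed {estimate:.2f} us, reported {run['duration_us']:.2f} us")

# How the reported duration scales when seqlen doubles.
scaling = runs[32768]["duration_us"] / runs[16384]["duration_us"]
print(f"duration ratio for 2x seqlen: {scaling:.3f}")
```

The recomputed durations agree with the reported ones to within the rounding of the SM frequency, and the duration grows almost linearly with seqlen, which is what one would expect from a kernel whose time is dominated by reading the KV cache.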

yzh119 commented 5 months ago

@cloudhan we are using nvbench for all benchmarks.

yzh119 commented 5 months ago

The FlashInfer benchmark code is available at: https://github.com/flashinfer-ai/flashinfer/tree/main/src

You can compile it yourself:

  mkdir build
  cp cmake/config.cmake build/
  cd build
  cmake ..
  make -j$(nproc)

yzh119 commented 5 months ago

Also, the memory throughput reported by ncu (I believe you are using ncu) is different from the throughput utilization metric we are using.
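
To illustrate how such a number can differ from the ncu percentage, here is a minimal sketch of one plausible definition: the bytes of KV cache a decode kernel must read, divided by the elapsed time and by the nominal peak bandwidth. This definition is an assumption, not the blog's confirmed method. Also assumed: 32 KV heads (read off the grid dimension `(32, 1, 32)` and taken to equal the number of query heads), head dimension 128 and a 2-byte dtype (read off the template arguments), and 1008 GB/s as the nominal RTX 4090 peak. The helper names are hypothetical; the durations are the ones from the ncu output earlier in this thread.

```python
def kv_cache_bytes(seq_len, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of K and V that a single-sequence decode step has to read."""
    return 2 * seq_len * num_kv_heads * head_dim * dtype_bytes


def bandwidth_utilization(num_bytes, duration_us, peak_gb_per_s):
    """Achieved GB/s divided by the nominal peak GB/s (1 GB = 1e9 bytes)."""
    achieved_gb_per_s = num_bytes / (duration_us * 1e-6) / 1e9
    return achieved_gb_per_s / peak_gb_per_s


PEAK_GB_PER_S = 1008.0  # assumption: nominal RTX 4090 memory bandwidth
NUM_KV_HEADS = 32       # assumption: same as the number of query heads
HEAD_DIM = 128          # from the kernel template arguments
DTYPE_BYTES = 2         # unsigned short, i.e. a 16-bit element type

# (seqlen, paged_attention_v2_kernel duration in us) taken from the ncu output.
measurements = [(16384, 272.16), (32768, 535.52)]

utilization = {}
for seq_len, duration_us in measurements:
    num_bytes = kv_cache_bytes(seq_len, NUM_KV_HEADS, HEAD_DIM, DTYPE_BYTES)
    utilization[seq_len] = bandwidth_utilization(num_bytes, duration_us, PEAK_GB_PER_S)
    print(f"seqlen={seq_len}: {utilization[seq_len]:.1%} of nominal peak")
```

A figure computed this way depends on what is counted as useful bytes, which duration is used (one kernel, both kernels, or a wall-clock measurement that includes launch overhead), and which peak is taken as the denominator, so it does not have to match the hardware-counter percentage that ncu reports.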

cloudhan commented 5 months ago

@yzh119 Thanks for the quick reply. It is a little bit clearer now. Is there any branch that hosts the vLLM part of the benchmark code? Current main only has the FlashInfer code, and I don't want to replicate it myself in case something goes wrong due to the intricate nature of every benchmark ;)
