sleepwalker2017 opened this issue 3 days ago
Hi @sleepwalker2017, thanks for doing the benchmark!

`use_tensor_cores=True` invokes the prefill kernel, and it's reasonable that you get nearly the same performance because decode operations are IO bound.

I dumped the generated SASS (`cuobjdump -sass *.o`), and I can confirm there are NO HMMA instructions in the decode kernels' SASS (and if you check the prefill kernels' SASS, there will be many of them). Perhaps ncu counts some other operations in this pipe.
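For anyone who wants to repeat this check, here is a minimal sketch (not the exact commands used here; the object file name is hypothetical) that dumps the SASS with `cuobjdump` and counts HMMA instructions:

```python
# Sketch: count HMMA (tensor-core MMA) instructions in a kernel's SASS.
# Assumes the CUDA toolkit's cuobjdump is on PATH and that you have the
# compiled object file of the kernel you want to inspect.
import subprocess

def count_hmma(obj_path: str) -> int:
    """Dump SASS with cuobjdump and count lines containing HMMA."""
    sass = subprocess.run(
        ["cuobjdump", "-sass", obj_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in sass.splitlines() if "HMMA" in line)

# A decode kernel should report 0; a prefill kernel should report many.
print(count_hmma("batch_decode_kernel.o"))  # hypothetical file name
```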
Hi, I'm benchmarking flashinfer on H100, and I'm running attention for the decoding stage.
I use q_head = kv_head = 40, which is the standard attention for llama 13B.
I tried use_tensor_cores = True and False, and I get nearly the same performance.
My questions are:
Is this result reliable? If use_tensor_cores=True, will it invoke the prefill kernel?
I tested the tensor core usage for both kernels, but found that they both use tensor cores. Why is that?
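Roughly, the benchmark looks like this (a minimal sketch, not the exact script; head_dim, kv_len, and the timing loop are placeholders, and it assumes the installed flashinfer version exposes a `use_tensor_cores` flag on `single_decode_with_kv_cache`):

```python
# Sketch of a single-request decode microbenchmark with 40 query heads and
# 40 KV heads (llama-13B-style attention). Shapes and iteration counts are
# illustrative only.
import torch
import flashinfer

num_qo_heads = num_kv_heads = 40
head_dim = 128   # assumed
kv_len = 4096    # assumed

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

for use_tc in (False, True):
    # Warm up, then time with CUDA events.
    for _ in range(10):
        flashinfer.single_decode_with_kv_cache(q, k, v, use_tensor_cores=use_tc)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        flashinfer.single_decode_with_kv_cache(q, k, v, use_tensor_cores=use_tc)
    end.record()
    torch.cuda.synchronize()
    print(f"use_tensor_cores={use_tc}: {start.elapsed_time(end) / 100:.3f} ms/iter")
```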