ggerganov / llama.cpp

LLM inference in C/C++
MIT License

test-backend-ops performance numbers incorrect #8898

Open JohannesGaessler opened 1 month ago

JohannesGaessler commented 1 month ago

I noticed that for the CUDA backend on an RTX 3090 the reported achieved memory bandwidth for matrix multiplication can be much greater than 936 GB/s (the hardware maximum). Therefore, there must be a bug in how these numbers are calculated.

slaren commented 1 month ago

It is not clear to me how to calculate the memory bandwidth of a matrix multiplication, so I settled on counting all the memory accesses that are typically necessary for an $O(N^3)$ matrix multiplication. This does not take cache effects into account, so it can result in very high bandwidth numbers being reported. To adjust this, test_mul_mat::op_size would need to be changed to report the amount of memory actually accessed by the operation.

JohannesGaessler commented 1 month ago

I think the correct way to calculate the memory bandwidth for any operation is to simply sum the input and output sizes and divide by the total runtime. For matrix multiplications with large matrices the effective memory bandwidth will be low, but you are going to be compute bound in that case anyway; I think it would then make more sense to report FLOPS.

slaren commented 1 month ago

I don't have a strong opinion about this either way; if you think it would be more useful to calculate it another way, feel free to open a PR to change it. Summing the input and output sizes is what the default implementation of op_size does, but it is overridden for the mul_mat operation.