Open JohannesGaessler opened 1 month ago
It is not clear to me how to calculate the memory bandwidth of a matrix multiplication, so I settled for counting all the memory accesses that are typically necessary for an $O(N^3)$ matrix multiplication. This does not account for cache effects, so it can result in very high bandwidth numbers being reported. To adjust this, test_mul_mat::op_size would need to be changed to report the amount of memory actually accessed by the operation.
I think the correct way to calculate the memory bandwidth of any operation is to simply sum the input and output sizes and divide the sum by the total runtime. For matrix multiplications with large matrices the effective memory bandwidth will be low, but in that case you are compute bound anyway; I think it would then make more sense to report FLOPS instead.
I don't have a strong opinion about this either way; if you think it would be more useful to calculate this some other way, feel free to open a PR to change it. Summing the input and output sizes is what the default implementation of op_size does, but it is overridden for the mul_mat operation.
I noticed that for the CUDA backend on an RTX 3090 the reported memory bandwidth for matrix multiplication can be much greater than 936 GB/s (the hardware maximum). Therefore, there must be a bug in how these numbers are calculated.