Closed by bssrdf 4 weeks ago
It's absolutely true that the CPY kernels have poor memory access patterns; they are essentially just ports of the CPU code with minimal changes. Because they are also needed for copies from scalar to quantized data, an improved version would need intra-warp communication between threads. It should not be difficult to improve the current code; I have not yet put any effort towards it since I expect the gains to be minimal, but I will happily review any related PRs.
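For reference, a minimal sketch of the kind of warp-cooperative scalar-to-quantized copy being described: one warp handles one quantization block, each lane loads one element so the global reads stay coalesced, and the per-block scale comes from a `__shfl_xor_sync` reduction, which is the intra-warp communication mentioned above. The `block_i8` layout and `QK` value are simplified placeholders for illustration, not ggml's actual quantized block types.

```cuda
// Sketch only: warp-cooperative float -> 8-bit block copy.
// The block layout is a stand-in, not ggml's block_q8_0.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

#define QK 32  // elements per quantized block (assumption for this sketch)

struct block_i8 {
    float  d;       // per-block scale
    int8_t qs[QK];  // quantized values
};

static __global__ void cpy_f32_i8_warp(const float * __restrict__ src,
                                       block_i8    * __restrict__ dst,
                                       int nblocks) {
    const int lane = threadIdx.x % warpSize;
    const int wid  = (blockIdx.x*blockDim.x + threadIdx.x) / warpSize;
    if (wid >= nblocks) {
        return; // whole warp exits together, so the full shuffle mask below is safe
    }

    // coalesced load: each lane reads one element of the 32-wide block
    const float x = src[wid*QK + lane];

    // intra-warp butterfly reduction of the absolute maximum via shuffles
    float amax = fabsf(x);
    for (int offset = warpSize/2; offset > 0; offset /= 2) {
        amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, offset));
    }

    const float d = amax > 0.0f ? amax / 127.0f : 1.0f;

    dst[wid].qs[lane] = (int8_t) roundf(x / d);
    if (lane == 0) {
        dst[wid].d = d;
    }
}

int main() {
    const int nblocks = 1024;
    float    * src; cudaMalloc(&src, nblocks*QK*sizeof(float));
    block_i8 * dst; cudaMalloc(&dst, nblocks*sizeof(block_i8));
    cudaMemset(src, 0, nblocks*QK*sizeof(float));
    cpy_f32_i8_warp<<<(nblocks*32 + 255)/256, 256>>>(src, dst, nblocks);
    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}
```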
Please review https://github.com/ggerganov/ggml/pull/996. Thanks.
The thread block size `CUDA_CPY_BLOCK_SIZE`, currently set to 32, is too small for certain GPUs, e.g. the 4090. Nsight Compute reports "The 6.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the hardware maximum of 12. This kernel's theoretical occupancy (50.0%) is limited by the number of blocks that can fit on the SM." and "On average, each warp of this kernel spends 2.7 cycles being stalled waiting on a fixed latency execution dependency. Typically, this stall reason should be very low and only shows up as a top contributor in already highly optimized kernels. Try to hide the corresponding instruction latencies by increasing the number of active warps, restructuring the code or unrolling loops. Furthermore, consider switching to lower-latency instructions, e.g. by making use of fast math compiler options. This stall type represents about 34.5% of the total average of 7.9 cycles between issuing two instructions."

Increasing `CUDA_CPY_BLOCK_SIZE` to 64 or 96 solved both problems and gave a small performance boost. I am not sure whether this should be done only for certain SM architectures; I can only test on a 4090.
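One possible way to make this architecture-dependent would be to pick the block size at runtime from the device's compute capability. The sketch below is only an illustration; the cc 8.9 threshold and the value 64 are taken from the 4090 measurement above and have not been validated on other GPUs.

```cuda
// Sketch only: choose the copy-kernel block size per device instead of a
// single compile-time constant. Threshold and values are assumptions.
#include <cuda_runtime.h>

static int cpy_block_size(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int cc = prop.major*10 + prop.minor;
    return cc >= 89 ? 64 : 32;  // 64 improved occupancy on the 4090 (cc 8.9)
}
```

The launch code would then use the returned value both as the block dimension and in the grid-size calculation, instead of the fixed `CUDA_CPY_BLOCK_SIZE` constant.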