Bruce-Lee-LY / cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
MIT License
290 stars 66 forks

Incorrect results with enable_check 1 #12

Closed. cokeshao closed this issue 2 months ago

cokeshao commented 2 months ago

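Great work! I ran into a problem while testing. Running ./build/hgemm -gpu_rank 7 -enable_check 1 shows that Max diff and Avg diff are both nonzero for every kernel except Cublas-Tensor-Op. Why is that?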

(cuda_learning) root@dcd8e6348145:~/cuda_hgemm# ./build/hgemm -gpu_rank 7 -enable_check 1
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:55 main] CUDA HGEMM start with 128 CPU processes on the 7-th GPU: NVIDIA GeForce RTX 4090
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:61 main] CUDA driver version / runtime version: 12.2 / 12.1
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:63 main] CUDA capability major/minor version number: 8.9
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:64 main] 128 multiprocessors, 128 CUDA cores/MP: 16384 CUDA cores
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:67 main] GPU max clock rate: 2520 MHz (2.52 GHz)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:69 main] Memory clock rate: 10501 MHz (10.50 GHz)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:71 main] Memory bus width: 384-bit
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:72 main] Total amount of global memory: 24217 MBytes (25393692672 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:74 main] Total amount of constant memory: 64 KBytes (65536 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:76 main] Total amount of shared memory per block: 48 KBytes (49152 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:78 main] Total shared memory per multiprocessor: 100 KBytes (102400 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:80 main] L2 cache size: 73728 KBytes (75497472 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:82 main] Total number of registers available per block: 65536
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:83 main] Warp size: 32
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:84 main] Max number of threads per multiprocessor: 1536
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:85 main] Max number of threads per block: 1024
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:86 main] Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:88 main] Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:91 main] A (512 x 1024) B (1024 x 2048) = C (512 x 2048)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 main.cu:92 main] Profiling: enable wmma: 1, enable mma: 1, warmup iterations: 1, profiling iterations: 10, sleep duration: 100 ms, enable check: 1
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:37 Matrix] Matrix A: 512 1024, cpu: 0x5644e40914a0, gpu: 0x7faa32c00000
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:37 Matrix] Matrix B: 1024 2048, cpu: 0x7faa56611010, gpu: 0x7faa32e00000
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:37 Matrix] Matrix C: 512 2048, cpu: 0x5644e4192ff0, gpu: 0x7faa33200000
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:37 Matrix] Matrix Base: 512 * 2048, cpu: 0x5644e4393c80, gpu: 0x7faa33400000
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:42 Tester] Cublas-Tensor-Op use: 62.584 ms
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Cublas-Tensor-Op -----------------
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.570 ms
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.000000, avg diff: 0.000000
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:121 profile] Cublas-Tensor-Op exit, profiling time: 0.024 ms (100.00%), throughput: 89.075 TFLOPS (100.00%)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Wmma-Padding -----------------
[HGEMM 2024-08-20 08:55:19 1371981:1371981 wmma_padding.cu:187 initWmmaPadding] smem_max_size: 68 KBytes (69632 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.881 ms
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:121 profile] Wmma-Padding exit, profiling time: 0.049 ms (205.08%), throughput: 43.433 TFLOPS (48.76%)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async -----------------
[HGEMM 2024-08-20 08:55:19 1371981:1371981 wmma_async.cu:198 initWmmaAsync] smem_max_size: 68 KBytes (69632 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.963 ms
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:121 profile] Wmma-Async exit, profiling time: 0.052 ms (214.48%), throughput: 41.530 TFLOPS (46.62%)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async-Pg2s -----------------
[HGEMM 2024-08-20 08:55:19 1371981:1371981 wmma_async_pg2s.cu:273 initWmmaAsyncPg2s] smem_max_size: 68 KBytes (69632 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.712 ms
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:121 profile] Wmma-Async-Pg2s exit, profiling time: 0.039 ms (161.41%), throughput: 55.184 TFLOPS (61.95%)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async-Pg2s-Ps2r -----------------
[HGEMM 2024-08-20 08:55:19 1371981:1371981 wmma_async_pg2s_ps2r.cu:337 initWmmaAsyncPg2sPs2r] smem_max_size: 68 KBytes (69632 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:82 evaluate] Warm up time: 1.238 ms
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:121 profile] Wmma-Async-Pg2s-Ps2r exit, profiling time: 0.041 ms (169.91%), throughput: 52.425 TFLOPS (58.85%)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async-Stage2 -----------------
[HGEMM 2024-08-20 08:55:19 1371981:1371981 wmma_async_stage2.cu:366 initWmmaAsyncStage2] smem_max_size: 68 KBytes (69632 Bytes)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.945 ms
[HGEMM 2024-08-20 08:55:19 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:121 profile] Wmma-Async-Stage2 exit, profiling time: 0.037 ms (154.66%), throughput: 57.594 TFLOPS (64.66%)
[HGEMM 2024-08-20 08:55:19 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async-Stage3 -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 wmma_async_stage3.cu:444 initWmmaAsyncStage3] smem_max_size: 90 KBytes (92160 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 1.206 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Wmma-Async-Stage3 exit, profiling time: 0.032 ms (132.44%), throughput: 67.257 TFLOPS (75.51%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Mma-Permuted -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 mma_permuted.cu:215 initMmaPermuted] smem_max_size: 64 KBytes (65536 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.771 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Mma-Permuted exit, profiling time: 0.049 ms (202.83%), throughput: 43.917 TFLOPS (49.30%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Mma-Async -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 mma_async.cu:224 initMmaAsync] smem_max_size: 64 KBytes (65536 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.686 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Mma-Async exit, profiling time: 0.053 ms (217.96%), throughput: 40.868 TFLOPS (45.88%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Mma-Async-Pg2s -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 mma_async_pg2s.cu:314 initMmaAsyncPg2s] smem_max_size: 64 KBytes (65536 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 1.164 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Mma-Async-Pg2s exit, profiling time: 0.039 ms (161.79%), throughput: 55.057 TFLOPS (61.81%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Mma-Async-Pg2s-Ps2r -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 mma_async_pg2s_ps2r.cu:403 initMmaAsyncPg2sPs2r] smem_max_size: 64 KBytes (65536 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 1.141 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Mma-Async-Pg2s-Ps2r exit, profiling time: 0.038 ms (158.39%), throughput: 56.238 TFLOPS (63.14%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Mma-Async-Stage2 -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 mma_async_stage2.cu:438 initMmaAsyncStage2] smem_max_size: 64 KBytes (65536 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 1.194 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Mma-Async-Stage2 exit, profiling time: 0.038 ms (157.02%), throughput: 56.728 TFLOPS (63.69%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Mma-Async-Stage3 -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 mma_async_stage3.cu:533 initMmaAsyncStage3] smem_max_size: 72 KBytes (73728 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 0.910 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Mma-Async-Stage3 exit, profiling time: 0.031 ms (130.58%), throughput: 68.214 TFLOPS (76.58%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:72 evaluate] ----------------- Evaluating Mma-Async-Stage4 -----------------
[HGEMM 2024-08-20 08:55:20 1371981:1371981 mma_async_stage4.cu:628 initMmaAsyncStage4] smem_max_size: 96 KBytes (98304 Bytes)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:82 evaluate] Warm up time: 1.209 ms
[HGEMM 2024-08-20 08:55:20 1371981:1371981 matrix.h:113 checkValue] Max diff: 0.218750, avg diff: 0.013197
[HGEMM 2024-08-20 08:55:20 1371981:1371981 tester.h:121 profile] Mma-Async-Stage4 exit, profiling time: 0.032 ms (132.47%), throughput: 67.243 TFLOPS (75.49%)
[HGEMM 2024-08-20 08:55:20 1371981:1371981 main.cu:128 main] Done
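
For readers unfamiliar with the "Max diff" and "avg diff" figures in the log, the sketch below shows one plausible way such numbers are computed: the element-wise absolute difference between a kernel's output and a baseline, reduced to a maximum and a mean. This is an assumption for illustration only. The function name `max_avg_diff` is hypothetical, and the repository's `checkValue` in matrix.h may be implemented differently.

```python
from typing import Sequence, Tuple


def max_avg_diff(result: Sequence[float], baseline: Sequence[float]) -> Tuple[float, float]:
    """Return (max, mean) of the element-wise absolute differences.

    Hypothetical helper: assumes both sequences hold the flattened output
    matrix C and have the same, nonzero length.
    """
    if len(result) != len(baseline) or not result:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(r - b) for r, b in zip(result, baseline)]
    return max(diffs), sum(diffs) / len(diffs)


# Small made-up values, not taken from the log above.
print(max_avg_diff([1.0, 2.0, 3.0], [1.0, 2.25, 2.5]))  # (0.5, 0.25)
```

Under this reading, a diff of exactly zero for Cublas-Tensor-Op is what one would expect if the cuBLAS result itself serves as the baseline.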

zhiyu-deep commented 2 months ago

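This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.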

Bruce-Lee-LY commented 2 months ago

Some error in the results is normal. You can switch the accumulator to FP32 to get higher precision.
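
To see why the accumulator type matters, here is a minimal CPU-side sketch in Python that rounds the running sum of a dot product to FP16 or to FP32 after every addition. It is only a simplified model: real tensor core hardware accumulates in its own internal order and precision, and the helper names (`to_fp16`, `to_fp32`, `dot`) are hypothetical. The length K = 4096 is chosen to make the effect obvious and is not the K = 1024 from the log.

```python
import struct
from typing import Callable, Sequence


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a: Sequence[float], b: Sequence[float],
        round_acc: Callable[[float], float]) -> float:
    """Dot product with FP16 inputs; the running sum is rounded by round_acc."""
    acc = 0.0
    for x, y in zip(a, b):
        # Inputs are FP16; the product of two FP16 values is exact in a double.
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc


K = 4096
ones = [1.0] * K

# FP16 has an 11-bit significand, so above 2048 the spacing between
# representable values is 2 and adding 1.0 no longer changes the sum.
print(dot(ones, ones, to_fp16))  # 2048.0
print(dot(ones, ones, to_fp32))  # 4096.0
```

With an FP16 accumulator the sum stalls at 2048, while the FP32 accumulator reaches the exact answer. The same rounding effect, at a much smaller scale, is consistent with the nonzero diffs reported in the log.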