PaddleJitLab / CUDATutorial

A self-learning tutorial for CUDA high-performance programming.
Apache License 2.0

[Doc] Add Reduce Optimize Method: Unroll Strategy #16

Closed AndSonder closed 5 months ago

AndSonder commented 5 months ago

Add an unroll strategy to the Reduce kernel.

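| Optimization | Runtime (us) | Bandwidth (GB/s) | Speedup |
| --- | --- | --- | --- |
| Baseline | 3118.4 | 42.503 | ~ |
| Interleaved addressing | 1904.4 | 73.522 | 1.64 |
| Resolve bank conflicts | 1475.2 | 97.536 | 2.29 |
| Remove idle threads | 758.38 | 189.78 | 4.11 |
| Unroll the last warp | 484.01 | 287.25 | 6.44 |
| Complete unrolling | 477.23 | 291.77 | 6.53 |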
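
For readers unfamiliar with the last two strategies, here is a minimal sketch of what unrolling the last warp and complete unrolling can look like. It is not the code from this PR: the names `reduce_unrolled`, `warp_reduce`, `warp_step`, and `reduce_sum_gpu` are hypothetical, the block size of 256 is an assumption, and error checking is omitted for brevity. The explicit `__syncwarp()` calls are one way to keep the warp-level tail safe on GPUs with independent thread scheduling (Volta and later).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One step of the warp-level tail: every thread reads, the warp syncs,
// then every thread writes. (Hypothetical helper, not from the PR.)
__device__ __forceinline__ void warp_step(float* s, unsigned int tid, unsigned int off) {
    float v = s[tid] + s[tid + off];
    __syncwarp();
    s[tid] = v;
    __syncwarp();
}

// "Unroll the last warp": once 32 or fewer threads are active, drop the loop
// and the block-wide __syncthreads() and write the six steps out by hand.
__device__ void warp_reduce(float* s, unsigned int tid) {
    warp_step(s, tid, 32); warp_step(s, tid, 16); warp_step(s, tid, 8);
    warp_step(s, tid, 4);  warp_step(s, tid, 2);  warp_step(s, tid, 1);
}

// "Complete unrolling": BLOCK is a template parameter, so each
// `if (BLOCK >= ...)` below is resolved at compile time.
template <unsigned int BLOCK>
__global__ void reduce_unrolled(const float* in, float* out, unsigned int n) {
    static_assert(BLOCK >= 64 && BLOCK <= 1024 && (BLOCK & (BLOCK - 1)) == 0,
                  "BLOCK must be a power of two in [64, 1024]");
    __shared__ float sdata[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (BLOCK * 2) + tid;

    // Each thread adds two elements while loading, so no thread starts idle.
    float v = 0.0f;
    if (i < n) v = in[i];
    if (i + BLOCK < n) v += in[i + BLOCK];
    sdata[tid] = v;
    __syncthreads();

    if (BLOCK >= 1024) { if (tid < 512) sdata[tid] += sdata[tid + 512]; __syncthreads(); }
    if (BLOCK >= 512)  { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (BLOCK >= 256)  { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (BLOCK >= 128)  { if (tid < 64)  sdata[tid] += sdata[tid + 64];  __syncthreads(); }

    if (tid < 32) warp_reduce(sdata, tid);      // threads 0..31 form warp 0
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}

// Hypothetical host wrapper: launches the kernel, then adds the per-block
// partial sums on the CPU.
double reduce_sum_gpu(const std::vector<float>& h_in) {
    constexpr unsigned int BLOCK = 256;          // assumed block size
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int blocks = (n + 2 * BLOCK - 1) / (2 * BLOCK);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduce_unrolled<BLOCK><<<blocks, BLOCK>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    double total = 0.0;
    for (float p : partial) total += p;
    return total;
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);      // 2^20 ones
    std::printf("sum = %.1f\n", reduce_sum_gpu(data));
    return 0;
}
```

Making the block size a template parameter is what lets the compiler drop the untaken branches; the trade-off is that each supported block size needs its own kernel instantiation.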