Closed Hzfengsy closed 1 week ago
This PR adds simdgroup intrinsics and tensorization support to the DLight rule for Metal backends.
Performance comparison on a kernel from Llama-3 8B:
Before
batch_size: 16  Time (ms): 3.42  Total Bytes (MB): 225.00  Memory (GB/s): 64.17  GFLOP/s: 1022.11
batch_size: 32  Time (ms): 4.69  Total Bytes (MB): 226.00  Memory (GB/s): 47.07  GFLOP/s: 1492.82
batch_size: 64  Time (ms): 8.92  Total Bytes (MB): 228.00  Memory (GB/s): 24.96  GFLOP/s: 1569.52
After
batch_size: 16  Time (ms): 1.69  Total Bytes (MB): 225.00  Memory (GB/s): 130.40  GFLOP/s: 2077.10
batch_size: 32  Time (ms): 2.05  Total Bytes (MB): 226.00  Memory (GB/s): 107.78  GFLOP/s: 3418.39
batch_size: 64  Time (ms): 3.96  Total Bytes (MB): 228.00  Memory (GB/s): 56.17  GFLOP/s: 3531.63
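
For a quick read of the numbers above, the per-batch speedup can be derived from the reported kernel times. The snippet below is only an illustrative sketch and is not part of the PR: the dictionary names and the `speedup` helper are hypothetical, and the timings are copied by hand from the Before/After logs.

```python
# Kernel times in milliseconds, copied from the Before/After logs above.
# The variable and function names are illustrative, not part of the PR.
before_ms = {16: 3.42, 32: 4.69, 64: 8.92}
after_ms = {16: 1.69, 32: 2.05, 64: 3.96}


def speedup(before, after):
    """Return the before/after time ratio for each batch size."""
    return {bs: before[bs] / after[bs] for bs in before}


speedups = speedup(before_ms, after_ms)
for bs, ratio in speedups.items():
    print(f"batch_size={bs}: {ratio:.2f}x")
# prints:
# batch_size=16: 2.02x
# batch_size=32: 2.29x
# batch_size=64: 2.25x
```

With these timings, the kernel is roughly 2.0x to 2.3x faster across the three batch sizes, which is in line with the GFLOP/s figures reported in the logs.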