AnswerDotAI / gpu.cpp

A lightweight library for portable low-level GPU computation using WebGPU.
https://gpucpp.answer.ai
Apache License 2.0

Only Get 0.8Tflops on Example gpu.cpp/examples/matmul, while >= 2.9TFlops is expected #40

Closed ghostplant closed 2 months ago

ghostplant commented 2 months ago

Machine: Macbook Air M2

Theoretical GPU TFlops: 3.6TFlops

Actual TFlops:


[info] Dispatching Kernel version 7, 30 iterations ...
[info] Copying result to CPU
[info] 

Output[0] (4096, 8192)

    75.62   -41.09    44.66   -38.75 ..    77.54  -148.38   118.80    11.59
   -51.14    41.22    63.68   -85.85 ..    10.33    46.63   -37.63   -44.94
    40.48    47.92    -0.86    56.20 ..    28.35    80.12   -62.48   -70.48
    90.25  -125.13   -51.20    64.34 ..   -17.11   -20.05   -58.04    18.76
...
    40.71    83.20  -107.86   -51.57 ..    40.67   -34.96  -117.75   115.25
    -9.16   -35.50   125.30    20.48 ..   -95.57    53.38  -129.10    76.58
     6.63   -66.93    30.76   -35.62 ..    -9.20   -59.73     7.04    19.37
     9.48   -27.52    -9.45   -12.71 ..    33.74   -79.34   -68.20   -22.78

[info] 

================================================================================
Execution Time: (M = 4096, K = 4096, N = 8192) x 30 iterations :
643.3 milliseconds / dispatch ~ 887.28 GFLOPS
================================================================================
junjihashimoto commented 2 months ago

The memory bandwidth is 100 GB/s, and it is unified memory shared by the CPU and GPU. 0.8 TFlops might be fast enough; it's hard to reach 100% of peak with matmul.
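
How much that bandwidth actually limits this GEMM depends on how much operand reuse the kernel gets. Here is a rough sketch of the two extremes (assuming fp32 operands and the ~100 GB/s and 3.6 TFlops figures above; the numbers are illustrative, not from the example code):

# Rough bounds for the M = 4096, K = 4096, N = 8192 GEMM above,
# assuming fp32 operands, ~100 GB/s bandwidth, 3.6 TFlops peak.
M, K, N = 4096, 4096, 8192
flops = 2 * M * K * N
peak_flops = 3.6e12   # quoted M2 GPU peak
bandwidth = 100e9     # unified-memory bandwidth, bytes/s

# No reuse: every output element streams a full row of A and a full column of B.
naive_bytes = (2 * M * N * K + M * N) * 4
naive_bound = bandwidth * flops / naive_bytes

# Perfect reuse: each operand is read once, the result written once.
ideal_bytes = (M * K + K * N + M * N) * 4
ideal_bound = min(bandwidth * flops / ideal_bytes, peak_flops)

print(f"no-reuse bound:      {naive_bound / 1e9:.1f} GFLOPS")   # ~25 GFLOPS
print(f"perfect-reuse bound: {ideal_bound / 1e9:.1f} GFLOPS")   # ~3600 GFLOPS (compute cap)

A kernel with no reuse would top out around 25 GFLOPS at this bandwidth, so the measured ~887 GFLOPS already implies substantial tiling/reuse; how much of the remaining gap to peak is reachable is a tuning question.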

austinvhuang commented 2 months ago

I'll be working on a small module to sweep tiling + workgroup parameters in the near future to automate environment-specific tuning (in the meantime they have to be tweaked by hand). Those parameters haven't been tuned for an M2 Air.

That said, 887/3600 ~ 25% of theoretical maximum is around the same ratio I'm getting with my M1 (~2.5 TFlops out of a 10.4 TFlops theoretical max), so it's not far off from what we're seeing. It would be good to see what we get for NumPy / PyTorch on Apple silicon so we have some sense of what performance is reasonable to expect and how much more headroom we have to improve.

ghostplant commented 2 months ago

Using PyTorch MPS, the Apple M2 can get 2.9 TFlops on the same GEMM size.

austinvhuang commented 2 months ago

Good to know - we have more room for improvement on that example then. One reference implementation we should check is the TensorFlow.js WGSL implementation, which is relatively optimized.

junjihashimoto commented 2 months ago

It is indeed compute bound, so it may be possible to reach more than 2.9 TFlops. Thx! https://jax.readthedocs.io/en/latest/pallas/tpu/matmul.html

def matmul_flops(m, k, n):
  return 2 * m * k * n

def matmul_membw(m, k, n):
  return (m * k + k * n + m * n) * 4

def matmul_flops_intensity(m, k, n):
  flops = matmul_flops(m, k, n)
  membw = matmul_membw(m, k, n)
  return flops / membw

# 3.6TFlops
m2_flops = 3.6e12
# 100GB/s
m2_membw = 100e9
# flops / byte
m2_op_intensity = m2_flops / m2_membw

print(f"m2_op_intensity: {m2_op_intensity} flops/byte")
#m2_op_intensity: 36.0 flops/byte

print(f"matmul_op_intensity: {matmul_flops_intensity(4096, 4096, 8192)} flops/byte")
#matmul_op_intensity: 819.2 flops/byte

# m2_op_intensity(36.0 flops/byte) is less than matmul_op_intensity(819.2 flops/byte).
# It is compute bound!
junjihashimoto commented 2 months ago

Where is the matmul kernel of MPS? This one is the naive kernel (it's not fast): https://github.com/pytorch/pytorch/blob/eca0cb0fbe84bb0a34fa94afe261bceecd52c436/aten/src/ATen/native/mps/operations/LinearAlgebra.mm#L32-L81

ghostplant commented 2 months ago

I get 2.9TFlops using this Torch code:

#!/usr/bin/env python3

import os, sys
import argparse
import torch
import time

X = torch.arange(4096 * 4096, dtype=torch.float32).view([4096, 4096]).to('mps')
Y = torch.arange(4096 * 4096, dtype=torch.float32).view([4096, 4096]).to('mps')

def wait():
    torch.mps.synchronize()
    return time.perf_counter()

# warm-up runs before timing
torch.matmul(X, Y)
torch.matmul(X, Y)
torch.matmul(X, Y)

t0 = wait()
for i in range(10):
  torch.matmul(X, Y)
t1 = wait()

cost = (t1 - t0) / 10

print('TFlops:', 4096 * 4096 * 4096 * 2 / cost * 1e-12)
junjihashimoto commented 2 months ago

@ghostplant Thx! It is float32.

junjihashimoto commented 2 months ago

I checked the tests with arange and randn. I will look into why the latter version is downgraded.

The test with arange gets 5.7 TFlops on an M2 Pro. The test with randn originally got 1.9 TFlops on an M2 Pro; after the fix pointed out below, it also gets 5.7 TFlops.

$ cat test_arange.py
#!/usr/bin/env python3

import os, sys
import argparse
import torch
import time

X = torch.arange(4096 * 4096, dtype=torch.float32).view([4096, 4096]).to('mps')
Y = torch.arange(4096 * 8192, dtype=torch.float32).view([4096, 8192]).to('mps')

def wait():
    torch.mps.synchronize()
    return time.perf_counter()

# warm-up runs before timing
torch.matmul(X, Y)
torch.matmul(X, Y)
torch.matmul(X, Y)

t0 = wait()
for i in range(10):
  torch.matmul(X, Y)
t1 = wait()

cost = (t1 - t0) / 10

print('TFlops:', 4096 * 4096 * 8192 * 2 / cost * 1e-12)
$ python test_arange.py
TFlops: 5.726274617567384
$ cat test_randn.py
#!/usr/bin/env python3

import os, sys
import argparse
import torch
import time

X = torch.randn((4096, 4096), requires_grad=False, dtype=torch.float32).to("mps")
Y = torch.randn((4096, 8192), requires_grad=False, dtype=torch.float32).to("mps")
Z = torch.randn((4096, 8192), requires_grad=False, dtype=torch.float32).to("mps")

def wait():
    torch.mps.synchronize()
    return time.perf_counter()

with torch.no_grad():
  # warm-up runs before timing
  torch.matmul(X, Y)
  torch.matmul(X, Y)
  torch.matmul(X, Y)

  t0 = wait()
  niter = 30
  for i in range(niter):
    Z=torch.matmul(X, Y)
  t1 = wait()

  cost = (t1 - t0) / niter

  print('TFlops:', 4096 * 4096 * 8192 * 2 / cost * 1e-12)
$ python test_randn.py
TFlops: 5.760410971906142
ghostplant commented 2 months ago

They should have the same perf. I found you should use / 10 instead of / 30 in the second source code.

junjihashimoto commented 2 months ago

@ghostplant Thank you! I have updated the results above. Both are the same.

austinvhuang commented 2 months ago

Data dependence of matmul flops has been documented:

https://www.thonking.ai/p/strangely-matrix-multiplications

This might be the first time I've seen the behavior discussed for Apple silicon though (in the 30 iteration case).
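
A minimal sketch (not from the thread, assuming a PyTorch build with MPS support) of one way to probe that data dependence: compare all-zero against random fp32 inputs of the shapes discussed above.

#!/usr/bin/env python3
# Sketch only: compare matmul throughput for "predictable" (all-zero) vs.
# random fp32 inputs on MPS, using the same shapes as the thread.
import time
import torch

def bench_tflops(x, y, iters=30):
    # a few untimed runs to warm up the MPS pipeline
    for _ in range(3):
        torch.matmul(x, y)
    torch.mps.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(x, y)
    torch.mps.synchronize()
    t1 = time.perf_counter()
    m, k = x.shape
    n = y.shape[1]
    return 2 * m * k * n * iters / (t1 - t0) * 1e-12

zeros = bench_tflops(torch.zeros(4096, 4096, device="mps"),
                     torch.zeros(4096, 8192, device="mps"))
rand = bench_tflops(torch.randn(4096, 4096, device="mps"),
                    torch.randn(4096, 8192, device="mps"))
print(f"zeros: {zeros:.2f} TFlops, randn: {rand:.2f} TFlops")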

austinvhuang commented 2 months ago

We'll tackle + track this under @junjihashimoto's ongoing work on task 14 here: https://github.com/orgs/AnswerDotAI/projects/5/

Closing for now, but feel free to post follow-ups here or discuss further in Discord: https://discord.gg/zmJVhXsC7f

junjihashimoto commented 2 months ago

PyTorch's matmul is matrixMultiplicationWithPrimaryTensor:
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/matrixmultiplication(primary:secondary:name:)?changes=_1_7&language=objc
https://github.com/pytorch/pytorch/blob/eca0cb0fbe84bb0a34fa94afe261bceecd52c436/aten/src/ATen/native/mps/operations/LinearAlgebra.mm#L120
It is not OSS.