exaloop / codon

A high-performance, zero-overhead, extensible Python compiler using LLVM
https://docs.exaloop.io/codon

136x slower than Numba #224

Open pauljurczak opened 1 year ago

pauljurczak commented 1 year ago

I installed Codon 0.15.5 and codon-jit 0.1.3 on Ubuntu 22.04.2 with Python 3.10.6. I'm comparing this script with the Numba variant (the commented-out decorator):

import codon
import numba as nb
import numpy as np
import timeit as ti
from math import atan2, sqrt

@codon.jit(pyvars=['atan2', 'sqrt'])
# @nb.njit(fastmath=True, locals=dict(w=nb.uint32))
def getGradAngle(im, grad, angle, w):
  for i in range(im.size-w-1):
    dx = im[i+w+1]-im[i]
    dy = im[i+w]-im[i+1]
    grad[i] = sqrt(dx**2+dy**2)
    angle[i] = atan2(dy, dx)

w, h = 640, 480
im = np.random.rand(w*h).astype('f4')
grad = np.zeros_like(im)
angle = np.zeros_like(im)

fun = f'getGradAngle(im, grad, angle, w)'
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=10))
print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms  {np.sum(grad)}')

I'm getting this output from the Codon variant:

getGradAngle(im, grad, angle, w):  1139.133ms  1145.087ms  159771.875

and this from the Numba variant:

getGradAngle(im, grad, angle, w):   8.377ms   8.389ms  159693.890625

and this from the plain Python variant:

getGradAngle(im, grad, angle, w):  1129.810ms  1135.429ms  159641.578125

Why is the Codon variant slower than even plain Python?

dbritto-dev commented 1 year ago

Would you mind trying without pyvars?

pauljurczak commented 1 year ago

It will not compile without pyvars:

codon.codon_jit.JITError: /home/paul/st-python/bench/gh-2.py:15:15: name 'sqrt' is not defined
/home/paul/st-python/bench/gh-2.py:16:16: name 'atan2' is not defined

arshajii commented 1 year ago

Hi @pauljurczak -- Codon doesn't support NumPy yet (we're working on a Codon-native NumPy that's fully compiled), so that function is just operating on the Python objects within Codon, meaning there won't be any performance improvement.

FYI, you can also avoid pyvars by importing the math functions inside the @codon.jit function (JIT'd functions are compiled in their own environment, so they don't see external imports).

pauljurczak commented 1 year ago

importing the math functions inside the @codon.jit function

I did that:

@codon.jit
def getGradAngle(im, grad, angle, w):
  from math import atan2, sqrt

  for i in range(im.size-w-1):
    dx = im[i+w+1]-im[i]
    dy = im[i+w]-im[i+1]
    grad[i] = sqrt(dx**2+dy**2)
    angle[i] = atan2(dy, dx)

Performance improved just a bit, but it is still poor: about 127x slower than Numba:

getGradAngle(im, grad, angle, w):  1064.479ms  1070.543ms  159669.859375

arshajii commented 1 year ago

Yes, again this is expected at the moment until we add Codon-native NumPy. The NumPy arrays are being passed to the function as Python objects since there's no ndarray type in Codon, so the JIT'd code is just using the same CPython API calls that Python is using under the hood, leading to the same performance.

One possible workaround in the meantime is to use lists instead. The discussion here might also be of interest: https://github.com/exaloop/codon/discussions/228.
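
For example, here is a minimal sketch of the list-based idea (hypothetical, not from the original thread): the function builds and returns new lists instead of mutating its arguments, since list arguments may be converted (copied) into native Codon lists when crossing the JIT boundary.

import codon
import numpy as np

# Hypothetical sketch of the list-based workaround: build and return new lists
# rather than mutating arguments in place, since list arguments may be copied
# into native Codon lists on entry to the JIT'd function.
@codon.jit
def getGradAngleLists(im, w):
  from math import atan2, sqrt

  n = len(im)
  grad = [0.0] * n
  angle = [0.0] * n
  for i in range(n - w - 1):
    dx = im[i+w+1] - im[i]
    dy = im[i+w] - im[i+1]
    grad[i] = sqrt(dx**2 + dy**2)
    angle[i] = atan2(dy, dx)
  return grad, angle

w, h = 640, 480
im = np.random.rand(w*h).astype('f4')

grad, angle = getGradAngleLists(im.tolist(), w)  # pass a plain list of floats
print(sum(grad))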

marioroy commented 5 months ago

One possible workaround in the meantime is to use lists instead.

I took @pauljurczak's example and made a class. How can Codon return the sum of the self.grad list? 0.0 is incorrect.

import codon
import numpy as np
import timeit as ti

@codon.convert
class Foo:
  __slots__ = 'w', 'h', 'im', 'grad', 'angle'

  def __init__(self, w, h):
    im = np.random.rand(w*h).astype('f4')
    self.w = w
    self.h = h
    self.im = im.tolist()
    self.grad = np.zeros_like(im).tolist()
    self.angle = np.zeros_like(im).tolist()

  @codon.jit
  def getGradAngle(self):
    from math import atan2, sqrt

    for i in range(len(self.im)-self.w-1):
      dx = self.im[i+self.w+1]-self.im[i]
      dy = self.im[i+self.w]-self.im[i+1]
      self.grad[i] = sqrt(dx**2+dy**2)
      self.angle[i] = atan2(dy, dx)

  @codon.jit
  def getSum(self) -> float:
    return sum(self.grad)

foo = Foo(640, 480)
fun = f'foo.getGradAngle()'
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=10))
print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms  {foo.getSum()}')

Running:

# Python (@codon.jit lines commented out)
time python demo.py
foo.getGradAngle():  82.440ms  82.886ms  159764.3314357752

real  0m3.247s
user  0m3.140s
sys   0m0.101s

# codon.jit
time python demo.py
foo.getGradAngle():  16.248ms  16.366ms  0.0

real  0m2.516s
user  0m2.510s
sys   0m0.131s

The getSum method is also JIT'd, so I thought this would work. I'm running the develop branch at 725003c.
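
A hypothetical check (reusing the Foo class above) would be to compare a plain-Python sum of foo.grad with the JIT'd getSum; if both report 0.0, the in-place mutation inside the JIT'd getGradAngle never reached the Python-side list, otherwise the problem is on the getSum side.

# Hypothetical diagnostic, reusing the Foo class defined above
foo = Foo(640, 480)
foo.getGradAngle()
print(sum(foo.grad))  # plain-Python sum of the same list
print(foo.getSum())   # JIT'd sum (reported as 0.0 above)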

marioroy commented 5 months ago

Next, I tried a standalone demo.codon demonstration.

from python import numpy as np
from math import atan2, sqrt
from time import time

@tuple
class Foo:
  w: int
  h: int
  im: List[float]
  grad: List[float]
  angle: List[float]

  def __new__(w: int, h: int):
    im = np.random.rand(w*h).astype('f4')
    grad = np.zeros_like(im)
    angle = np.zeros_like(im)
    return Foo(w, h, im.tolist(), grad.tolist(), angle.tolist())

  def getGradAngle(self):
    for i in range(len(self.im)-self.w-1):
      dx = self.im[i+self.w+1]-self.im[i]
      dy = self.im[i+self.w]-self.im[i+1]
      self.grad[i] = sqrt(dx**2+dy**2)
      self.angle[i] = atan2(dy, dx)

  def getSum(self) -> float:
    return sum(self.grad)

foo = Foo(640, 480)
repeat = 10
t0 = time()

for i in range(repeat):
  foo.getGradAngle()

t1 = time()
print(f"foo.getGradAngle():  {(t1-t0)/repeat*1000:6.3f}ms  {foo.getSum():12.5f}")

Running:

# Set library path to a Python distribution containing NumPy
export CODON_PYTHON=~/miniconda3/envs/mandel/lib/libpython3.so

# Run
time ./codon run demo.codon 
foo.getGradAngle():  28.023ms  159758.66082

real  0m2.118s
user  0m2.009s
sys   0m0.113s

# Build a release binary (faster)
codon build -release demo.codon
time ./demo
foo.getGradAngle():   6.639ms  159908.73775

real  0m0.235s
user  0m0.154s
sys   0m0.086s

marioroy commented 5 months ago

Finally, I tried the Numba demonstration by @pauljurczak with cache=True.

import numba as nb
import numpy as np
from math import atan2, sqrt
from time import time

@nb.njit(fastmath=True, locals=dict(w=nb.uint32), cache=True)
def getGradAngle(im, grad, angle, w):
  for i in range(im.size-w-1):
    dx = im[i+w+1]-im[i]
    dy = im[i+w]-im[i+1]
    grad[i] = sqrt(dx**2+dy**2)
    angle[i] = atan2(dy, dx)

w, h = 640, 480
im = np.random.rand(w*h).astype('f4')
grad = np.zeros_like(im)
angle = np.zeros_like(im)

fun = f'getGradAngle(im, grad, angle, w)'
repeat = 10
t0 = time()

for i in range(repeat):
  getGradAngle(im, grad, angle, w)

t1 = time()
print(f"{fun}:  {(t1-t0)/repeat*1000:6.3f}ms  {np.sum(grad):12.5f}")

Running:

rm -fr __pycache__

# First run
time python numba_demo.py 
getGradAngle(im, grad, angle, w):  20.154ms  160076.46875

real  0m0.407s
user  0m0.369s
sys   0m0.036s

# 2nd run, using the cached JIT object
time python numba_demo.py 
getGradAngle(im, grad, angle, w):   8.302ms  160129.25000

real  0m0.282s
user  0m0.246s
sys   0m0.035s