exaloop / codon

A high-performance, zero-overhead, extensible Python compiler using LLVM
https://docs.exaloop.io/codon

@par(gpu=True) syntax does not support (grid=?, block=?) options #466

Closed marioroy closed 1 year ago

marioroy commented 1 year ago

I compared @par(gpu=True) against @gpu.kernel. Both demonstrations live in the examples folder of the following repository.

https://github.com/marioroy/mce-sandbox

NVIDIA GeForce RTX 3070 Results:  @par(gpu=True)  @gpu.kernel       NPrimes
  pgpusieve 1e+9  . . . . . . . . . .   0.191s       0.047s      50,847,534
  pgpusieve 1e+10 . . . . . . . . . .   2.644s       0.553s     455,052,511
  pgpusieve 1e+11 . . . . . . . . . .  27.962s       9.469s   4,118,054,813
  pgpusieve 1e+12 1.1e+12 . . . . . .  32.379s      15.183s   3,612,791,400
  pgpusieve 1e+13 1.01e+13  . . . . .  33.743s      17.968s   3,340,141,707
  pgpusieve 1e+14 1.001e+14 . . . . .  31.274s      21.981s   3,102,063,927
  pgpusieve 1e+15 1.0001e+15  . . . .  30.814s      24.549s   2,895,317,534
  pgpusieve 1e+16 1.00001e+16 . . . .  35.010s      27.558s   2,714,336,584
  pgpusieve 1e+17 1.000001e+17  . . .  57.573s      38.371s   2,554,712,095
  pgpusieve 1e+18 1.0000001e+18 . . . 124.274s      68.059s   2,412,731,214
                pgpusieve.codon ----------|            |
                 gpusieve.codon -----------------------|

The pgpusieve.codon example configures the same step size as gpusieve.codon. The only difference is that @par(gpu=True) offers no way to tell Codon the desired grid and block sizes.

# TODO: The 'bsize' and 'gsize' values are not used.
# @par(gpu=True) syntax does not support (grid=?, block=?) options.

@par(gpu=True, collapse=1)
for n in range(num_segments):
    ...

marioroy commented 1 year ago

I did some testing by adding the following lines at the top of gpu.codon. This file is located under the Codon installation path, e.g. .../install/lib/codon/stdlib/. It turns out that the fixed block size is why @par(gpu=True) runs slower than @gpu.kernel.

_GRID_SIZE, _BLOCK_SIZE = 0, 0

def set_grid_size(size):
    global _GRID_SIZE
    _GRID_SIZE = size

def set_block_size(size):
    global _BLOCK_SIZE
    _BLOCK_SIZE = size

In the same file, I changed two lines inside the outline template function (near the end of the file).

def _gpu_loop_outline_template(start, stop, args, instance: Static[int]):
    ...
    MAX_BLOCK = _BLOCK_SIZE if _BLOCK_SIZE else 1024
    MAX_GRID = _GRID_SIZE if _GRID_SIZE else 2147483647
    ...

Finally, I added two lines to my application. Be sure to import gpu.

# Names must match the setters defined at the top of gpu.codon.
gpu.set_grid_size(gsize)
gpu.set_block_size(bsize)
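
To illustrate why the block cap matters, here is a minimal plain-Python sketch of how launch dimensions could be derived from the two caps. It is only a model under stated assumptions: the function `launch_dims` and its parameters are hypothetical, the elided body of the outline template is not reproduced, and the only things taken from the patch are the two default values and the "0 means use the default" pattern.

```python
# Default caps, taken from the two constants shown in the patch.
DEFAULT_MAX_BLOCK = 1024
DEFAULT_MAX_GRID = 2147483647


def launch_dims(n, block_size=0, grid_size=0):
    """Hypothetical model: choose (grid, block) for a loop of n iterations.

    A block_size or grid_size of 0 means "use the default cap", mirroring
    the `_BLOCK_SIZE if _BLOCK_SIZE else 1024` pattern in the patch.
    """
    max_block = block_size if block_size else DEFAULT_MAX_BLOCK
    max_grid = grid_size if grid_size else DEFAULT_MAX_GRID
    if n <= 0:
        return (0, 0)
    if n > max_block * max_grid:
        raise ValueError("loop exceeds the iteration limit for these caps")
    # Assumption: fill each block up to the cap, then use as many
    # blocks as needed to cover all n iterations (ceiling division).
    block = min(n, max_block)
    grid = (n + block - 1) // block
    return (grid, block)


print(launch_dims(10000))                  # default cap of 1024 threads per block
print(launch_dims(10000, block_size=256))  # smaller blocks, more of them
```

In this model, lowering the block cap from 1024 to 256 for 10,000 iterations changes the launch from 10 blocks of 1024 threads to 40 blocks of 256 threads. Whether a smaller block is faster depends on the kernel and the GPU, so the values are worth tuning per application.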

Before and after results:

$ pgpusieve 1e9
    before  0.191s
     after  0.054s

$ pgpusieve 1e10
    before  2.644s
     after  0.668s

$ pgpusieve 1e11
    before 27.962s
     after 11.047s
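
For a quick read of these numbers, the short Python sketch below computes the speedup factor for each run. The timings are copied from the results above; the variable names are only illustrative.

```python
# Timings in seconds (before, after), copied from the results above.
timings = {
    "1e9": (0.191, 0.054),
    "1e10": (2.644, 0.668),
    "1e11": (27.962, 11.047),
}

# Speedup is the ratio of the old running time to the new one.
speedups = {limit: before / after for limit, (before, after) in timings.items()}

for limit, factor in speedups.items():
    print(f"pgpusieve {limit}: {factor:.2f}x faster")
```

The factors come out at roughly 3.5x, 4.0x, and 2.5x, so the patched version lands close to the @gpu.kernel timings without matching them.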

marioroy commented 1 year ago

This is not a bug. I now understand why the @par(gpu=True) syntax may run slower.