HigherOrderCO / Bend

A massively parallel, high-level programming language
https://higherorderco.com
Apache License 2.0
17.25k stars 426 forks

CUDA vs Bend performance comparison #557

Open deadsoul44 opened 3 months ago

deadsoul44 commented 3 months ago

Hello, thanks for the great work.

The speed-up between CPU and GPU is obviously great, but what about the speed difference between CUDA and Bend for the same algorithm? A comparison using a tensor operation, like those in a neural network library, would be great.

koyomitan3 commented 3 months ago

You can use a snippet from the examples in the official repo's README.md: https://github.com/HigherOrderCO/Bend?tab=readme-ov-file#getting-started


def Sum(start, target):
  if start == target:
    return start
  else:
    return start + Sum(start + 1, target)  

def main():
  return Sum(1, 1_000_000)

Run the code above with bend run-cu file.py
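
One thing worth noting about the snippet above: every call has to wait for the next one to return, so the additions form a single dependency chain. As far as I understand it, Bend gets its parallelism from sub-expressions that do not depend on each other, so a version that splits the range in half should expose far more independent work. Below is a rough sketch of that shape in plain Python rather than Bend; `sum_halves` is a made-up name, and nothing here runs on a GPU. It only illustrates the structure.

```python
def sum_halves(start, target):
    """Sum the integers start..target by splitting the range in two.

    The two recursive calls do not depend on each other, which is the
    property a parallel runtime could exploit. Plain Python still
    evaluates them one after the other.
    """
    if start == target:
        return start
    half = (start + target) // 2
    left = sum_halves(start, half)        # independent of `right`
    right = sum_halves(half + 1, target)  # independent of `left`
    return left + right


if __name__ == "__main__":
    n = 1_000_000
    # Cross-check against the closed form n * (n + 1) / 2.
    assert sum_halves(1, n) == n * (n + 1) // 2
    print(sum_halves(1, n))
```

The recursion depth here is only about log2(N), roughly 20 levels for N = 1_000_000, whereas the linear version nests one call per element.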

And then you can compare that to a similar example in Python using numba (pycuda or JAX would work as well). I am not sure Python CUDA is a fair comparison, but it's, uh... something? I hope @VictorTaelin does not punish me for this, because I have no idea how many threads per block and blocks per grid Bend uses. So adjust the values below to make it a fair comparison.

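import numpy as np
from numba import cuda
import time

@cuda.jit
def parallel_sum(arr, result):
    # One thread per element; every thread atomically adds its element to result[0].
    idx = cuda.grid(1)
    if idx < arr.size:
        cuda.atomic.add(result, 0, arr[idx])

def main():
    N = 1_000_000
    arr = np.arange(1, N + 1, dtype=np.int32)
    # The true sum (500000500000) does not fit in int32, so accumulate in
    # float64, which is exact for integers below 2**53.
    result = np.zeros(1, dtype=np.float64)

    threads_per_block = 256
    blocks_per_grid = (arr.size + (threads_per_block - 1)) // threads_per_block

    # Warm-up launch so that JIT compilation is not included in the timing.
    parallel_sum[blocks_per_grid, threads_per_block](arr, result)
    cuda.synchronize()
    result[0] = 0

    # Passing NumPy arrays directly means host/device copies are timed as well.
    start_time = time.perf_counter()
    parallel_sum[blocks_per_grid, threads_per_block](arr, result)
    cuda.synchronize()
    end_time = time.perf_counter()

    print(f"Sum result: {int(result[0])}")
    print(f"Time taken: {end_time - start_time} seconds")

if __name__ == "__main__":
    main()
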
deadsoul44 commented 3 months ago

Did you run this? What are the results?

developedby commented 3 months ago

I don't have an Nvidia GPU to test, but you can expect Bend to be way way worse than CUDA for trivial arithmetic problems (including all of linear algebra, tensor operations, etc).

We purposefully avoid making benchmarks relative to CUDA because that's not the point of Bend.

koyomitan3 commented 3 months ago

> I don't have an Nvidia GPU to test, but you can expect Bend to be way way worse than CUDA for trivial arithmetic problems (including all of linear algebra, tensor operations, etc).
>
> We purposefully avoid making benchmarks relative to CUDA because that's not the point of Bend.

What is the point of Bend? (No offense.) From what I've seen, a lot of the attraction is that you can seamlessly run on CUDA instead of having to use huge, complicated CUDA libraries. The rest of Bend seems so out of reach (advanced/abstract concepts like affine). Maybe finding out the real point could help me find some inspiration.

@deadsoul44

> Did you run this? What are the results?

For this specific test:

user@DESKTOP-C7548H1:~$ bend run-c x.bend -s
Result: 5908768

user@DESKTOP-C7548H1:~$ bend run-cu x.bend -s
Result: 5908768
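
Both runs print 5908768 rather than the mathematical sum of 1..1_000_000. That value is what you get if the result wraps around at 24 bits, which would fit if Bend's default number type is a 24-bit unsigned integer (that is my assumption, it is not stated in this thread). A quick check in plain Python, where `U24_MODULUS` is just an illustrative name:

```python
N = 1_000_000

# Closed-form sum of 1..N, computed with Python's arbitrary-precision ints.
exact = N * (N + 1) // 2

# Assumption: results wrap modulo 2**24 (24-bit unsigned arithmetic).
U24_MODULUS = 2 ** 24
wrapped = exact % U24_MODULUS

print(exact)    # 500000500000
print(wrapped)  # 5908768
```

So the output above tells you the two backends agree with each other, but it is not the true sum, and it says nothing about speed.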

developedby commented 3 months ago

> What is the point of bend? (No offense) A lot of attraction from what I've seen is the fact that you can seamlessly run with CUDA instead of having to use huge complicated CUDA libraries. The rest of bend seems to be so out of reach (advanced/abstract concepts like affine). Maybe it could help me find some inspiration if I find out the real point

To run general-purpose programs in a massively parallel way by default. The advanced concepts that make this possible (which are not really that complicated or advanced) are not that important to users. CUDA is really good for writing a very specific set of programs, but for everything else it becomes so complicated to write an efficient program that it must either be done by incredible specialists or it's just not done at all.

For the programs that CUDA is really good at (like tensor operations), Bend is not at all a competitor. GPUs and CUDA are designed from the ground up to do those very specific things incredibly efficiently.

PredragN commented 3 months ago

I have an RTX 3070 and CUDA is installed, but bend run-cu still doesn't work.