HigherOrderCO / Bend

A massively parallel, high-level programming language
https://higherorderco.com
Apache License 2.0

GPU slower than Multi-threaded CPU on WSL / Windows 11 RTX 2050 #433

Open · hoopdad opened this issue 4 months ago

hoopdad commented 4 months ago

**Describe the bug**
I ran the examples successfully with an 11.x version of CUDA yesterday. I was unable to upgrade to 12.4 or 12.5.

I recorded a screen-capture video of running the same code three ways, along with the steps I took, in my repo: bend_demo_parallelism.mp4

WSL2 on Windows 11; I followed the install instructions as documented in the readme here: https://github.com/hoopdad/bendlang

Error: performance on the GPU is slower than on the multi-threaded CPU.

**To Reproduce**
Please see the README at the link above. The video also shows the Windows resource monitor, where the NVIDIA RAM gets loaded up, so you can see it is hitting the GPU.

**Expected behavior**
I expected the GPU to be faster than the single- or multi-threaded CPU.


hoopdad commented 4 months ago

UPDATE: I got it running per the steps outlined in my readme. I was missing the "nvidia-" prefix and the nsight* packages. But my question still remains: why is the GPU slower than the multi-threaded CPU? Results are also dumped into the readme.

OJarrisonn commented 4 months ago

Isn't it more a WSL2 problem than a Bend problem?

I mean, I had a lot of issues when trying to use my RTX 3060 Ti for AI under WSL2, due to virtualization.

hoopdad commented 4 months ago

@OJarrisonn It may be, but I am not sure where the problem lies. I'm an "it's not you, it's me" kind of guy by default, so I'm assuming I have something misconfigured or that my hardware is just too low-end.

I have it running on CUDA 12.5 now, with very similar results. I updated my readme with procedures and results (see the link above).

Is the RTX 2050 capable enough to be used for computations like this? Do I need to set some env variables to tune it? For example, does using 7+ GB of shared memory (on top of the 4 GB on the card) take away from performance?

I was reading about the various architectures and ran `nvcc --list-gpu-arch` to see what I have, then set the architecture to just my highest one, though I'm not sure highest = best:

```sh
export CUDA_ARCHITECTURES="compute_90"
```

Note: I just ran my example code after setting the above, with no difference in runtime.

kings177 commented 4 months ago

@hoopdad can you run everything with the `-s` flag (it prints ITRS/TIME/MIPS stats) and comment the results?

hoopdad commented 4 months ago

Here it is:

```
mike@bluewarrior:~/bend-lang$ ./bendrun.sh
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
starting CPU single thread run
Result: 16515072
- ITRS: 1259339749
- TIME: 42.14s
- MIPS: 29.88

time to run: 42163
starting CPU multi thread run
Result: 16515072
- ITRS: 1259339749
- TIME: 45.99s
- MIPS: 27.38

time to run: 46007
starting GPU multi thread run
Result: 16515072
- ITRS: 1259323365
- LEAK: 59858943
- TIME: 26.65s
- MIPS: 47.25

time to run: 29051
```

Here's the new shell script that I ran, FYI:

```sh
nvcc --version

# `date +%s%3N` prints epoch time in milliseconds, so each "time to run"
# below is in ms (e.g. 42163 ms matches the 42.14s TIME reported by -s).

echo "starting CPU single thread run"   # bend run: sequential Rust interpreter
export START_NANO=`date +%s%3N`
bend run -s main.bend
export END_NANO=`date +%s%3N`
echo "time to run: $(($END_NANO-$START_NANO))"

echo "starting CPU multi thread run"    # bend run-c: multi-threaded C backend
export START_NANO=`date +%s%3N`
bend run-c -s main.bend
export END_NANO=`date +%s%3N`
echo "time to run: $(($END_NANO-$START_NANO))"

echo "starting GPU multi thread run"    # bend run-cu: CUDA backend
export START_NANO=`date +%s%3N`
bend run-cu -s main.bend
export END_NANO=`date +%s%3N`
echo "time to run: $(($END_NANO-$START_NANO))"
```

kings177 commented 4 months ago

From what I can see, it took roughly half the time to finish on the GPU compared to the CPU, no?

Now, what is really concerning to me here is that the single-core Rust `run` is faster than the multi-core gcc `run-c`?! That doesn't make sense; it could be a bug.

hoopdad commented 4 months ago


That run does show the GPU as fastest, agreed. Prior runs showed the multi-threaded CPU as much faster, single-threaded as slowest, and the GPU in the middle. I'll try some more runs and work out some statistics. Maybe with my successful upgrade to CUDA 12.5 it is working as expected, and my first run(s) had something running concurrently. It's just my personal laptop, and Windows, so...

Is the program I used a good one for a basic benchmark? I got it from the readme, but I also see many in the examples folder. I'll try to run them by end of day today so we can hopefully close this issue.
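
(For reference, the usual parallelism smoke test is a tree-shaped recursion along these lines; this is a sketch adapted from the repo's guide, so the exact example file may differ:)

```py
# Tree-shaped sum: the two recursive calls are independent,
# so the runtime can evaluate them in parallel.
def sum(depth, x):
  switch depth:
    case 0:
      return x                     # leaf: contribute one value
    case _:
      # in the `_` arm, `depth-1` is bound to the predecessor of `depth`
      fst = sum(depth-1, x*2+0)    # sums the left half
      snd = sum(depth-1, x*2+1)    # sums the right half
      return fst + snd

def main():
  return sum(30, 0)
```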

hoopdad commented 4 months ago

I ran the example above 100 times each for single-threaded CPU, multi-threaded CPU, and GPU. The raw results are attached; same code/methodology as before.

100runs.xlsx

Averages over the 100 runs (seconds):

| Run | Average time (s) |
| --- | --- |
| CPU single thread | 40.9678 |
| CPU multi thread | 23.0878 |
| GPU multi thread | 25.8416 |

ElijahBare commented 4 months ago

I am having a similar issue with this code:

```py
# Collatz conjecture search in Bend
# Author - Elijah Bare

search_until = 10000

def collatz(n, count):
  if n == 1:
    return count + 1
  else:
    if n % 2 == 0:
      return collatz(n / 2, count + 1)
    else:
      return collatz(3 * n + 1, count + 1)

def loop(highscore, high_start_val, i):
  iters = collatz(i, 0)
  if i < search_until:
    if iters > highscore:
      return loop(iters, i, i + 1)
    else:
      return loop(highscore, high_start_val, i + 1)
  else:
    return [highscore, high_start_val, i + 1]

def main():
  return loop(0, 0, 1)  # start with 0 as best scores
```

It runs well using the `run-c` command, but when I run it with CUDA it takes much longer, which doesn't make sense given it's an M1 compared to a 4090 (on a VPS, obviously).

Is anyone else having this issue?
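
(A likely factor here, offered as a sketch rather than a diagnosis: `loop` is inherently sequential, because the call for `i + 1` can only start after the chain for the current `i` has been measured and compared, so there is nothing for the GPU's thousands of threads to do and the CUDA overhead dominates. A tree-shaped split of the search range exposes parallelism. The `search` and `max2` helpers below are hypothetical, and they only return the best chain length, not the starting value:)

```py
# Hypothetical parallel variant: split the range [lo, hi) as a tree, so the
# two recursive `search` calls are independent and can run in parallel.
def collatz(n, count):
  if n == 1:
    return count + 1
  else:
    if n % 2 == 0:
      return collatz(n / 2, count + 1)
    else:
      return collatz(3 * n + 1, count + 1)

def max2(a, b):
  if a > b:
    return a
  else:
    return b

def search(lo, hi):
  if hi - lo < 2:
    return collatz(lo, 0)       # single starting value: measure its chain
  else:
    mid = lo + (hi - lo) / 2    # integer midpoint splits the range
    left = search(lo, mid)      # the two halves share no data,
    right = search(mid, hi)     # so they can be evaluated in parallel
    return max2(left, right)

def main():
  return search(1, 10001)       # same range as search_until = 10000
```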

TomasMonkevic commented 1 month ago

I have the same problem running the parallel_sum.bend benchmark on Windows 11 through WSL 2.0.

Results: (screenshot)

Hardware: (screenshot)