hoopdad opened 4 months ago
UPDATE: I got it running per the steps outlined in my readme. I was missing the "nvidia-" prefix and the nsight* packages. But my question still remains: why is the GPU slower than the single/multi-threaded CPU? Results are also dumped into the readme.
Isn't it more a WSL2 problem than a Bend problem? I mean, I had a lot of issues when trying to use my RTX 3060 Ti for AI under WSL2 due to virtualization.
@OJarrisonn It may be, but I am not sure where the problem lies. I'm an "it's not you, it's me" kind of guy by default, so I'm assuming I have something misconfigured or my hardware is just plain too low-end.
I have it running on CUDA 12.5 now, with very similar results. I updated my readme with procedures and results (see above for the link to it).
Is the RTX 2050 capable enough to be used for computations like this? Do I need to set some env variables to tune it? For example, does using 7+ GB of shared memory (on top of the 4 GB on the card) take away from the performance?
I was reading about various architectures and ran `nvcc --list-gpu-arch` to see what I have, then set it to just my highest architecture, though I'm not sure highest = best:

```sh
export CUDA_ARCHITECTURES="compute_90"
```

Note: I just ran my example code after running the above, with no difference in time.
@hoopdad can you run everything with the `-s` flag and comment the results?
Here it is:
```
mike@bluewarrior:~/bend-lang$ ./bendrun.sh
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
starting CPU single thread run
Result: 16515072
- ITRS: 1259339749
- TIME: 42.14s
- MIPS: 29.88
time to run: 42163
starting CPU multi thread run
Result: 16515072
- ITRS: 1259339749
- TIME: 45.99s
- MIPS: 27.38
time to run: 46007
starting GPU multi thread run
Result: 16515072
- ITRS: 1259323365
- LEAK: 59858943
- TIME: 26.65s
- MIPS: 47.25
time to run: 29051
```
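As an aside, the MIPS line in the `-s` output looks to be just ITRS divided by TIME, in millions of interactions per second. A quick check in Python against the single-thread numbers above:

```python
# Numbers copied from the single-thread CPU run above
itrs = 1_259_339_749  # ITRS
time_s = 42.14        # TIME (seconds)
mips = itrs / time_s / 1e6
print(f"{mips:.2f}")  # matches the reported MIPS of 29.88
```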
Here's the new shell script that I ran, FYI:
```sh
nvcc --version

echo "starting CPU single thread run"
START_MS=$(date +%s%3N)   # milliseconds since epoch
bend run -s main.bend
END_MS=$(date +%s%3N)
echo "time to run: $((END_MS - START_MS))"

echo "starting CPU multi thread run"
START_MS=$(date +%s%3N)
bend run-c -s main.bend
END_MS=$(date +%s%3N)
echo "time to run: $((END_MS - START_MS))"

echo "starting GPU multi thread run"
START_MS=$(date +%s%3N)
bend run-cu -s main.bend
END_MS=$(date +%s%3N)
echo "time to run: $((END_MS - START_MS))"
```
From what I can see, it took half the time to finish on the GPU compared to the CPU, no?

Now, what is really concerning to me here is that the single-core Rust run (`run`) is faster than the multi-core gcc run (`run-c`)?! That doesn't make sense; it could be a bug.
That run does show GPU as the fastest, agreed. Prior runs showed the multi-threaded CPU as much faster, single-threaded as the slowest, and GPU in the middle. I'll try some more runs and work up some statistics. Maybe with my successful upgrade to CUDA 12.5 it is working as expected, and my first run(s) had something running concurrently. It's just my personal laptop, and Windows, so...
Is the program I used a good one for a basic benchmark? I got it from the readme, but I also see many in the examples folder. I'll try to run them by end of day today so we can hopefully close this issue.
I ran the example above 100 times each for single-threaded CPU, multi-threaded CPU, and GPU. The raw results are attached. Same code/methodology as before.
Averages over the 100 runs:

| Run | Avg time (s) |
| --- | --- |
| CPU single thread | 40.9678 |
| CPU multi thread | 23.0878 |
| GPU multi thread | 25.8416 |
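For batches of runs like this, a small timing harness saves hand-collecting the numbers. A minimal sketch, with the `bend` command line shown only as a hypothetical usage (it assumes `bend` is on PATH and `main.bend` exists):

```python
import subprocess
import time

def time_command(cmd, runs=3):
    """Run `cmd` repeatedly, returning per-run wall-clock times in milliseconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append((time.perf_counter() - start) * 1000.0)
    return times

# Hypothetical usage:
# times = time_command(["bend", "run-c", "-s", "main.bend"], runs=100)
# print(sum(times) / len(times))
```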
I am having a similar issue with this code:

```
# Collatz conjecture search in Bend
# Author - Elijah Bare

search_until = 10000

def collatz(n, count):
  if n == 1:
    return count + 1
  else:
    if n % 2 == 0:
      return collatz(n / 2, count + 1)
    else:
      return collatz(3 * n + 1, count + 1)

def loop(highscore, high_start_val, i):
  iters = collatz(i, 0)
  if i < search_until:
    if iters > highscore:
      return loop(iters, i, i + 1)
    else:
      return loop(highscore, high_start_val, i + 1)
  else:
    return [highscore, high_start_val, i + 1]

def main():
  return loop(0, 0, 1) # start with 0 as best scores
```

It runs well using the `run-c` command, but when I run it with CUDA it takes much longer, which doesn't make sense given it's an M1 compared to a 4090 (on a VPS, obviously). Is anyone else having this issue?
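If I'm reading the program right, this workload may not be able to benefit from `run-cu` at all: both the inner `collatz` chain and the outer `loop` are tail recursions where each call depends on the previous result, so there seems to be nothing for the GPU to parallelize and its overhead dominates. A rough Python translation of the same search, written as explicit loops to make the sequential dependency visible (and to sidestep Python's recursion limit):

```python
def collatz_len(n):
    """Length of the Collatz sequence from n down to 1 (counting both endpoints)."""
    count = 1
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        count += 1
    return count

def search(limit):
    """Mirror the Bend loop: best sequence length and its starting value for i < limit."""
    best_len, best_start = 0, 0
    for i in range(1, limit):
        length = collatz_len(i)
        if length > best_len:
            best_len, best_start = length, i
    return [best_len, best_start, limit + 1]

print(search(10000))
```

Each iteration of the outer loop here is independent, so a divide-and-conquer formulation over the range of starting values (rather than a tail-recursive fold) would be the shape that parallel backends can exploit.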
I have the same problem running the parallel_sum.bend benchmark on Windows 11 through WSL 2.
**Describe the bug**
I ran the examples successfully with an 11.x version of CUDA yesterday. I was unable to upgrade to 12.4 or 12.5.
I created and shared a screen-capture video of running the same code 3 ways, plus the steps I took, in my repo: bend_demo_parallelism.mp4

WSL2 on Windows 11; I followed the install instructions as documented in the readme here: https://github.com/hoopdad/bendlang
**Errors**
Performance on the GPU is slower than on the multi-threaded CPU.
**To Reproduce**
Please see the README at the link above. The video also shows the Windows resource monitor, where the NVIDIA RAM gets loaded up, so you know it is hitting the GPU.
**Expected behavior**
I expected the GPU to be faster than the single- or multi-threaded CPU.