Nanosim-LIG / opencl-ruby

OpenCL bindings for Ruby
BSD 2-Clause "Simplified" License

Performance Benchmarks #3

Closed KCErb closed 9 years ago

KCErb commented 9 years ago

Hi again,

I wanted to do a quick comparison between MATLAB and OpenCL. My first result is that MATLAB is 20 times faster at executing the example addition operation in the README. That can't be right, can it?

The Details

1) My OpenCL code gist

I have a late 2013 MacBook Pro, so it has three devices:

* 2.3 GHz Intel Core i7
* Intel Iris Pro
* NVIDIA GeForce GT 750M 2048 MB

Here are the benchmark results:

               user     system      total        real
i7        22.650000   2.110000  24.760000 ( 22.617686)
IrisPro   21.610000   0.560000  22.170000 ( 22.202700)
GeForce   21.830000   0.570000  22.400000 ( 22.734518)

2) My MATLAB code gist

The average running time on MATLAB was 1.26 seconds.

My Guesses

My first guess is that, since I'm new to both C and OpenCL, I'm failing to accurately translate the OpenCL code into MATLAB.

My second guess is that NArray is less efficient than MATLAB's implementation.

Suggestions?

kpouget commented 9 years ago

Hello,

My first result is that MATLAB is 20 times faster at executing the example addition operation in the README. That can't be right, can it?

You're comparing GPU and CPU code, so the C/OCL/Ruby vs. MATLAB comparison isn't a fair one*!

You can't use such trivial code to compare CPU and GPU processors; what happens internally to run code on the GPU is far too involved for that: kernel code compilation, kernel code and memory buffer transfers, remote (i.e., on the GPU) execution, and so on.

You need to compare OpenCL with MATLAB GPU code, or write a more complex program able to exploit the thousands of cores of your GPU, to see the results you're looking for!

(*I hope I didn't misread your MATLAB code, but I don't see any instruction related to the GPU.)
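To make that overhead visible, one option is to time each phase separately with Ruby's Benchmark. The sketch below is only an illustration, not the code from the gist: it follows the README example's API, uses a kernel with no scalar argument to keep things simple, and the phase split is approximate since most OpenCL calls are asynchronous until queue.finish.

require 'opencl_ruby_ffi'
require 'narray_ffi'
require 'benchmark'

n = 2**26
source = <<EOF
__kernel void addition(__global const float *src, __global float *dst) {
  size_t i = get_global_id(0);
  dst[i] = src[i] + 5.0f;
}
EOF

device  = OpenCL::platforms.first.devices.first
context = OpenCL::create_context(device)
queue   = context.create_command_queue(device)
a_in    = NArray.sfloat(n).random(1.0)
a_out   = NArray.sfloat(n)
b_in    = context.create_buffer(a_in.size * a_in.element_size)
b_out   = context.create_buffer(a_out.size * a_out.element_size)
prog    = context.create_program_with_source(source)

Benchmark.bm(8) do |bm|
  # kernel code compilation
  bm.report("build")  { prog.build }
  # host -> device buffer transfer
  bm.report("write")  { queue.enqueue_write_buffer(b_in, a_in); queue.finish }
  # remote execution on the device
  bm.report("kernel") { prog.addition(queue, [n], b_in, b_out); queue.finish }
  # device -> host buffer transfer
  bm.report("read")   { queue.enqueue_read_buffer(b_out, a_out); queue.finish }
end

On a trivial element-wise addition like this, the build and transfer phases will typically dominate the kernel phase, which is exactly the point above.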


Kerilk commented 9 years ago

Hello,

Just to complete Kevin's answer:

The only timing you want to look at (at first) is the timing of the kernel run, which in this case is:

event = prog.addition(queue, [n_times], float, b_in, b_out, :local_work_size => [128])

You can read it once the queue has finished:

queue.finish
puts "#{(event.profiling_command_end - event.profiling_command_start)} ns"
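One detail to check: for those profiling counters to be populated, the command queue must have been created with profiling enabled. If it isn't already in your gist, something along these lines is needed (this line is my assumption about your setup, not part of the snippet above):

queue = context.create_command_queue(device, :properties => OpenCL::CommandQueue::PROFILING_ENABLE)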

To be fair, in MATLAB you should only time the computation (not the random array initialization):

n_times = 2^26;
a_in = rand(1, n_times);
t1 = cputime;
a_in(:) = a_in(:) + 5.0;
cputime - t1

Brice

PS: Ruby gives you tools to avoid copy-pasting. Find attached a slightly reworked benchmark sample (though it will still give you unfair results).
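The idea is roughly the following skeleton, looping over every platform and device instead of duplicating the benchmark block per device (a sketch only; the setup and kernel call go where the comment is):

require 'opencl_ruby_ffi'

OpenCL::platforms.each { |platform|
  platform.devices.each { |device|
    context = OpenCL::create_context(device)
    queue = context.create_command_queue(device)
    # build the program, create the buffers, then benchmark the same
    # kernel call once per device instead of copy-pasting the block
  }
}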


KCErb commented 9 years ago

Thanks for the feedback, all. I'll implement the things you suggested and get back with results. I can't do the GPU part until I get up to the school today, since I don't have a MATLAB GPU license but my institution does.

Before I get up there though, I have some follow up questions:

Find attached a slightly reworked benchmark sample

Thanks for this Brice, but I don't see an attachment. I'm viewing this conversation on GitHub, and I don't think it supports attachments like that; I checked the email version as well and still see no attachment. Some alternatives would be to email me directly, or to paste your code into http://gist.github.com and share the link with me, either via email or by replying to this comment.

You're comparing GPU and CPU code, so the C/OCL/Ruby vs. MATLAB comparison isn't a fair one*!

Thanks Kevin, I'll definitely get back to you with GPU results ASAP (along with Brice's suggestions of where to put my timing events), but I do have a question about comparing CPU to CPU.

I thought that with my code, my first run is on the CPU.

If it is on the CPU, and the CPU doesn't need the extra preparation work that the GPU does (as MATLAB seems to demonstrate), then why does my OpenCL CPU code take 22 seconds to run whereas MATLAB, from start to finish, takes just a couple of seconds?

I think I can understand what Brice means by saying I should compare the kernel time to MATLAB's, instead of the total user time. It also makes sense to me that trivial calculations like this are not likely to give a fair comparison. Can you help give me an idea of a sample that would? For example, when I do 2^16 as my vector length, it looks like the OCL code is comparable to (and maybe a little quicker than) MATLAB, but it's a tough call since they both just plain do it so fast!

I only ramped up the vector length to 2^26 so that I could actually start using the CPU a bit on the MATLAB side.

Would it perhaps be a closer comparison to do a vector of 2^20 length 1,000,000 times? I'm just trying to look for ways to demonstrate to someone who's never heard of OpenCL that it can put jobs on CPUs or GPUs and will maximize use of the device more or less automatically (because of OpenCL's concept of work items).

Perhaps my question just exposes my ignorance :)

Perhaps a little context would be helpful :)

I'm a graduate student in physics with an emphasis on magnetic resonance imaging (MRI). I've been doing a lot of work recently on image reconstruction, and I'd really like to stop using MATLAB for this kind of work since I've recently been introduced to Ruby and prefer it 100 times over MATLAB. My PI is open to the idea of my PhD thesis being centered on building Ruby tools for image reconstruction if I can demonstrate its utility.

So I'm kind of trying to strike a balance here. If I can put together a basic proof of concept demonstrating that I can write Ruby code that does a basic math function on my computer and on our server GPUs faster than, or at least comparably to, MATLAB, then I'll probably be given a green light and will work exclusively in this field for the remainder of my PhD work: 2-3 years.

So if I knew more, I could build a better proof of concept; and with a better proof of concept, I'd have more time to dedicate to learning more.

That's not to say I expect anything from you guys! You've already been really wonderful in helping someone completely new to C, OpenCL, HPC, everything. I just thought it might be useful / interesting for you to know where I'm coming from and where I'm trying to go.

Thanks, KC

Kerilk commented 9 years ago

On 05/12/2014 15:20, KC Erb wrote:


Find attached a slightly reworked benchmark sample

Thanks for this Brice, but I don't see an attachment. [...]

here is the code:

https://gist.github.com/Kerilk/2fa146ab7d1135416f12

You're comparing GPU and CPU code, so the C/OCL/Ruby vs. MATLAB comparison isn't a fair one*!

Thanks Kevin, I'll definitely get back to you with GPU results ASAP (along with Brice's suggestions of where to put my timing events) but I do have a question about comparing CPU to CPU.

I thought that with my code, my first run is on the CPU.

If it is on the CPU, and the CPU doesn't need the extra preparation work that the GPU does (as MATLAB seems to demonstrate), then why does my OpenCL CPU code take 22 seconds to run

Because your OpenCL implementation on the Mac is pretty slow, I don't know why. On my laptop:

videau@nedni:/tmp$ ruby opencl_addition.rb
                                                 user     system      total        real
Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz:   0.630000   0.020000   0.650000 (  0.598329)

This is running the modified script. Size is 2**20.

whereas MATLAB, from start to finish, takes just a couple of seconds?

I think I can understand what Brice means by saying I should compare the kernel time to MATLAB's, instead of the total user time. It also makes sense to me that trivial calculations like this are not likely to give a fair comparison. Can you help give me an idea of a sample that would? For example, when I do 2^16 as my vector length, it looks like the OCL code is comparable to (and maybe a little quicker than) MATLAB, but it's a tough call since they both just plain do it so fast!

The timing method I showed gives results in the nanosecond range (though the precision will of course depend on your hardware counter accuracy). For kernels on more than a few thousand elements it will be sufficient:

videau@nedni:~/dev/opencl-ruby/opencl_ruby_ffi/test$ ruby small_test.rb
96653 ns
Success!

This is for size 2**16.

I only ramped up the vector length to 2^26 so that I could actually start using the CPU a bit on the MATLAB side.

Same problem here: you need precise timings. Maybe there are modules in MATLAB that give more accurate timings. If not, maybe you can create one using clock_gettime and CLOCK_REALTIME.
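On the Ruby side, if you want a host-side wall-clock measurement to cross-check the OpenCL event counters, Process.clock_gettime gives nanosecond-resolution timestamps since Ruby 2.1. A small sketch, reusing the names from the kernel call earlier in this thread:

t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
prog.addition(queue, [n_times], float, b_in, b_out, :local_work_size => [128])
queue.finish
t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
puts "#{t1 - t0} ns (host side, includes enqueue and synchronization overhead)"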

Would it perhaps be a closer comparison to do a vector of 2^20 length 1,000,000 times? I'm just trying to look for ways to demonstrate to someone who's never heard of OpenCL that it can put jobs on CPUs or GPUs and will maximize use of the device more or less automatically (because of OpenCL's concept of work items).

There is some research going on trying to push further on the automatic side of things: https://runtime.bordeaux.inria.fr/StarPU/doc/html/SOCLOpenclExtensions.html

Brice


KCErb commented 9 years ago

Wow, SOCL is a great project, exactly the sort of thing I'm looking for. I'll have to learn more!

I'm getting ready to put some more appropriate benchmarks together on CPU and GPU in the next couple of hours, so I'll get back to you on that, but I thought I'd quickly respond to one thing:

Because your OpenCL implementation on the Mac is pretty slow, I don't know why.

I'm not sure it's the implementation; here are my results (using your modified benchmark code, thanks!) for 2**20:

                                                  user     system      total        real
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz:    0.350000   0.040000   0.390000 (  0.349098)
Iris Pro:                                     0.360000   0.010000   0.370000 (  0.366742)
GeForce GT 750M:                              0.350000   0.010000   0.360000 (  0.375415)

I just find that 2**21 is twice as slow

                                                  user     system      total        real
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz:    0.730000   0.060000   0.790000 (  0.729077)
Iris Pro:                                     0.700000   0.020000   0.720000 (  0.729866)
GeForce GT 750M:                              0.710000   0.030000   0.740000 (  0.750479)

and 2**22 twice as slow again

                                                  user     system      total        real
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz:    1.440000   0.140000   1.580000 (  1.441325)
Iris Pro:                                     1.380000   0.030000   1.410000 (  1.421135)
GeForce GT 750M:                              1.380000   0.050000   1.430000 (  1.455044)

so that's why my 2**26 is taking 22+ seconds.

Of course, as you pointed out before, I'm not really doing it right, since I should be comparing the kernels to each other, not the whole programs. My question here is more about why the program takes so long. On a vector of 2**26, is it spending 22 seconds setting things up? Is there any way to cut that time down, or is the example so bad that it's really not worth talking about / optimizing?

Mmmm, I find this stuff so exciting! I really hope I can get this project approved and dig in! Thanks again :smile:

KCErb commented 9 years ago

Well, I've hit a little hiccup with using the GPU: my school's machine that has the GPU license doesn't have a Ruby interpreter and I don't have admin rights, so I'll need to work on getting Ruby onto that machine.

But until that gets worked out, I'll at least report my findings pitting my first device (the CPU) against MATLAB (also CPU) in the way Brice suggested.

With this code (gist: https://gist.github.com/KCErb/158e7d4e433b710dd52e) I get this output:

Warning OpenCL 1.2 loader detected!
404745 ns

and that's about average, 0.4 s

With this MATLAB code (gist: https://gist.github.com/KCErb/aa0377a61f6f4b4e0270):

I get an average of ~400 ns. So it seems that something is still amiss, most likely in my understanding . . .

Kerilk commented 9 years ago

On 05/12/2014 22:04, KC Erb wrote:


and that's about average, 0.4 s

didn't you mean 0.0004 s?


KCErb commented 9 years ago

Oh wow, Brice! I can't believe I got nano and micro confused :blush: That means my MATLAB code was running at about the same speed: 400 _micro_seconds, not nano!

Whew, thanks! I'll probably be able to work on the GPU stuff tomorrow. After I've posted / discussed those benchmarks I'll close this out.

KCErb commented 9 years ago

OK I've finally got access to our GPU system and got it all running.

I was very pleased with how easy it was to run the addition kernel on our GPUs once I convinced our systems admin to let me put RVM in my home folder!

The results of the benchmarking are as follows:

On my laptop's CPU, MATLAB and OpenCL clock in at about the same speed. Averaged over 35 runs I get:

              MATLAB       OpenCL
           473.06 µs    386.75 µs

On my laptop's GPU I get an average of 42.64 µs.

On the Tesla C2070 in my institution's compute server I get an average of 10.58 µs.

Getting timing out of MATLAB is hard, though; the result is that MATLAB using the Tesla GPU comes in around 100 µs, but it's hard to tell.

The trouble with MATLAB is that it doesn't offer high-resolution CPU timing.

Its only timing functions are cputime and tic/toc. cputime measures CPU time only, meaning it's unaffected by whatever else the machine is doing, but its resolution maxes out at around a few milliseconds.

tic and toc use wall-clock time and are very high resolution.

MATLAB introduced a gputimeit function in R2014, but my institution doesn't have any GPU R2014 licenses.

Summary: as expected, there's no reason Ruby on OpenCL can't be very fast. It's hard to really gauge what's going on with MATLAB, so for now I'll call it a tie and safely assume that OpenCL can be faster than MATLAB if used right.

Thanks for the input and help!

PS. I just got this project halfway through the approval process last Friday so things are on track for me to really dig in next year!

Kerilk commented 9 years ago

Great news!

Your results are sensible, so I think you got everything right. Don't hesitate to ask if you encounter further problems.

Brice
