Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0

OpenCl #1678

Open Beep6581 opened 9 years ago

Beep6581 commented 9 years ago

Originally reported on Google Code with ID 1694

As I mentioned in my presentation (http://rawtherapee.com/forum/viewtopic.php?f=9&t=4421),
OpenCL is one of my main interests, because it can outperform current multi-core CPU systems
by a factor of 20 or more simply by using the GPU of the graphics card.
I made a simple test by implementing the dcdamping part of RL deconvolution on my GeForce GT 640,
and it took less than 5% of the time my 8-core 4 GHz machine needed for the same work,
but dcdamping was very easy to convert...

My next step will be a GPU version of Gaussian blur that gives the
same results as the Young-van Vliet implementation currently used in RT.
That should also be no problem...

I also plan to create a framework that is usable in a project like RT. It should
perform well on different systems with different GPUs, which have different capabilities
etc.

I already have some ideas about how to design that framework, but because I have only one
machine at the moment (8-core AMD with GeForce GT 640), I would be glad if there were
somebody who also has an OpenCL 1.1 capable GPU (not necessarily different from mine)
to contribute to this work.

It would also be nice if we could make an OpenCL branch in the RT repository, so we could
test without affecting the default branch.

This will be a long process, but it's very interesting and has a lot of potential,
not only to speed up RT but also to broaden your mind. If somebody on the team is interested
in this topic, I offer my assistance.

Ingo

Reported by heckflosse@i-weyrich.de on 2013-01-24 23:20:08

Beep6581 commented 9 years ago
Hi Ingo, this is extremely exciting for this project. I have at my disposal GTX 480
and i7 3930K with 64GB ram and would be glad to participate.

Regarding the new branch, it would be smoother to create it after tif32 merges into
default. Another branch, xmp, was also in progress; maybe Hombre could suggest how
to coordinate this with the least impact.

Reported by michaelezra000 on 2013-01-25 04:37:40

Beep6581 commented 9 years ago
Hi Ingo

I am interested in this topic; for me (and I think for others) it will be a new challenge.

My configuration:
Corei7 - 8M°
GeForce GT 650M

Reported by jdesmis on 2013-01-25 08:07:22

Beep6581 commented 9 years ago
Hi Ingo,

Thanks for all your help in making RT better (and faster).  I continue to have very
little time for coding or testing, but would be happy to answer any questions about
the code that I authored.  

It would be great if AMaZE demosaic and the NR module could be implemented on the GPU.
 I suspect it might be better if AMaZE was restructured at the same time -- quite a
few times in the code there are calculations made and stored in memory as intermediate
results for later use, because I found that it was faster than calculating them as
needed.  I suspect the balance may tip toward calculating when needed on the GPU, where
memory is at more of a premium.

I agree though that the first step is to convert some of the more general purpose tools
like gaussian blur and box blur.

Reported by ejm.60657 on 2013-01-25 15:16:50

Beep6581 commented 9 years ago
Hi Jacques, hi Michael,

I'm very glad about your answers :-) 
I agree with you, Michael, regarding the new branch.
It's also a new challenge for me. I've read a lot of docs. I can recommend the book 'OpenCL
in Action'. The author also provides the examples from his book for free.
Just search for 'manning opencl' and you'll find them.
It's very nice that we have 3 different GPUs.
I'll send you my dcdamping test kernel by mail, so you have something to play with ;-)

Ingo

Reported by heckflosse@i-weyrich.de on 2013-01-25 16:12:43

Beep6581 commented 9 years ago
@comment3: I chose RL deconvolution because I'm also new to OpenCL and implementing this part
in OpenCL is not so difficult. In fact, dcdamping was super easy, but Gauss will be
a bit more work, because the Young-van Vliet implementation can't be used on the GPU as
easily as dcdamping. But I've had a look at the papers on the Young-van Vliet implementation,
and we can simply replace it with a normal Gaussian blur on the GPU.

Ingo

Reported by heckflosse@i-weyrich.de on 2013-01-25 16:22:37

Beep6581 commented 9 years ago
The nice thing about the Young-van Vliet approach is that it's an IIR filter rather
than an FIR filter, so the time cost is independent of the radius of the Gaussian blur.
 Normal implementations with an FIR filter have a time cost that is substantial for
large radius blurs, at least for implementations on standard CPU's, but maybe that
is not an issue for GPU's.  At least, the IIR filter approach is not well adapted to
2d parallelization, since the filter is cumulative as you go down a row/column (ie
the filter on a pixel depends on the result of the calculation for the previous pixel
on a row/column).  One can however clearly parallelize the calculations by doing all
rows/columns in parallel even if within a row/column there is not a parallelization
possible.  Anyway, what the GPU will do is change the point where it is faster to do
the IIR Y-vV filter as opposed to the standard FIR gaussian filter; I will be curious
to see where that point is, and if the FIR approach is always faster except for ridiculously
large blur radii (as for instance might be used in a S/H tool).
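
To make the IIR-versus-FIR trade-off concrete, here is a minimal Python sketch. It is not the Young-van Vliet filter: the recursive part is a simple first-order smoother, chosen only to show the dependency on the previous pixel along a row. All function names are hypothetical and are not taken from the RT source.

```python
import math

def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian taps for an FIR blur: 2*radius + 1 taps."""
    taps = [math.exp(-(k * k) / (2.0 * sigma * sigma))
            for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]

def fir_blur_row(row, kernel):
    """FIR: every output pixel is independent of the others (one work-item per
    pixel is possible), but each costs len(kernel) multiply-adds."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the borders
            acc += w * row[j]
        out.append(acc)
    return out

def iir_smooth_row(row, alpha):
    """IIR (first-order, forward then backward pass) on a non-empty row:
    constant work per pixel for any blur strength, but each value needs the
    previous result, so a single row is processed sequentially."""
    fwd = [row[0]]
    for x in row[1:]:
        fwd.append(alpha * x + (1.0 - alpha) * fwd[-1])
    out = [fwd[-1]]
    for x in reversed(fwd[:-1]):
        out.append(alpha * x + (1.0 - alpha) * out[-1])
    out.reverse()
    return out

def blur_rows_independently(image, blur_row):
    """Rows do not depend on each other, so each row could map to one work-item."""
    return [blur_row(row) for row in image]
```

With a radius of 50 the FIR version performs 101 multiply-adds per pixel, while the recursive version does the same small amount of work per pixel whatever the blur strength. The only parallelism left to the recursive version is across rows (or columns), as described above.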

Reported by ejm.60657 on 2013-01-25 17:09:38

Beep6581 commented 9 years ago
I'll try both :-)

Reported by heckflosse@i-weyrich.de on 2013-01-25 19:05:59

Beep6581 commented 9 years ago
Me again. I forgot to mention the following docs, which are very interesting for OpenCL newbies,
especially for Nvidia GPUs:

http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf
http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf

Would be glad, if there's somebody with an AMD GPU, at least for testing...

Ingo

Reported by heckflosse@i-weyrich.de on 2013-01-25 23:20:14

Beep6581 commented 9 years ago
Ingo, there is a code sample for recursiveGaussian
"This sample implements a Gaussian blur using Deriche's recursive method. The advantage
of this method is that the execution time is independent of the filter width."

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\3_Imaging\recursiveGaussian
could this be used for our purposes?

There is also an interesting example of denoising on GPU:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\3_Imaging\imageDenoising
This uses K Nearest Neighbors and Non Local Means filters, including a fast version of
the latter.
On Windows the sample application can be run from
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\imageDenoising.exe

PDF paper: 
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\3_Imaging\imageDenoising\doc\imageDenoising.pdf
 (by Alexander Kharlamov, Victor Podlozhnyuk)
Emil, do you know how this method would compare to RT's?

Reported by michaelezra000 on 2013-01-27 16:52:26

Beep6581 commented 9 years ago
Hi Michael, these are CUDA samples, but they can be ported to OpenCL.

But I think it would be better to make OpenCL versions of the things we currently
use in RT, because if, for example, we use Deriche's recursive method instead of Y-vV, the
results could be different, and we would waste a lot of time explaining why they are
different or making them equal. Another point is that, when using methods already
used in RT, we know that they're under the 'GNU General Public License'.

Btw: I'm currently working on an OpenCL version of the Y-vV implementation.

Did you get my example of dcdamping working?

Ingo

Reported by heckflosse@i-weyrich.de on 2013-01-27 18:07:35

Beep6581 commented 9 years ago
I was able to build the exe (thanks for the email!)
Here is the output:

  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.1 CUDA 4.2.1
  NAME = NVIDIA CUDA
  VENDOR = NVIDIA Corporation
  EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing
cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
  -- 12126736 --
  DEVICE_NAME = GeForce GTX 480
  DEVICE_VENDOR = NVIDIA Corporation
  DEVICE_VENDOR_ID = 4318
  DEVICE_VERSION = OpenCL 1.1 CUDA
  DRIVER_VERSION = 306.94
  DEVICE_MAX_COMPUTE_UNITS = 15
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 1401
  DEVICE_GLOBAL_MEM_SIZE = 1610612736
Error: Failed to create compute program!

Reported by michaelezra000 on 2013-01-27 20:42:51

Beep6581 commented 9 years ago
Hi Michael,

it seems that dcdamping.cl is not in the same directory as main.cc.

Ingo

Reported by heckflosse@i-weyrich.de on 2013-01-27 21:29:41

Beep6581 commented 9 years ago
After I placed dcdamping.cl next to the exe file (not the main.cc), I got this output:

  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.1 CUDA 4.2.1
  NAME = NVIDIA CUDA
  VENDOR = NVIDIA Corporation
  EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing
cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
  -- 12126736 --
  DEVICE_NAME = GeForce GTX 480
  DEVICE_VENDOR = NVIDIA Corporation
  DEVICE_VENDOR_ID = 4318
  DEVICE_VERSION = OpenCL 1.1 CUDA
  DRIVER_VERSION = 306.94
  DEVICE_MAX_COMPUTE_UNITS = 15
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 1401
  DEVICE_GLOBAL_MEM_SIZE = 1610612736
100 iterations took 180 clicks (0.180000 seconds).
Result[0] = 0.000000
nans are handled well

Reported by michaelezra000 on 2013-01-27 21:52:29

Beep6581 commented 9 years ago
Congrats, Michael! Your GPU is pretty fast. That means your GPU takes 0.18 seconds
to run 100 dcdamping iterations on the Lab part of a 12 MPix image. My GPU takes 0.827
seconds for the same amount of work and my 8-core CPU takes about 20 seconds.
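
The figures quoted in this thread can be turned into speedup factors with a few lines of Python. This is only arithmetic on the reported numbers; the CPU time is the approximate value given above, so the factors are rough.

```python
# Timings quoted in this thread for 100 dcdamping iterations, in seconds.
timings = {
    "8-core CPU (approx.)": 20.0,
    "GeForce GT 640": 0.827,
    "GeForce GTX 480": 0.18,
}

def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s

cpu = timings["8-core CPU (approx.)"]
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s, "
          f"{speedup(cpu, seconds):.1f}x vs CPU, "
          f"{100.0 * seconds / cpu:.1f}% of CPU time")
```

The GT 640 comes out at roughly 24 times faster than the CPU (about 4% of the CPU time), which matches the "less than 5%" and "factor 20 or more" statements in the opening post. The GTX 480 is more than 100 times faster.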

The last two lines were only for testing NaN handling. On a GPU, you can divide
by zero without getting exceptions and handle this later by checking whether
the result is NaN, which I did to get better performance in the dcdamping kernel.
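
The "divide first, check afterwards" idea can be sketched in Python as follows. Python raises an exception on float division by zero, so the hypothetical helper `ieee_div` emulates the IEEE-754 behaviour that a GPU kernel gets without any extra code. The per-element formula is a stand-in and not the actual dcdamping kernel. The sketch also treats infinities as invalid, which is an assumption beyond what is said above.

```python
import math

def ieee_div(a, b):
    """Emulate IEEE-754 float division (hypothetical helper for this sketch)."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan                       # 0/0 gives NaN
    # x/0 gives an infinity whose sign depends on both operands
    return math.copysign(math.inf, a) * math.copysign(1.0, b)

def ratio_guard_first(a, b):
    """CPU style: test the divisor before dividing."""
    if b == 0.0:
        return 0.0
    return a / b - 1.0

def ratio_check_after(a, b):
    """GPU style: divide unconditionally, then replace non-finite results."""
    r = ieee_div(a, b) - 1.0
    return r if math.isfinite(r) else 0.0
```

Both functions return the same values. The second one keeps the arithmetic free of an early exit, which tends to suit hardware that runs many work-items in lockstep.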

Glad, you're on board :-)

Ingo

Reported by heckflosse@i-weyrich.de on 2013-01-27 23:08:59

Beep6581 commented 9 years ago
Can I run this test too?

Reported by natureh.510 on 2013-01-27 23:44:24

Beep6581 commented 9 years ago
wow... I don't think I will need to upgrade my CPU for a good few years, GPU has to
be the future:)

I put together a small archive with all components to get it rolling a bit more easily on
Windows:
http://filebin.net/upload/7du3q58iw5
instructions are in _Readme.txt file - there is source code from Ingo, a batch file
to build, dependency libopencl.a and the compiled executable.

Reported by michaelezra000 on 2013-01-28 01:40:05

Beep6581 commented 9 years ago
Thanks a lot, Michael!

Reported by heckflosse@i-weyrich.de on 2013-01-28 18:58:38

Beep6581 commented 9 years ago
And here is the result for my laptop.

  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.1 CUDA 4.2.1
  NAME = NVIDIA CUDA
  VENDOR = NVIDIA Corporation
  EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing
cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
  -- 12126736 --
  DEVICE_NAME = GeForce GT 630M
  DEVICE_VENDOR = NVIDIA Corporation
  DEVICE_VENDOR_ID = 4318
  DEVICE_VERSION = OpenCL 1.1 CUDA
  DRIVER_VERSION = 306.94
  DEVICE_MAX_COMPUTE_UNITS = 2
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 1344
  DEVICE_GLOBAL_MEM_SIZE = 2147483648
100 iterations took 875 clicks (0.875000 seconds).
Result[0] = 0.000000
nans are handled well

System: Win7 / Core i7 / 12Gb RAM

Reported by natureh.510 on 2013-01-29 19:59:56

Beep6581 commented 9 years ago
  ...\OpenCL_test01\bin\Debug>OpenCL_test01x64.exe
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.2 AMD-APP (938.2)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.
  EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
  -- 12126736 --
  DEVICE_NAME = Juniper
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VENDOR_ID = 4098
  DEVICE_VERSION = OpenCL 1.2 AMD-APP (938.2)
  DRIVER_VERSION = CAL 1.4.1741 (VM)
  DEVICE_MAX_COMPUTE_UNITS = 9
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 4
  DEVICE_MAX_CLOCK_FREQUENCY = 800
  DEVICE_GLOBAL_MEM_SIZE = 1073741824
100 iterations took 7356 clicks (7.356000 seconds).
Result[0] = 0.000000
nans are handled well
Ran the attached exe on: Win7 / Core i7 / 16GB RAM / Radeon HD6700

Reported by 121nilsson on 2013-01-29 20:25:57

Beep6581 commented 9 years ago
Fine, there seems to be some interest in this thing :-)
@ #19: Your Nvidia 630M seems to perform almost like my GT 640.
@ #20: Your Juniper device doesn't seem to perform well with my dcdamping kernel, but
it's still 3 times faster than my CPU. But your device has 'CL_DEVICE_PREFERRED_VECTOR_WIDTH_float
= 4', so I'll make a sample which should run faster on your machine. In fact, I'm very
glad to have somebody on board with a different GPU. Most of us have Nvidia, I think.

Reported by heckflosse@i-weyrich.de on 2013-01-29 22:13:38

Beep6581 commented 9 years ago
I apologize for such basic questions, but this looks very interesting and I'd like to
give it a try, if at all possible, with my very minimal understanding of RT:
- Will this test run on an AMD (SAPPHIRE HD 5450 1GB DDR3) graphics card?
- Is it possible to do this on Linux?

Reported by msth67 on 2013-01-30 14:36:54

Beep6581 commented 9 years ago
#23: You don't need to have any understanding of RT for this simple test.
OpenCL should work with your card, but I don't know what to install, because I have
an Nvidia card. Maybe 121nilsson can help you?

Reported by heckflosse@i-weyrich.de on 2013-01-30 16:01:21

Beep6581 commented 9 years ago
#23. Download the zip-file in post #17 from michael and unzip and run the .exe ...OpenCL_test01\bin\Debug>OpenCL_test01x64.exe
from cmd prompt.

Reported by 121nilsson on 2013-01-30 16:35:37

Beep6581 commented 9 years ago
#25: I guess, he is using Linux...

Reported by heckflosse@i-weyrich.de on 2013-01-30 18:29:33

Beep6581 commented 9 years ago
Correct, I'm on Linux and the test contained in the zip file doesn't look like it will
run on my computer.

Reported by msth67 on 2013-01-30 18:56:33

Beep6581 commented 9 years ago
The test also contains the sources. But I don't know how to set up your system to compile
the example on Linux. Sorry :-(

Reported by heckflosse@i-weyrich.de on 2013-01-30 23:03:16

Beep6581 commented 9 years ago
Sorry, was tired... I only use/know Windows and Visual Studio.

Reported by 121nilsson on 2013-02-01 19:07:44

Beep6581 commented 9 years ago
Hi Ingo,

Sorry if I disturb you but I would like to know how to choose a GPU which will be fast
with OpenCL accelerated RT.

Is this benchmark indicative of the performance OpenCL RT will have on various GPUs?
http://www.tomshardware.com/charts/2012-vga-gpgpu/15-GPGPU-Luxmark,2971.html

Reported by iliasgiarimis on 2013-02-02 00:53:42

Beep6581 commented 9 years ago
I don't know, because I have only one graphics card. But you'll have a lot of time to
decide which one is best, because we're just at the beginning of OpenCL RT.

Reported by heckflosse@i-weyrich.de on 2013-02-02 13:36:06

Beep6581 commented 9 years ago
Some of the most important, but generally overlooked, aspects of a video card are the
memory bandwidth and the number of CUDA cores. In the comparison (comment 30) Nvidia
cards seem highly misrepresented. I purchased a GTX-480 card in summer 2012, and at that time
it was the most cost-efficient GPU. For $200 I had 50-60% of a $1000 card, I think
it was the then just-released GTX-690. I decided not to purchase the highest-end GPU
cards as they depreciate fast, but still have only a few GB of memory. I hope that
in a few years we will see GPUs with 32GB+ RAM and software to utilize it.

GTX-480 is the hottest component in my PC. The i7 is water cooled, but it never gets hot.
GTX-480 is hot even when idle, but it can take it. There are GPU water cooling options
as well. Newer GPU models have smaller transistor size and are more power efficient.
GPU is evolving fast and my guess is there should be another product now which is in
a similar cost/performance niche vs the high $ end as GTX-480 in summer 2012.

Reported by michaelezra000 on 2013-02-02 18:13:34

Beep6581 commented 9 years ago
AMD 7850 owner here. No exe available at the link in #17 anymore. Let me know how I
can help.

Reported by hasse.schougaard on 2013-02-13 22:10:17

Beep6581 commented 9 years ago
I ran the test on my budget computer.
Win7 64-bit / AMD Athlon II 4-core / 4GB RAM / Radeon HD6450

C:\compile\OpenCL>C:\compile\OpenCL\bin\Debug\opencl_x64.exe
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.2 AMD-APP (1084.4)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.
  EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing
  -- 12126736 --
  DEVICE_NAME = Caicos
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VENDOR_ID = 4098
  DEVICE_VERSION = OpenCL 1.2 AMD-APP (1084.4)
  DRIVER_VERSION = 1084.4 (VM)
  DEVICE_MAX_COMPUTE_UNITS = 2
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 4
  DEVICE_MAX_CLOCK_FREQUENCY = 625
  DEVICE_GLOBAL_MEM_SIZE = 536870912
100 iterations took 33136 clicks (33.136002 seconds).
Result[0] = 0.000000
nans are handled well

The results are not very impressive compared with the high end intel i7 machines.

Reported by willemtermeer on 2013-02-16 20:26:47

Beep6581 commented 9 years ago
That's in fact really bad, especially compared to #13.
But if you are able to compile the source, I'll make a version which uses VECTOR_WIDTH
4. It would be very interesting to see whether that speeds things up on your system. One of my main
reasons for making this little test was to gather information on whether it's worth
considering the vector capabilities of the GPU.
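
As a rough picture of what a vector width of 4 means for the work layout, the Python sketch below groups a flat buffer into 4-wide tuples, the way a float4 kernel would see it, and counts the work-items needed. It assumes that the number printed between the dashes in the test output (12126736) is the element count. The function names are hypothetical.

```python
def pack_float4(values, pad=0.0):
    """Group a flat list into 4-wide tuples, padding the tail, so that one
    work-item can handle four values at once."""
    padded = list(values) + [pad] * (-len(values) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]

def work_items_needed(n_values, vector_width):
    """Number of work-items when each one processes vector_width values."""
    return -(-n_values // vector_width)  # ceiling division

n = 12126736  # assumed element count, taken from the test output
print(work_items_needed(n, 1), work_items_needed(n, 4))
```

Whether the 4-wide layout is actually faster depends on the device, which is what the test is meant to find out.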

Ingo

Reported by heckflosse@i-weyrich.de on 2013-02-16 21:01:30

Beep6581 commented 9 years ago
Hi
Does the program compile and run in Linux? I ask before trying because getting cuda-5.0
requires a bit of tinkering in Gentoo 64 stable (4.2.9 is the latest stable) so I'd
like to know whether it should work before I make a mess.

Reported by entertheyoni on 2013-02-17 00:02:02

Beep6581 commented 9 years ago
It should compile on Linux, because it doesn't use CUDA, only OpenCL. But I don't know
which dependencies are needed on Linux...

Reported by heckflosse@i-weyrich.de on 2013-02-17 00:13:44

Beep6581 commented 9 years ago
@36: Ingo, I can compile the source. Just let me know where i can download it.

Reported by willemtermeer on 2013-02-17 19:03:07

Beep6581 commented 9 years ago
@39: Here it is: http://www.i-weyrich.de/CL/Test_vec4.zip

Reported by heckflosse@i-weyrich.de on 2013-02-17 20:59:31

Beep6581 commented 9 years ago
I had no luck getting it to compile in Linux.
Does this make any sense to you?

Reported by entertheyoni on 2013-02-18 13:43:43


Beep6581 commented 9 years ago
I was able to compile it in Linux - but there is some error... This is the output:

  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.1 CUDA 4.2.1
  NAME = NVIDIA CUDA
  VENDOR = NVIDIA Corporation
  EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll 
  -- 12126736 --
  DEVICE_NAME = GeForce GTX 560 Ti
  DEVICE_VENDOR = NVIDIA Corporation
  DEVICE_VENDOR_ID = 4318
  DEVICE_VERSION = OpenCL 1.1 CUDA
  DRIVER_VERSION = 304.64
  DEVICE_MAX_COMPUTE_UNITS = 8
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 1800
  DEVICE_GLOBAL_MEM_SIZE = 1073283072
Error: Failed to build program executable with error -42!
ptxas application ptx input, line 75; error   : Call has wrong number of parameters
ptxas fatal   : Ptx assembly aborted due to error

Reported by zdenek.materna on 2013-02-18 14:03:32

Beep6581 commented 9 years ago
@42: please download again. I updated the file.

Reported by heckflosse@i-weyrich.de on 2013-02-18 14:21:52

Beep6581 commented 9 years ago
Ok, thanks. Now I'm able to get a result:

100 iterations took 580000 clicks (0.580000 seconds).
Result[0] = 0.000000
nans are handled well

Reported by zdenek.materna on 2013-02-18 14:28:35

Beep6581 commented 9 years ago
DrSlony, could you try with g++ instead of gcc?

Reported by heckflosse@i-weyrich.de on 2013-02-18 14:31:29

Beep6581 commented 9 years ago
@44: And it'll get faster on your GPU with vector size 1: http://www.i-weyrich.de/CL/Test_scalar.zip

Reported by heckflosse@i-weyrich.de on 2013-02-18 14:36:20

Beep6581 commented 9 years ago
Nice improvement :)

100 iterations took 160000 clicks (0.160000 seconds).
Result[0] = 0.000000

Reported by zdenek.materna on 2013-02-18 15:29:07

Beep6581 commented 9 years ago
Could some of the Linux users please post directions on how to compile this test, if
it's not too much of a hassle? Thanks

Reported by msth67 on 2013-02-18 18:05:30

Beep6581 commented 9 years ago
Laptop, Intel(R) Core(TM) i7 CPU Q 820  @ 1.73GHz, GeForce GTX 285M, nvidia-drivers
313.18
http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-285m/specifications

./test_vec4 
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.1 CUDA 4.2.1
  NAME = NVIDIA CUDA
  VENDOR = NVIDIA Corporation
  EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll 
  -- 12126736 --
  DEVICE_NAME = GeForce GTX 285M
  DEVICE_VENDOR = NVIDIA Corporation
  DEVICE_VENDOR_ID = 4318
  DEVICE_VERSION = OpenCL 1.0 CUDA
  DRIVER_VERSION = 313.18
  DEVICE_MAX_COMPUTE_UNITS = 16
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 1500
  DEVICE_GLOBAL_MEM_SIZE = 1073414144
100 iterations took 49460000 clicks (49.459999 seconds).
Result[0] = 0.000000
nans are handled well

./test_scalar 
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.1 CUDA 4.2.1
  NAME = NVIDIA CUDA
  VENDOR = NVIDIA Corporation
  EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll 
  -- 12126736 --
  DEVICE_NAME = GeForce GTX 285M
  DEVICE_VENDOR = NVIDIA Corporation
  DEVICE_VENDOR_ID = 4318
  DEVICE_VERSION = OpenCL 1.0 CUDA
  DRIVER_VERSION = 313.18
  DEVICE_MAX_COMPUTE_UNITS = 16
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 1500
  DEVICE_GLOBAL_MEM_SIZE = 1073414144
100 iterations took 4960000 clicks (4.960000 seconds).
Result[0] = 0.000000
nans are handled well

Reported by entertheyoni on 2013-02-18 22:44:56

Beep6581 commented 9 years ago
@ #48:
g++ -O3 -lOpenCL -fopenmp -c main.cc -o main.o
g++ -o test main.o /usr/lib64/libOpenCL.so /usr/lib64/gcc/x86_64-pc-linux-gnu/4.6.3/libgomp.a

You can give the executable ("test") a more meaningful name in the second line, so
that it's easy to distinguish between different methods, such as scalar and vec4.

@ #45:
Thank you Ingo! That did it.

Reported by entertheyoni on 2013-02-18 22:47:12

Beep6581 commented 9 years ago
Sheesh - sorry about #34... can't read :/

7850

  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.2 AMD-APP (1084.4)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.
  EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
  -- 12126736 --
  DEVICE_NAME = Pitcairn
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VENDOR_ID = 4098
  DEVICE_VERSION = OpenCL 1.2 AMD-APP (1084.4)
  DRIVER_VERSION = 1084.4 (VM)
  DEVICE_MAX_COMPUTE_UNITS = 16
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 920
  DEVICE_GLOBAL_MEM_SIZE = 2147483648
100 iterations took 232 clicks (0.232000 seconds).
Result[0] = 0.000000
nans are handled well

Reported by hasse.schougaard on 2013-02-19 05:07:40

Beep6581 commented 9 years ago
Hi Ingo, this is not OpenCL, but an interesting example of demosaic using CUDA:
http://www-hagi.ist.osaka-u.ac.jp/research/papers/201207_i_faruqi_hpcs.pdf

Reported by michaelezra000 on 2013-03-04 12:40:36

Beep6581 commented 9 years ago

Hi
Seems like I am a bit late to the party :)

I made one minor tweak to the source as my GPU seemed a bit quick for the clock ticks
(I got 0.000000 and 0.020000 at different rounds) so I just increased the loop to 1000.
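
The outputs in this thread suggest that the test divides a tick count by the platform's tick rate: 180 clicks are reported as 0.18 s on Windows and 580000 clicks as 0.58 s on Linux. When a run finishes within a few ticks, the result is mostly noise. A minimal Python sketch of the usual remedy, repeating the work until the measured span is long enough, could look like this. The function names and thresholds are hypothetical.

```python
import time

def ticks_to_seconds(ticks, ticks_per_second):
    """Convert a tick count to seconds for a given tick rate."""
    return ticks / ticks_per_second

def time_per_iteration(work, min_total_s=0.2, start_iterations=100):
    """Repeat `work` until the measured span is long enough for the timer to
    resolve, then return (seconds per iteration, iterations used)."""
    iterations = start_iterations
    while True:
        start = time.perf_counter()
        for _ in range(iterations):
            work()
        elapsed = time.perf_counter() - start
        if elapsed >= min_total_s:
            return elapsed / iterations, iterations
        iterations *= 10  # too fast to measure reliably: do ten times more
```

For a GPU measurement, `work` would also have to wait for the device to finish (for example with `clFinish` in OpenCL host code). Otherwise only the time needed to queue the kernels is measured.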

I also did one extra tweak to the vec4 and placed that in local memory instead, which
also led me to change the total size, as it has to be evenly divisible by the work-group
size (made it slightly larger, from 12126736 to 12126976). That made an improvement from
0.15s to 0.1s for 1000 iterations, but that might not be entirely accurate given how the
time is measured here.
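
The size change can be reproduced by rounding the element count up to the next multiple of the group size. The Python sketch below assumes a group size of 256. That value is not stated in this comment, but it is the power of two for which rounding up 12126736 gives exactly 12126976. The function names are hypothetical.

```python
def round_up_to_multiple(n, multiple):
    """Smallest value >= n that is evenly divisible by `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

def in_bounds(global_id, n_real):
    """Guard a kernel would apply so padded work-items skip out-of-range elements."""
    return global_id < n_real

n_real = 12126736
n_padded = round_up_to_multiple(n_real, 256)  # 256 is an assumed group size
print(n_padded, n_padded - n_real)
```

The alternative to enlarging the buffers is to keep them at their real size and have the extra work-items return early, using a bounds check like `in_bounds`.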

The files for the local memory change are attached.

The reason for my local memory change/test was because I found this website: http://www.evl.uic.edu/kreda/gpu/image-convolution/
that has a study of an image convolution kernel and its optimization.

And my test results were as follows:

Test scalar:
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 1.2 AMD-APP (1113.2)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.
  EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
  -- 12126736 --
  DEVICE_NAME = Tahiti
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VENDOR_ID = 4098
  DEVICE_VERSION = OpenCL 1.2 AMD-APP (1113.2)
  DRIVER_VERSION = 1113.2 (VM)
  DEVICE_MAX_COMPUTE_UNITS = 28
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
  DEVICE_MAX_CLOCK_FREQUENCY = 900
  DEVICE_GLOBAL_MEM_SIZE = 2147483648
1000 iterations took 360000 clicks (0.360000 seconds).
Result[0] = 0.000000
nans are handled well

Test_vec4:
1000 iterations took 150000 clicks (0.150000 seconds).

Test_vec4_local:
1000 iterations took 100000 clicks (0.100000 seconds).
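
The devices in this thread that ran both variants all report a preferred float vector width of 1, yet they disagree on which variant is faster. One way a framework could deal with this is to time the variants on the actual device and keep the quickest. The Python sketch below does only the selection step, using timings reported in this thread. The names are hypothetical and this is not an existing RT mechanism.

```python
# Seconds reported in this thread. Iteration counts differ between devices,
# so only variants measured on the same device are compared.
reported = {
    "GeForce GTX 560 Ti (100 iterations)": {"vec4": 0.58, "scalar": 0.16},
    "GeForce GTX 285M (100 iterations)": {"vec4": 49.46, "scalar": 4.96},
    "Tahiti (1000 iterations)": {"scalar": 0.36, "vec4": 0.15, "vec4_local": 0.10},
}

def fastest_variant(timings):
    """Return (name, seconds) of the quickest kernel variant for one device."""
    return min(timings.items(), key=lambda item: item[1])

for device, timings in reported.items():
    name, seconds = fastest_variant(timings)
    print(f"{device}: {name} ({seconds:.2f} s)")
```

On the two Nvidia devices the scalar kernel wins by a wide margin, while on Tahiti the vec4 variants are faster.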

So, count me in for the tests of OpenCL in RT, and if I get around to doing some more testing
in OpenCL and looking into the RT code I'll try to help out some more as well!

/Reine

Reported by reine.edvardsson on 2013-07-08 20:24:44