Open Beep6581 opened 9 years ago
Hi Ingo, this is extremely exciting for this project. I have a GTX 480 and an i7 3930K
with 64 GB RAM at my disposal and would be glad to participate.
Regarding the new branch, it would be smoother to create it after tif32 merges into
default. Another branch, xmp, was also in progress; maybe Hombre could suggest how
to coordinate this with the least impact.
Reported by michaelezra000
on 2013-01-25 04:37:40
Hi Ingo
I am interested in this topic; for me (and I think for others) it will be a new challenge.
My configuration:
Core i7 - 8 MB
GeForce GT 650M
Reported by jdesmis
on 2013-01-25 08:07:22
Hi Ingo,
Thanks for all your help in making RT better (and faster). I continue to have very
little time for coding or testing, but would be happy to answer any questions about
the code that I authored.
It would be great if AMaZE demosaic and the NR module could be implemented on the GPU.
I suspect it might be better if AMaZE was restructured at the same time -- quite a
few times in the code there are calculations made and stored in memory as intermediate
results for later use, because I found that it was faster than calculating them as
needed. I suspect the balance may tip toward calculating when needed on the GPU, where
memory is at more of a premium.
I agree though that the first step is to convert some of the more general-purpose tools
like Gaussian blur and box blur.
Reported by ejm.60657
on 2013-01-25 15:16:50
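The trade-off described in the comment above (storing intermediate results for later use versus recalculating them when needed) can be sketched on the CPU roughly as follows. This is not AMaZE code: the `gradient` helper and both function names are hypothetical, and the arithmetic is chosen only so that the two structures can be compared on the same input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical intermediate quantity: absolute difference to the right neighbour.
inline float gradient(const std::vector<float>& px, std::size_t i) {
    return std::fabs(px[i + 1] - px[i]);
}

// Variant A: compute each intermediate once, store it, read it back later.
// This costs an extra buffer and the memory traffic that goes with it.
std::vector<float> smooth_stored(const std::vector<float>& px) {
    assert(px.size() >= 3);
    const std::size_t n = px.size();
    std::vector<float> grad(n - 1);
    for (std::size_t i = 0; i + 1 < n; ++i) grad[i] = gradient(px, i);

    std::vector<float> out(px);  // border pixels are left unchanged
    for (std::size_t i = 1; i + 1 < n; ++i)
        out[i] = px[i] * (1.0f / (1.0f + grad[i - 1] + grad[i]));
    return out;
}

// Variant B: no intermediate buffer; each gradient is recomputed where it is used.
// More arithmetic, less memory, which is the direction suggested for the GPU.
std::vector<float> smooth_recomputed(const std::vector<float>& px) {
    assert(px.size() >= 3);
    const std::size_t n = px.size();
    std::vector<float> out(px);  // border pixels are left unchanged
    for (std::size_t i = 1; i + 1 < n; ++i)
        out[i] = px[i] * (1.0f / (1.0f + gradient(px, i - 1) + gradient(px, i)));
    return out;
}
```

Which variant wins on a given device would have to be measured; the sketch only shows that the restructuring does not change the result.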
Hi Jacques, hi Michael,
I'm very glad about your answers :-)
I agree with you, Michael, regarding the new branch.
It's also a new challenge for me. I've read a lot of docs. I can recommend the book 'OpenCL
in Action'. The author also provides the examples from his book free for everyone.
Just search for 'manning opencl' and you'll find them.
It's very nice that we have 3 different GPUs.
I'll send you my dcdamping test kernel by email, so you have something to play with ;-)
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-01-25 16:12:43
@comment3: I chose rl-deconv because I'm also new to OpenCL and implementing this part
in OpenCL is not so difficult. In fact, dcdamping was super easy, but gauss will be
a bit more work, because the Young-van Vliet implementation can't be used on the GPU as
easily as dcdamping. But I've had a look at the papers on the Young-van Vliet implementation,
and we can simply replace it with a normal Gaussian blur on the GPU.
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-01-25 16:22:37
The nice thing about the Young-van Vliet approach is that it's an IIR filter rather
than an FIR filter, so the time cost is independent of the radius of the Gaussian blur.
Normal implementations with an FIR filter have a time cost that is substantial for
large radius blurs, at least for implementations on standard CPUs, but maybe that
is not an issue for GPUs. At least, the IIR filter approach is not well adapted to
2D parallelization, since the filter is cumulative as you go down a row/column (i.e.
the filter on a pixel depends on the result of the calculation for the previous pixel
on a row/column). One can however clearly parallelize the calculations by doing all
rows/columns in parallel even if within a row/column there is not a parallelization
possible. Anyway, what the GPU will do is change the point where it is faster to do
the IIR Y-vV filter as opposed to the standard FIR gaussian filter; I will be curious
to see where that point is, and if the FIR approach is always faster except for ridiculously
large blur radii (as for instance might be used in a S/H tool).
Reported by ejm.60657
on 2013-01-25 17:09:38
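The contrast drawn in the comment above can be sketched in plain C++. The FIR routine is a direct Gaussian convolution whose cost per pixel grows with the radius. The recursive routine is deliberately NOT the Young-van Vliet filter: it is a first-order forward/backward smoother used as a stand-in, with a hypothetical `alpha` parameter, and it only illustrates the two properties mentioned: fixed cost per pixel, and a dependency on the previous output within a row. All function names are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// FIR: convolve one row with a normalised Gaussian kernel.
// Work per pixel is 2 * radius + 1 taps, so it grows with the blur radius.
std::vector<float> fir_gauss_row(const std::vector<float>& row, float sigma, int radius) {
    assert(sigma > 0.0f && radius >= 0 && !row.empty());
    std::vector<float> k(2 * radius + 1);
    float sum = 0.0f;
    for (int t = -radius; t <= radius; ++t) {
        k[t + radius] = std::exp(-0.5f * t * t / (sigma * sigma));
        sum += k[t + radius];
    }
    const int n = static_cast<int>(row.size());
    std::vector<float> out(row.size());
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int t = -radius; t <= radius; ++t) {
            int j = i + t;
            j = j < 0 ? 0 : (j >= n ? n - 1 : j);  // clamp at the borders
            acc += k[t + radius] * row[j];
        }
        out[i] = acc / sum;
    }
    return out;
}

// IIR stand-in (first-order, not Y-vV): forward pass, then backward pass.
// Work per pixel is constant, but out[i] needs out[i - 1], so one row
// cannot be split across parallel workers.
std::vector<float> iir_smooth_row(const std::vector<float>& row, float alpha) {
    assert(alpha > 0.0f && alpha <= 1.0f && !row.empty());
    std::vector<float> out(row);
    for (std::size_t i = 1; i < out.size(); ++i)       // causal pass
        out[i] = alpha * out[i] + (1.0f - alpha) * out[i - 1];
    for (std::size_t i = out.size() - 1; i-- > 0;)     // anti-causal pass
        out[i] = alpha * out[i] + (1.0f - alpha) * out[i + 1];
    return out;
}

// Rows do not depend on each other, so this outer loop is the part that
// could be distributed over threads or GPU work items.
void iir_smooth_image(std::vector<std::vector<float>>& img, float alpha) {
    for (auto& row : img) row = iir_smooth_row(row, alpha);
}
```

Where the crossover between the two approaches lies on a GPU is exactly the open question raised in the comment; the sketch does not answer it.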
I'll try both :-)
Reported by heckflosse@i-weyrich.de
on 2013-01-25 19:05:59
Me again. I forgot to mention the following docs, which are very interesting for OpenCL newbies,
especially for Nvidia GPUs:
http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf
http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf
I'd be glad if there were somebody with an AMD GPU, at least for testing...
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-01-25 23:20:14
Ingo, there is a code sample for recursiveGaussian
"This sample implements a Gaussian blur using Deriche's recursive method. The advantage
of this method is that the execution time is independent of the filter width."
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\3_Imaging\recursiveGaussian
Could this be used for our purposes?
There is also an interesting example of denoising on GPU:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\3_Imaging\imageDenoising
This uses K Nearest Neighbors and Non Local Means filters, including a fast version of
the latter.
On Windows, the sample application can be run from
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release\imageDenoising.exe
PDF paper:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\3_Imaging\imageDenoising\doc\imageDenoising.pdf
(by Alexander Kharlamov, Victor Podlozhnyuk)
Emil, do you know how this method would compare to RT's?
Reported by michaelezra000
on 2013-01-27 16:52:26
Hi Michael, these are CUDA samples, but they can be ported to OpenCL.
But I think it would be better to make OpenCL versions of the things we actually
use in RT, because, e.g., if we use Deriche's recursive method instead of Y-vV, the
results could be different and we would waste a lot of time explaining why they are
different, or making them equal. Another point is that when using methods already
used in RT, we know that they're under the 'GNU General Public License'.
Btw: I'm currently working on an OpenCL version of the Y-vV implementation.
Did you get my dcdamping example working?
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-01-27 18:07:35
I was able to build the exe (thanks for the email!)
Here is the output:
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.1 CUDA 4.2.1
NAME = NVIDIA CUDA
VENDOR = NVIDIA Corporation
EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing
cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
-- 12126736 --
DEVICE_NAME = GeForce GTX 480
DEVICE_VENDOR = NVIDIA Corporation
DEVICE_VENDOR_ID = 4318
DEVICE_VERSION = OpenCL 1.1 CUDA
DRIVER_VERSION = 306.94
DEVICE_MAX_COMPUTE_UNITS = 15
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 1401
DEVICE_GLOBAL_MEM_SIZE = 1610612736
Error: Failed to create compute program!
Reported by michaelezra000
on 2013-01-27 20:42:51
Hi Michael,
it seems that dcdamping.cl is not in the same directory as main.cc.
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-01-27 21:29:41
After I placed dcdamping.cl next to the exe file (not main.cc), I got this output:
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.1 CUDA 4.2.1
NAME = NVIDIA CUDA
VENDOR = NVIDIA Corporation
EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing
cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
-- 12126736 --
DEVICE_NAME = GeForce GTX 480
DEVICE_VENDOR = NVIDIA Corporation
DEVICE_VENDOR_ID = 4318
DEVICE_VERSION = OpenCL 1.1 CUDA
DRIVER_VERSION = 306.94
DEVICE_MAX_COMPUTE_UNITS = 15
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 1401
DEVICE_GLOBAL_MEM_SIZE = 1610612736
100 iterations took 180 clicks (0.180000 seconds).
Result[0] = 0.000000
nans are handled well
Reported by michaelezra000
on 2013-01-27 21:52:29
Congrats, Michael! Your GPU is pretty fast. That means your GPU takes 0.18 seconds
to run 100 dcdamping iterations on the Lab part of a 12 MPix image. My GPU takes 0.827
seconds for the same amount of work and my 8-core CPU takes about 20 seconds.
The last two lines were only for testing NaN handling. On a GPU, you can divide
by zero without getting exceptions and handle this later by checking whether
the result is NaN, which I did to get better performance in the dcdamping kernel.
Glad you're on board :-)
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-01-27 23:08:59
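The "divide first, check for NaN afterwards" idea from the comment above can be sketched on the CPU as follows. The actual dcdamping kernel is not shown in this thread, so the helper name `ratio_or` and the fallback parameter are hypothetical; the sketch assumes IEEE 754 float semantics, under which 0/0 produces NaN rather than an exception.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// The sketch relies on IEEE 754 behaviour for float division.
static_assert(std::numeric_limits<float>::is_iec559, "IEEE 754 floats assumed");

// Hypothetical helper: returns num / den, or `fallback` if the result is NaN.
// There is no test on `den` before the division; the check happens afterwards.
inline float ratio_or(float num, float den, float fallback) {
    assert(!std::isnan(fallback));
    const float r = num / den;            // 0/0 gives NaN, x/0 gives +/-infinity
    return std::isnan(r) ? fallback : r;  // only NaN is replaced, as in the comment
}
```

For example, `ratio_or(0.0f, 0.0f, 7.0f)` returns the fallback, while `ratio_or(1.0f, 0.0f, 7.0f)` returns infinity, because a non-zero numerator divided by zero is not NaN. Whether infinities also need handling depends on the kernel's data and is not stated in the thread.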
Can I run this test too?
Reported by natureh.510
on 2013-01-27 23:44:24
wow... I don't think I will need to upgrade my CPU for a good few years, GPU has to
be the future:)
I put together a small archive with all the components to get it running a bit more easily on
Windows:
http://filebin.net/upload/7du3q58iw5
The instructions are in the _Readme.txt file. It contains the source code from Ingo, a batch file
to build it, the dependency libopencl.a and the compiled executable.
Reported by michaelezra000
on 2013-01-28 01:40:05
Thanks a lot, Michael!
Reported by heckflosse@i-weyrich.de
on 2013-01-28 18:58:38
And here is the result for my laptop.
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.1 CUDA 4.2.1
NAME = NVIDIA CUDA
VENDOR = NVIDIA Corporation
EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing
cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
-- 12126736 --
DEVICE_NAME = GeForce GT 630M
DEVICE_VENDOR = NVIDIA Corporation
DEVICE_VENDOR_ID = 4318
DEVICE_VERSION = OpenCL 1.1 CUDA
DRIVER_VERSION = 306.94
DEVICE_MAX_COMPUTE_UNITS = 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 1344
DEVICE_GLOBAL_MEM_SIZE = 2147483648
100 iterations took 875 clicks (0.875000 seconds).
Result[0] = 0.000000
nans are handled well
System: Win7 / Core i7 / 12 GB RAM
Reported by natureh.510
on 2013-01-29 19:59:56
...\OpenCL_test01\bin\Debug>OpenCL_test01x64.exe
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.2 AMD-APP (938.2)
NAME = AMD Accelerated Parallel Processing
VENDOR = Advanced Micro Devices, Inc.
EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
-- 12126736 --
DEVICE_NAME = Juniper
DEVICE_VENDOR = Advanced Micro Devices, Inc.
DEVICE_VENDOR_ID = 4098
DEVICE_VERSION = OpenCL 1.2 AMD-APP (938.2)
DRIVER_VERSION = CAL 1.4.1741 (VM)
DEVICE_MAX_COMPUTE_UNITS = 9
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 4
DEVICE_MAX_CLOCK_FREQUENCY = 800
DEVICE_GLOBAL_MEM_SIZE = 1073741824
100 iterations took 7356 clicks (7.356000 seconds).
Result[0] = 0.000000
nans are handled well
Ran the attached exe on: Win7 / Core i7 / 16 GB RAM / Radeon HD6700
Reported by 121nilsson
on 2013-01-29 20:25:57
Good, there seems to be some interest in this :-)
@ #19: Your Nvidia 630M seems to perform almost like my GT 640.
@ #20: Your Juniper device doesn't seem to perform well with my dcdamping kernel, but
it's still 3 times faster than my CPU. But your device has 'CL_DEVICE_PREFERRED_VECTOR_WIDTH_float
= 4', so I'll make a sample which should run faster on your machine. In fact, I'm very
glad to have somebody on board with a different GPU. Most of us have Nvidia, I think.
Reported by heckflosse@i-weyrich.de
on 2013-01-29 22:13:38
I apologize for such basic questions, but this looks very interesting and I'd like to
give it a try, if at all possible, with my very minimal understanding of RT:
- Will this test run on an AMD (SAPPHIRE HD 5450 1GB DDR3) graphics card?
- Is it possible to do this on Linux?
Reported by msth67
on 2013-01-30 14:36:54
#23: You don't need any understanding of RT for this simple test.
OpenCL should work with your card, but I don't know what to install, because I have
an Nvidia card. Maybe 121nilsson can help you?
Reported by heckflosse@i-weyrich.de
on 2013-01-30 16:01:21
#23: Download the zip file from Michael's post #17, unzip it, and run the .exe (...OpenCL_test01\bin\Debug>OpenCL_test01x64.exe)
from the command prompt.
Reported by 121nilsson
on 2013-01-30 16:35:37
#25: I guess he is using Linux...
Reported by heckflosse@i-weyrich.de
on 2013-01-30 18:29:33
Correct, I'm on Linux and the test contained in the zip file doesn't look like it will
run on my computer.
Reported by msth67
on 2013-01-30 18:56:33
The test also contains the sources. But I don't know how to set up your system to compile
the example on Linux. Sorry :-(
Reported by heckflosse@i-weyrich.de
on 2013-01-30 23:03:16
Sorry, I was tired... I only use/know Windows and Visual Studio.
Reported by 121nilsson
on 2013-02-01 19:07:44
Hi Ingo,
Sorry to disturb you, but I would like to know how to choose a GPU that will be fast
with OpenCL-accelerated RT.
Is this benchmark indicative of the performance OpenCL RT will have on various GPUs?
http://www.tomshardware.com/charts/2012-vga-gpgpu/15-GPGPU-Luxmark,2971.html
Reported by iliasgiarimis
on 2013-02-02 00:53:42
I don't know, because I have only one graphics card. But you'll have a lot of time to
decide which one is best, because we're just at the beginning of OpenCL RT.
Reported by heckflosse@i-weyrich.de
on 2013-02-02 13:36:06
Some of the most important, but generally overlooked, aspects of a video card are the
memory bandwidth and the number of CUDA cores. In the comparison (comment 30) Nvidia
cards seem highly misrepresented. I purchased a GTX-480 card in summer 2012, and at the time
it was the most cost-efficient GPU. For $200 I had 50-60% of a $1000 card; I think
that was the then just-released GTX-690. I decided not to purchase the highest-end GPU
cards as they depreciate fast but still have only a few GB of memory. I hope that
in a few years we will see GPUs with 32GB+ RAM and software to utilize it.
The GTX-480 is the hottest component in my PC. The i7 is water cooled, but it never gets hot.
The GTX-480 is hot even when idle, but it can take it. There are GPU water cooling options
as well. Newer GPU models have smaller transistor sizes and are more power efficient.
GPUs are evolving fast, and my guess is there should be another product now which is in
a similar cost/performance niche vs the high-$ end as the GTX-480 was in summer 2012.
Reported by michaelezra000
on 2013-02-02 18:13:34
AMD 7850 owner here. No exe available at the link in #17 anymore. Let me know how I
can help.
Reported by hasse.schougaard
on 2013-02-13 22:10:17
I ran the test on my budget computer.
Win7 64-bit / AMD Athlon II 4-core / 4 GB RAM / Radeon HD6450
C:\compile\OpenCL>C:\compile\OpenCL\bin\Debug\opencl_x64.exe
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.2 AMD-APP (1084.4)
NAME = AMD Accelerated Parallel Processing
VENDOR = Advanced Micro Devices, Inc.
EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
-- 12126736 --
DEVICE_NAME = Caicos
DEVICE_VENDOR = Advanced Micro Devices, Inc.
DEVICE_VENDOR_ID = 4098
DEVICE_VERSION = OpenCL 1.2 AMD-APP (1084.4)
DRIVER_VERSION = 1084.4 (VM)
DEVICE_MAX_COMPUTE_UNITS = 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 4
DEVICE_MAX_CLOCK_FREQUENCY = 625
DEVICE_GLOBAL_MEM_SIZE = 536870912
100 iterations took 33136 clicks (33.136002 seconds).
Result[0] = 0.000000
nans are handled well
The results are not very impressive compared with the high-end Intel i7 machines.
Reported by willemtermeer
on 2013-02-16 20:26:47
That's in fact really bad, especially compared to #13.
But if you are able to compile the source, I'll make a version which uses VECTOR_WIDTH
4. It would be very interesting to see whether that speeds things up on your system. One of my main
reasons for making this little test was to gather information on whether it's worth
considering the vector capabilities of the GPU.
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-02-16 21:01:30
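The difference between a scalar variant and a VECTOR_WIDTH 4 variant, as discussed in the comment above, can be illustrated on the CPU. This is only a layout sketch under stated assumptions: `op` is a hypothetical stand-in for the kernel body (the real dcdamping arithmetic is not shown in this thread), and `Float4` mimics an OpenCL `float4` without using any OpenCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element operation standing in for the real kernel body.
inline float op(float x) { return 0.5f * x + 1.0f; }

// Scalar layout: one float per step (one float per work item on a GPU).
void run_scalar(std::vector<float>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = op(data[i]);
}

// 4-wide layout: four consecutive floats are handled together per step.
struct Float4 { float v[4]; };

void run_vec4(std::vector<float>& data) {
    assert(data.size() % 4 == 0);  // padding to a multiple of 4 is left to the caller
    for (std::size_t i = 0; i < data.size(); i += 4) {
        Float4 p = {{data[i], data[i + 1], data[i + 2], data[i + 3]}};
        for (float& x : p.v) x = op(x);
        for (int k = 0; k < 4; ++k) data[i + k] = p.v[k];
    }
}
```

Both functions produce the same values; only the grouping of the work changes. Whether the 4-wide form is faster depends on the device, which is what the test in this thread was meant to find out. The buffer length printed in the outputs, 12126736, is a multiple of 4, so it would need no padding for this layout.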
Hi
Does the program compile and run in Linux? I ask before trying because getting cuda-5.0
requires a bit of tinkering in Gentoo 64 stable (4.2.9 is the latest stable) so I'd
like to know whether it should work before I make a mess.
Reported by entertheyoni
on 2013-02-17 00:02:02
Started
It should compile on Linux, because it doesn't use CUDA, only OpenCL. But I don't know
which dependencies are needed on Linux...
Reported by heckflosse@i-weyrich.de
on 2013-02-17 00:13:44
@36: Ingo, I can compile the source. Just let me know where I can download it.
Reported by willemtermeer
on 2013-02-17 19:03:07
@39: Here it is: http://www.i-weyrich.de/CL/Test_vec4.zip
Reported by heckflosse@i-weyrich.de
on 2013-02-17 20:59:31
I had no luck getting it to compile in Linux.
Does this make any sense to you?
Reported by entertheyoni
on 2013-02-18 13:43:43
I was able to compile it in Linux - but there is some error... This is the output:
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.1 CUDA 4.2.1
NAME = NVIDIA CUDA
VENDOR = NVIDIA Corporation
EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
-- 12126736 --
DEVICE_NAME = GeForce GTX 560 Ti
DEVICE_VENDOR = NVIDIA Corporation
DEVICE_VENDOR_ID = 4318
DEVICE_VERSION = OpenCL 1.1 CUDA
DRIVER_VERSION = 304.64
DEVICE_MAX_COMPUTE_UNITS = 8
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 1800
DEVICE_GLOBAL_MEM_SIZE = 1073283072
Error: Failed to build program executable with error -42!
ptxas application ptx input, line 75; error : Call has wrong number of parameters
ptxas fatal : Ptx assembly aborted due to error
Reported by zdenek.materna
on 2013-02-18 14:03:32
@42: Please download again. I updated the file.
Reported by heckflosse@i-weyrich.de
on 2013-02-18 14:21:52
OK, thanks. Now I'm able to get a result:
100 iterations took 580000 clicks (0.580000 seconds).
Result[0] = 0.000000
nans are handled well
Reported by zdenek.materna
on 2013-02-18 14:28:35
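The "clicks" figure in this Linux output is 1000 times larger per second than in the Windows outputs earlier in the thread (580000 clicks for 0.58 s here, 180 clicks for 0.18 s there). That is consistent with a timer based on `clock()` and `CLOCKS_PER_SEC`, which is 1000 with Microsoft's C runtime and 1000000 on POSIX systems; that the test program works this way is an assumption, since its source is not shown here. A minimal sketch of the conversion, plus a wall-clock alternative using `std::chrono::steady_clock` (the helper name `time_iterations` is hypothetical):

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <functional>

// Convert clock() ticks to seconds; the length of one tick is platform-dependent.
inline double ticks_to_seconds(std::clock_t ticks) {
    return static_cast<double>(ticks) / CLOCKS_PER_SEC;
}

// Hypothetical helper: run `work` `iterations` times and return the elapsed
// wall-clock time in seconds, independent of CLOCKS_PER_SEC.
inline double time_iterations(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

When timing GPU work this way, the command queue would need to be drained (for example with `clFinish`) before the stop time is taken; otherwise only the time to enqueue the kernels is measured.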
DrSlony, could you try with g++ instead of gcc?
Reported by heckflosse@i-weyrich.de
on 2013-02-18 14:31:29
@44: And it should get faster on your GPU with vector size 1: http://www.i-weyrich.de/CL/Test_scalar.zip
Reported by heckflosse@i-weyrich.de
on 2013-02-18 14:36:20
Nice improvement :)
100 iterations took 160000 clicks (0.160000 seconds).
Result[0] = 0.000000
Reported by zdenek.materna
on 2013-02-18 15:29:07
Could some of the Linux users please post directions on how to compile this test, if
it's not too much of a hassle? Thanks
Reported by msth67
on 2013-02-18 18:05:30
Laptop, Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz, GeForce GTX 285M, nvidia-drivers
313.18
http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-285m/specifications
./test_vec4
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.1 CUDA 4.2.1
NAME = NVIDIA CUDA
VENDOR = NVIDIA Corporation
EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
-- 12126736 --
DEVICE_NAME = GeForce GTX 285M
DEVICE_VENDOR = NVIDIA Corporation
DEVICE_VENDOR_ID = 4318
DEVICE_VERSION = OpenCL 1.0 CUDA
DRIVER_VERSION = 313.18
DEVICE_MAX_COMPUTE_UNITS = 16
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 1500
DEVICE_GLOBAL_MEM_SIZE = 1073414144
100 iterations took 49460000 clicks (49.459999 seconds).
Result[0] = 0.000000
nans are handled well
./test_scalar
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.1 CUDA 4.2.1
NAME = NVIDIA CUDA
VENDOR = NVIDIA Corporation
EXTENSIONS = cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options
cl_nv_device_attribute_query cl_nv_pragma_unroll
-- 12126736 --
DEVICE_NAME = GeForce GTX 285M
DEVICE_VENDOR = NVIDIA Corporation
DEVICE_VENDOR_ID = 4318
DEVICE_VERSION = OpenCL 1.0 CUDA
DRIVER_VERSION = 313.18
DEVICE_MAX_COMPUTE_UNITS = 16
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 1500
DEVICE_GLOBAL_MEM_SIZE = 1073414144
100 iterations took 4960000 clicks (4.960000 seconds).
Result[0] = 0.000000
nans are handled well
Reported by entertheyoni
on 2013-02-18 22:44:56
@ #48:
g++ -O3 -lOpenCL -fopenmp -c main.cc -o main.o
g++ -o test main.o /usr/lib64/libOpenCL.so /usr/lib64/gcc/x86_64-pc-linux-gnu/4.6.3/libgomp.a
You can give the executable ("test") a more meaningful name in the second line, so
that it's easy to distinguish between different methods, such as scalar and vec4.
@ #45:
Thank you Ingo! That did it.
Reported by entertheyoni
on 2013-02-18 22:47:12
Sheesh - sorry about #34... can't read :/
7850
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.2 AMD-APP (1084.4)
NAME = AMD Accelerated Parallel Processing
VENDOR = Advanced Micro Devices, Inc.
EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
-- 12126736 --
DEVICE_NAME = Pitcairn
DEVICE_VENDOR = Advanced Micro Devices, Inc.
DEVICE_VENDOR_ID = 4098
DEVICE_VERSION = OpenCL 1.2 AMD-APP (1084.4)
DRIVER_VERSION = 1084.4 (VM)
DEVICE_MAX_COMPUTE_UNITS = 16
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 920
DEVICE_GLOBAL_MEM_SIZE = 2147483648
100 iterations took 232 clicks (0.232000 seconds).
Result[0] = 0.000000
nans are handled well
Reported by hasse.schougaard
on 2013-02-19 05:07:40
Hi Ingo, this is not OpenCL, but an interesting example of demosaic using CUDA:
http://www-hagi.ist.osaka-u.ac.jp/research/papers/201207_i_faruqi_hpcs.pdf
Reported by michaelezra000
on 2013-03-04 12:40:36
Hi
Seems like I am a bit late to the party :)
I made one minor tweak to the source, as my GPU seemed a bit quick for the clock ticks
(I got 0.000000 and 0.020000 in different rounds), so I just increased the loop to 1000.
I also made one extra tweak to the vec4 version and placed that in local memory instead, which
also led me to change the total size, as it has to be evenly divisible by the work-group
size (I made it slightly larger, from 12126736 to 12126976). That gave an improvement from
0.15s to 0.1s for 1000 iterations, but that might not be completely accurate due to the
way time is measured here.
The files for the local memory change are attached.
The reason for my local memory change/test was that I found this website: http://www.evl.uic.edu/kreda/gpu/image-convolution/
which has a study of an image convolution kernel and its optimization.
And my test results were as follows:
Test scalar:
PROFILE = FULL_PROFILE
VERSION = OpenCL 1.2 AMD-APP (1113.2)
NAME = AMD Accelerated Parallel Processing
VENDOR = Advanced Micro Devices, Inc.
EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
-- 12126736 --
DEVICE_NAME = Tahiti
DEVICE_VENDOR = Advanced Micro Devices, Inc.
DEVICE_VENDOR_ID = 4098
DEVICE_VERSION = OpenCL 1.2 AMD-APP (1113.2)
DRIVER_VERSION = 1113.2 (VM)
DEVICE_MAX_COMPUTE_UNITS = 28
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = 3
CL_DEVICE_MAX_WORK_GROUP_SIZE = 3
CL_DEVICE_PREFERRED_VECTOR_WIDTH_float = 1
DEVICE_MAX_CLOCK_FREQUENCY = 900
DEVICE_GLOBAL_MEM_SIZE = 2147483648
1000 iterations took 360000 clicks (0.360000 seconds).
Result[0] = 0.000000
nans are handled well
Test_vec4:
1000 iterations took 150000 clicks (0.150000 seconds).
Test_vec4_local:
1000 iterations took 100000 clicks (0.100000 seconds).
So, count me in for the tests of OpenCL in RT, and if I get around to doing some more testing
in OpenCL and looking into the RT code, I'll try to help out some more as well!
/Reine
Reported by reine.edvardsson
on 2013-07-08 20:24:44
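The size change mentioned in the comment above (12126736 enlarged to 12126976 so that it divides evenly) matches rounding up to the next multiple of 256. The work-group size of 256 is an assumption, since the post does not state it, but it is the value for which the round-up reproduces the quoted number. A minimal sketch with a hypothetical helper `round_up`:

```cpp
#include <cassert>
#include <cstddef>

// Round `n` up to the next multiple of `group` (here: the work-group size).
inline std::size_t round_up(std::size_t n, std::size_t group) {
    assert(group > 0);
    return (n + group - 1) / group * group;
}
```

With these numbers, `round_up(12126736, 256)` gives 12126976, i.e. 240 extra elements. Those extra work items either need padded buffers, as the post seems to describe, or a bounds check in the kernel such as `if (get_global_id(0) < n)`.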
Originally reported on Google Code with ID 1694
Reported by heckflosse@i-weyrich.de
on 2013-01-24 23:20:08