clMathLibraries / clSPARSE

A software library containing sparse functions written in OpenCL
Apache License 2.0

Csr 2 Dense performance results #134

Closed jpola closed 9 years ago

jpola commented 9 years ago

I suspect that the Csr2Dense performance results are wrong.

Example: tomography.mtx.

clSPARSE matrix: /home/jpola/Projects/ClMath_ClSparse/clSPARSE-build/Externals/MTX/Small/tomography/tomography.mtx

========================StdDev ( 3 )========================
CPU xCsr2Dense[ 0 ]: Pruning 1 samples out of 50

=======================CPU xCsr2Dense=======================
   GiElements/s:                      0.176366
     Time (ns):                                      162,877

========================StdDev ( 3 )========================
GPU xCsr2Dense[ 0 ]: Pruning 1 samples out of 50

=======================GPU xCsr2Dense=======================
     OutEvents:            0xe580f0
   GiElements/s:                       2.64508
     Time (ns):                                       10,860

According to the graph, the GiElements/s for tomography is 0.179. I think someone might have taken the CPU results instead of the GPU results.

A similar result for Margal_6.mtx: CPU: 0.0226612, GPU: 1.60151, graph: 0.059.

The results I obtained were generated on an R270; I would be surprised if the difference from Hawaii were this big.

kknox commented 9 years ago

@jpola I think I need clarification; are you talking about the graph on our perf wiki? Your results show .176, and the one on the wiki is .179. We take the performance from the CPU timer, which measures the time spent in the API. I might be misunderstanding your point.

Btw, with PR #135 now merged, I assume we need to generate new graphs now?

jpola commented 9 years ago

It might be.

Could you please briefly explain the difference between the CPU timer and the GPU timer?

kknox commented 9 years ago

Admittedly, it's odd nomenclature; the CPU timer uses a host-side timing mechanism. It's synchronous in nature, so when you call start/stop, it happens immediately. I use this timer to wrap the API under question, so in our benchmarks it returns the total time spent in the API.
The GPU timer uses OpenCL's clGetEventProfilingInfo, which uses events to query OpenCL for the time spent in a kernel. It's asynchronous in nature, meaning it is not valid to query for timing information until after the kernel has completed execution. For this reason, it's necessary for this timer to keep references to the event handles until after we call clFinish(). Unfortunately, my solution is only half finished; I do not yet fully support API routines that launch more than a single GPU kernel. That is why the GPU timer result is accurate for our spm-dv routine, but not for the solver routines.
The GPU timer is a precise way of measuring kernel time. If you take the CPU timer result and subtract the kernel time reported by the GPU timer, the remainder is the overhead spent in host and runtime code.

jpola commented 9 years ago

Ok, I think I get the idea. The CPU timer measures the full time of a given API function. Does the GPU timer measure only the kernel time within that API? Does the GPU timer also measure memcpy operations?


kknox commented 9 years ago

Yes, the idea is that anything that gives you an OpenCL event should be queryable for execution time with clGetEventProfilingInfo(), but the ability to store events for multiple different kernels in a given API still needs to be finished. Right now, the GPU timer class assumes that each API has one GPU kernel, so anything more than that confuses it.

Eventually, for APIs that launch multiple kernels, I would like the ability to report each individual kernel time separately (so we can pinpoint which kernel is a bottleneck), and print them labelled like 'memcpy', 'sort', 'reducebykey', 'coo2csr'.

jpola commented 9 years ago

Ok, thanks for the explanation.