This issue will be used to document the optimization of the PSO function.
Information was obtained from Intel VTune Profiler after tracking 20 frames (10 mc3, 10 rad) of the WN00105 dataset.
Initial Results
As we can see, the worst offender during the PSO function run is the NCC function: https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L186-L226
We can also see that most of the run time of the NCC function comes from the NCC_SUM function, which is expected since a single run of the ncc function makes 6 calls to the sum function. The sum function is in turn dominated by the kernel launch and the Buffer::write method. Within the kernel launch function, calls are made to the OpenCL API, with the calls to clEnqueueNDRangeKernel taking ~5.196 seconds and the calls to clFinish taking ~4.330 seconds. The buffer writes are a little counter-intuitive, because we are actually reading from the buffer and writing to the variable we passed in; all of the run time for Buffer::write is caused by the clEnqueueReadBuffer command.
https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L106-L153
Moving forward
Buffer::Write
I would suggest removing the unnecessary call to Buffer::write from within the ncc_sum driver:
https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L150-L152
That way we can keep the sums on the GPU, since they are used by the regular NCC kernel. The calculation of meanG and meanF can take place in the NCC kernel:
https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L188-L190
https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L205-L213
We will still need to read from the summation buffer once the regular ncc kernel has finished executing, in order to return the expected float value. This will reduce our calls to Buffer::write in the ncc driver from 6 to 3. We could also investigate the viability of using a kernel to calculate the den value, further reducing our calls from 3 to 2.
https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L219-L225
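To make the 6 to 3 (and 3 to 2) counts concrete, here is a rough CPU reference of the NCC computation. It is only a sketch under assumptions: it uses the standard normalized cross-correlation definition, it assumes f and g are already zero outside the mask, and sum() and ncc_reference() are hypothetical stand-ins rather than the actual driver code. The degenerate-denominator guard is also my own placeholder.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the ncc_sum driver. On the GPU, each of
// these calls currently ends with a device-to-host read.
static float sum(const std::vector<float>& v) {
  float s = 0.0f;
  for (float x : v) s += x;
  return s;
}

// Reference NCC, assuming f and g are already zero outside the mask.
float ncc_reference(const std::vector<float>& f,
                    const std::vector<float>& g,
                    const std::vector<float>& mask) {
  assert(f.size() == g.size() && f.size() == mask.size());
  const std::size_t n = f.size();

  // Sums 1-3: these only feed the per-pixel NCC step, so they could stay
  // on the device instead of being read back to the host.
  const float nbPixel = sum(mask);
  const float meanF = sum(f) / nbPixel;
  const float meanG = sum(g) / nbPixel;

  // Per-pixel terms (the part the regular NCC kernel would compute).
  std::vector<float> nums(n), den1s(n), den2s(n);
  for (std::size_t i = 0; i < n; ++i) {
    const float df = (f[i] - meanF) * mask[i];
    const float dg = (g[i] - meanG) * mask[i];
    nums[i] = df * dg;
    den1s[i] = df * df;
    den2s[i] = dg * dg;
  }

  // Sums 4-6: the three reads that remain. Computing den on the device
  // would merge the first two of them into a single read.
  const float den = std::sqrt(sum(den1s) * sum(den2s));
  if (den <= 0.0f) return 0.0f;  // hypothetical guard for flat images
  return sum(nums) / den;
}
```

With this layout, only the last three sums have to cross back to the host, which is where the 6 to 3 figure comes from.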
Kernel::Launch
I would suggest looking into improving the sum kernel somehow.
Currently, I am building OCLGrind to help aid in this investigation. OCLGrind does not support OpenGL-OpenCL interop.
https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/kernel/NccSum.cl#L1-L26
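One direction that might be worth testing is a reduction that always finishes in two launches, whatever the input size. The sketch below simulates that idea on the CPU. It is not a description of what NccSum.cl does today, and reduce_pass(), reduce_sum(), and the group counts and sizes are all hypothetical. Each simulated work-item first accumulates a strided slice of the input, then the group does a tree reduction in what would be local memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates one kernel launch with numGroups work-groups of groupSize
// work-items each. groupSize must be a power of two for the tree step.
std::vector<float> reduce_pass(const std::vector<float>& in,
                               std::size_t numGroups,
                               std::size_t groupSize) {
  assert(numGroups > 0 && groupSize > 0);
  assert((groupSize & (groupSize - 1)) == 0);
  const std::size_t stride = numGroups * groupSize;
  std::vector<float> partial(numGroups, 0.0f);
  std::vector<float> local(groupSize);  // stands in for local memory

  for (std::size_t grp = 0; grp < numGroups; ++grp) {
    // Step 1: every work-item sums its own strided slice of the input.
    for (std::size_t lid = 0; lid < groupSize; ++lid) {
      float acc = 0.0f;
      for (std::size_t i = grp * groupSize + lid; i < in.size(); i += stride)
        acc += in[i];
      local[lid] = acc;
    }
    // Step 2: tree reduction within the group.
    for (std::size_t s = groupSize / 2; s > 0; s /= 2)
      for (std::size_t lid = 0; lid < s; ++lid)
        local[lid] += local[lid + s];
    partial[grp] = local[0];
  }
  return partial;
}

// Two launches in total: many groups, then a single group over the partials.
float reduce_sum(const std::vector<float>& in) {
  const std::size_t groupSize = 64;  // hypothetical work-group size
  const std::size_t numGroups = 8;   // hypothetical number of groups
  const std::vector<float> partial = reduce_pass(in, numGroups, groupSize);
  return reduce_pass(partial, 1, groupSize)[0];
}
```

If the time really is going into the enqueue and clFinish calls rather than the arithmetic, keeping the number of launches per sum fixed and small is the property to aim for; the actual sizes would need to be tuned on the target device.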
Outside of the PSO function
It would appear that everything outside of the PSO function runs in negligible time. NOTE: Anything in the stack in the format func@{memory address} is a call to a function from an external library such as OpenCL, OpenGL, Microsoft Direct3D, Windows USER32, etc.