PSO function optimization

This issue documents the optimization of the PSO function.

Information was obtained from Intel VTune Profiler after tracking 20 frames (10 mc3, 10 rad) of the WN00105 dataset.

Initial Results

As we can see, the worst offender during the PSO function run is the NCC function: https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L186-L226

We can also see that the majority of the NCC function's run time comes from the ncc_sum function (which is expected, since we make 6 calls to the sum function during a single run of the ncc function). The sum function is in turn dominated by launching the kernel and by the Buffer::write method. Within the kernel launch function, calls are made to the OpenCL API, with calls to clEnqueueNDRangeKernel taking up ~5.196 seconds and calls to clFinish taking up ~4.330 seconds. The time in Buffer::write is a little counter-intuitive, because we are actually reading from the buffer and writing to the variable we passed in; all of the run time for Buffer::write is caused by the clEnqueueReadBuffer command.

https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L106-L153

Moving forward

Buffer::write

I would suggest removing the unnecessary call to Buffer::write from within the ncc_sum driver:

https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L150-L152

That way we can keep the sums on the GPU, since they are used by the regular NCC kernel. The calculation of meanG and meanF can then take place in the NCC kernel:

https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L188-L190

https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L205-L213

We will still need to read from the summation buffer once the execution of the regular ncc kernel is over, to return the expected float value. This will reduce our calls to Buffer::write during the ncc driver from 6 to 3. We could also investigate the viability of utilizing a kernel to calculate the den value, further reducing our calls from 3 to 2.

https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/Ncc.cpp#L219-L225
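To make the "6 to 3" count concrete, here is a minimal CPU sketch of a masked NCC written in terms of six sums. It is not the Autoscoper implementation: the name `ncc_reference`, the 0.5 mask threshold, and the zero fallbacks for an empty mask or a zero denominator are all assumptions made for illustration. The comments mark which sums are only intermediate inputs (and so would not need a host read) and which feed the returned value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a masked NCC (not the Autoscoper code).
float ncc_reference(const std::vector<float>& f,
                    const std::vector<float>& g,
                    const std::vector<float>& mask) {
    assert(f.size() == g.size() && g.size() == mask.size());
    const std::size_t n = f.size();

    // Sums 1-3 (pixel count, sum of f, sum of g): only used to form the
    // means, so under the proposal they could stay on the device.
    double nbPixel = 0.0, sumF = 0.0, sumG = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i] > 0.5f) {  // assumed mask threshold
            nbPixel += 1.0;
            sumF += f[i];
            sumG += g[i];
        }
    }
    if (nbPixel == 0.0) return 0.0f;  // assumed fallback for an empty mask
    const double meanF = sumF / nbPixel;
    const double meanG = sumG / nbPixel;

    // Sums 4-6 (numerator and the two denominator terms): these feed the
    // returned value. Combining den1 and den2 on the device would leave
    // two values to read back instead of three.
    double num = 0.0, den1 = 0.0, den2 = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i] > 0.5f) {
            const double fm = f[i] - meanF;
            const double gm = g[i] - meanG;
            num += fm * gm;
            den1 += fm * fm;
            den2 += gm * gm;
        }
    }
    const double den = std::sqrt(den1 * den2);
    if (den == 0.0) return 0.0f;  // assumed fallback for constant inputs
    return static_cast<float>(num / den);
}
```

A reference like this could also serve as a regression check while the GPU path is being changed: for inputs where `g` is a positive multiple of `f` inside the mask, the result should be 1 up to rounding.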

Kernel::Launch

I would suggest looking into ways to improve the sum kernel. ~~Currently, I am building Oclgrind to help aid in this investigation.~~ Oclgrind does not support OpenGL-OpenCL interop.

https://github.com/BrownBiomechanics/Autoscoper/blob/62a7679f6ebb6c80fa1a7b9c5ee38402093df77f/libautoscoper/src/gpu/opencl/kernel/NccSum.cl#L1-L26
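One direction worth evaluating is a fixed two-pass reduction, where each work-item first accumulates a strided slice of the input before the usual in-group tree reduction, so the number of kernel launches no longer depends on the input size. The sketch below simulates that pattern on the CPU. It is a generic illustration, not a description of what NccSum.cl currently does; `reduce_group`, `two_pass_sum`, and the default group counts are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of one work-group. Each of `localSize` work-items sums a
// strided slice of the input, then the group combines the per-item values
// with a tree reduction. `scratch` stands in for __local memory.
float reduce_group(const std::vector<float>& in, std::size_t groupId,
                   std::size_t numGroups, std::size_t localSize) {
    // The tree reduction below assumes a power-of-two group size.
    assert(localSize > 0 && (localSize & (localSize - 1)) == 0);
    std::vector<float> scratch(localSize, 0.0f);
    const std::size_t stride = numGroups * localSize;

    // Phase 1: strided accumulation, one slice per simulated work-item.
    for (std::size_t lid = 0; lid < localSize; ++lid) {
        for (std::size_t i = groupId * localSize + lid; i < in.size(); i += stride) {
            scratch[lid] += in[i];
        }
    }
    // Phase 2: tree reduction. On a GPU a barrier would separate each step.
    for (std::size_t s = localSize / 2; s > 0; s /= 2) {
        for (std::size_t lid = 0; lid < s; ++lid) {
            scratch[lid] += scratch[lid + s];
        }
    }
    return scratch[0];
}

// Two passes regardless of input size: numGroups partial sums, then one
// group reduces the partials to a single value.
float two_pass_sum(const std::vector<float>& in, std::size_t numGroups = 8,
                   std::size_t localSize = 64) {
    assert(numGroups > 0);
    std::vector<float> partials(numGroups);
    for (std::size_t g = 0; g < numGroups; ++g) {
        partials[g] = reduce_group(in, g, numGroups, localSize);
    }
    return reduce_group(partials, 0, 1, localSize);
}
```

If the partial sums also stay on the device between passes, this would pair naturally with the Buffer::write change above, since only the final value would need to be read back. Whether it actually helps would need to be confirmed with another VTune run.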

Outside of the PSO function

It would appear that everything outside of the PSO function runs in negligible time. NOTE: Entries in the stack of the form func@{memory address} are calls to functions from external libraries such as OpenCL, OpenGL, Microsoft Direct3D, Windows USER32, etc.
