Closed bmegli closed 1 year ago
We need to filter the error output (stderr) to see the gcc notes:
catkin_make -DCMAKE_BUILD_TYPE=Release 2> >(grep camera_aravis)
With the code as is this gives:
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:48:21: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:351:42: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:353:1: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:1279:3: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:2035:28: optimized: basic block part vectorized using 16 byte vectors
The only loop that was vectorized was:
void shift(uint16_t* data, const size_t length, const size_t digits) {
    for (size_t i=0; i<length; ++i) {
        data[i] <<= digits;
    }
}
The reason our function was not vectorized is Photoneo deviation from the standard
//Photoneo specific:
//Black pixels are treated specially in order to prevent
//artifacts in images containing valid pixels only in a subregion
if(!y)
{
bgra[0] = bgra[1] = bgra[2] = bgra[3] = 0;
return;
}
After commenting out this block of code
catkin_make -DCMAKE_BUILD_TYPE=Release 2> >(grep camera_aravis)
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:584:30: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:584:30: optimized: loop versioned for vectorization because of possible aliasing
This corresponds to our photoneoYCoCgR420, now vectorized with 32 byte vectors (256-bit, the AVX/AVX2 register width; full-width AVX-512 would use 64 byte vectors)
This yields:
In performance mode this yields
The reason our function was not vectorized is Photoneo deviation from the standard
- zeroing out pixels if Y is 0
- which breaks control flow of loop
If the Photoneo deviation from the standard is necessary, we can put the Y check last, without a control flow change (no early return):
//Photoneo specific:
//Black pixels are treated specially in order to prevent
//artifacts in images containing valid pixels only in a subregion
//Has to be written this way to enable gcc autovectorization
//Do not change or move to beginning with return statement
if(!y)
bgra[0] = 0, bgra[1] = 0, bgra[2] = 0, bgra[3] = 0;
}
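A minimal sketch of the "check last, no return" pattern, for illustration only. The helper names (`clamp_u8`, `pixel_to_bgra8_sketch`, `convert_sketch`) and the arithmetic are hypothetical stand-ins, not the camera_aravis code; only the placement of the `!y` check mirrors the snippet above. Whether gcc vectorizes a loop like this depends on compiler version and flags.

```cpp
#include <cstddef>
#include <cstdint>

// Clamp an intermediate value to the 8-bit range.
static inline uint8_t clamp_u8(int v) {
    return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Hypothetical per-pixel helper: the colour math is made up for the sketch,
// only the position of the y == 0 check is the point.
static inline void pixel_to_bgra8_sketch(uint8_t* bgra, int16_t y, int16_t c) {
    bgra[0] = clamp_u8(y - c);
    bgra[1] = clamp_u8(y);
    bgra[2] = clamp_u8(y + c);
    bgra[3] = 255;
    // The check comes last and only overwrites values: no early return,
    // so every pixel follows the same path through the function.
    if (!y)
        bgra[0] = 0, bgra[1] = 0, bgra[2] = 0, bgra[3] = 0;
}

// Pixel loop calling the helper; the loop body has a single exit.
void convert_sketch(const int16_t* y, const int16_t* c, uint8_t* bgra, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        pixel_to_bgra8_sketch(bgra + 4 * i, y[i], c[i]);
}
```

With the check at the end, the compiler can treat it as a conditional overwrite (a select) instead of a branch that leaves the function early.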
We can again generate AVX code
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:583:30: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:583:30: optimized: loop versioned for vectorization because of possible aliasing
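The "loop versioned for vectorization because of possible aliasing" note means gcc could not prove that the pointers used in the loop do not overlap, so it emitted a runtime overlap check and kept a scalar fallback next to the vector loop. A hedged sketch of one way to state non-overlap explicitly, assuming the gcc/clang `__restrict__` extension; `widen_shift` is a hypothetical function, and whether this would remove the versioning in photoneoYCoCgR420 has not been tested here.

```cpp
#include <cstddef>
#include <cstdint>

// __restrict__ (gcc/clang extension) promises that src and dst never overlap.
// The caller is responsible for keeping that promise.
void widen_shift(const uint16_t* __restrict__ src,
                 uint8_t* __restrict__ dst,
                 std::size_t n,
                 unsigned shift) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<uint8_t>(src[i] >> shift);
}
```

The versioned loop is still vectorized, so this is an optional refinement rather than something the results above depend on.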
Notably, the code with the Y check performs a lot worse
And performance
After rewriting all int as int16_t (this is the real integer range in the algorithm)
Photoneo deviation from the standard: zeroing the pixel if y==0
After rewriting the loop in such a way that the control flow is not modified by this check:
No optimization | Optimized | Autovectorized without y==0 check | Autovectorized with y==0 check |
---|---|---|---|
up to 30 ms | up to 11 ms | 2-3 ms | Up to 7-8 ms |
The code will not be running in performance mode; this is for reference
No optimization | Optimized | Autovectorized without y==0 check | Autovectorized with y==0 check |
---|---|---|---|
7 ms | 3 ms | 1 ms | 2 ms |
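One reason the int to int16_t rewrite mentioned above can matter for SIMD: a fixed-width vector holds twice as many 16-bit lanes as 32-bit lanes. A small sketch of the arithmetic, assuming the 32 byte vectors reported by gcc; `kVectorBytes` and `lanes_per_vector` are hypothetical names used only here, and the effect on the measured timings is not something this sketch demonstrates.

```cpp
#include <cstddef>
#include <cstdint>

// Vector width reported in the gcc notes ("32 byte vectors").
constexpr std::size_t kVectorBytes = 32;

// Number of elements of type T that fit in one vector register.
template <typename T>
constexpr std::size_t lanes_per_vector() {
    return kVectorBytes / sizeof(T);
}

static_assert(lanes_per_vector<int16_t>() == 16, "16 lanes of int16_t per 32 byte vector");
static_assert(lanes_per_vector<int32_t>() == 8, "8 lanes of int32_t per 32 byte vector");
```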
We are keenly interested in Photoneo's answer to the question about the y==0 check
We got confirmation from Photoneo: y==0 checks are strictly necessary to protect black pixels in 4:2:0 subsampling.
In that case we need to rework the implementation a bit so that it includes the y==0 check while staying autovectorizable.
We can restate the y==0 check as a SIMD-friendly multiplication while converting pixels from YCoCg-R to RGB:
// transfer YCoCg-R to BGRA8
////Photoneo specific:
////Black pixels are treated specially in order to prevent
////artifacts in images containing valid pixels only in a subregion
////See: https://github.com/photoneo-3d/photoneo-cpp-examples/issues/4#issuecomment-1660578655
////Our implementation specific:
////(yij != 0) multiplications zero out YCoCg-R chroma if y sample is 0 protecting black pixels
////at the same time and keeping high performance SIMD autovectorization
////See: https://github.com/Extend-Robotics/camera_aravis/issues/15
ycocgr_to_bgra8(bgra, y00, (y00 != 0) * csc_co, (y00 != 0) * csc_cg);
ycocgr_to_bgra8(bgra + RGB_PIXEL_OFFSET, y01, (y01 != 0) * csc_co, (y01 != 0) * csc_cg);
ycocgr_to_bgra8(bgra + RGB_STRIDE, y10, (y10 != 0) * csc_co, (y10 != 0) * csc_cg);
ycocgr_to_bgra8(bgra + RGB_STRIDE + RGB_PIXEL_OFFSET, y11, (y11 != 0) * csc_co, (y11 != 0) * csc_cg);
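To make the masking trick concrete, here is a self-contained sketch. `ycocgr_to_bgra8_sketch` uses the standard YCoCg-R inverse lifting steps and assumes plain 8-bit output; it is a hypothetical stand-in for the real `ycocgr_to_bgra8`, whose bit depth, scaling and alpha handling are not shown in this thread. `clamp_u8` and `ycocgr_to_bgra8_masked` are likewise names made up for the example.

```cpp
#include <cstdint>

// Clamp an intermediate value to the 8-bit range.
static inline uint8_t clamp_u8(int v) {
    return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Standard YCoCg-R inverse lifting (assumed 8-bit range for this sketch).
static inline void ycocgr_to_bgra8_sketch(uint8_t* bgra, int16_t y, int16_t co, int16_t cg) {
    const int t = y - (cg >> 1);
    const int g = cg + t;
    const int b = t - (co >> 1);
    const int r = b + co;
    bgra[0] = clamp_u8(b);
    bgra[1] = clamp_u8(g);
    bgra[2] = clamp_u8(r);
    bgra[3] = 255;  // assumption: alpha stays opaque in this sketch
}

// (y != 0) evaluates to 0 or 1, so the products zero the chroma of black
// pixels without any branch in the pixel loop.
static inline void ycocgr_to_bgra8_masked(uint8_t* bgra, int16_t y, int16_t co, int16_t cg) {
    ycocgr_to_bgra8_sketch(bgra, y,
                           static_cast<int16_t>((y != 0) * co),
                           static_cast<int16_t>((y != 0) * cg));
}
```

For y = 0 with shared chroma co = 40, cg = -20, the unmasked conversion in this sketch yields a non-black pixel (red = 30), while the masked call yields black colour channels. For non-zero y the two calls give identical results.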
Artifacts in pencil texture projection are not due to our implementation but come from device computed texture
Made sure autovectorization also happens with default release flags
In that case it is vectorized with:
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:582:30: optimized: loop vectorized using 16 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:582:30: optimized: loop versioned for vectorization because of possible aliasing
It is reasonably efficient.
32 byte vectors are used with -march=native
As mentioned in:
Our code is now SIMD friendly
Some docs:
Quick intro:
Adding to CMake
We can enable auto vectorization and see what was vectorized
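A minimal sketch of checking this outside catkin, assuming gcc; the file name `vec_demo.cpp` and the function `scale` are hypothetical. The `-fopt-info-vec-optimized` flag asks gcc to report loops it vectorized; the notes go to stderr, which is why the catkin_make command in this thread filters `2>`. The exact wording and vector width of the notes depend on the gcc version and target.

```cpp
// vec_demo.cpp (hypothetical file name)
// Build, assuming gcc:
//   g++ -O3 -march=native -fopt-info-vec-optimized -c vec_demo.cpp
// Add -fopt-info-vec-missed to also see why other loops were not vectorized.
#include <cstddef>
#include <cstdint>

// Simple element-wise loop with no early exit, a typical candidate for
// autovectorization.
void scale(uint16_t* data, std::size_t length, uint16_t gain) {
    for (std::size_t i = 0; i < length; ++i) {
        data[i] = static_cast<uint16_t>(data[i] * gain);
    }
}
```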