bmegli commented 1 year ago

As mentioned in:

Our code is now SIMD friendly

Some docs:

https://gcc.gnu.org/projects/tree-ssa/vectorization.html

Quick intro:

https://www.codingame.com/playgrounds/283/sse-avx-vectorization/autovectorization

Adding to CMake

 set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -fopt-info-vec-optimized -mavx")

We can enable autovectorization and see what was vectorized

bmegli commented 1 year ago

We need to filter out error output to see the gcc notes

catkin_make -DCMAKE_BUILD_TYPE=Release 2> >(grep camera_aravis)

With the code as is this gives:

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:48:21: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:351:42: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:353:1: optimized: basic block part vectorized using 32 byte vectors

/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:1279:3: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:2035:28: optimized: basic block part vectorized using 16 byte vectors

bmegli commented 1 year ago

The only loop of ours that was vectorized:

void shift(uint16_t* data, const size_t length, const size_t digits) {
  for (size_t i=0; i<length; ++i) {
    data[i] <<= digits;
  }
}

bmegli commented 1 year ago

The reason our function was not vectorized is the Photoneo deviation from the standard:

zeroing out pixels if Y is 0
which breaks the control flow of the loop (early return)

  //Photoneo specific:
  //Black pixels are treated specially in order to prevent
  //artifacts in images containing valid pixels only in a subregion

  if(!y)
  {
    bgra[0] = bgra[1] = bgra[2] = bgra[3] = 0;
    return;
  }

After commenting out this block of code

catkin_make -DCMAKE_BUILD_TYPE=Release 2> >(grep camera_aravis)

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:584:30: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:584:30: optimized:  loop versioned for vectorization because of possible aliasing

Which corresponds to our photoneoYCoCgR420 being vectorized with 32 byte vectors (256-bit AVX/AVX2 registers; AVX-512 would be 64 byte vectors)
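The "loop versioned for vectorization because of possible aliasing" note means gcc could not prove that the buffers do not overlap, so it emits a runtime overlap check and keeps both a vectorized and a scalar copy of the loop. A minimal sketch of how non-overlap can be stated explicitly, assuming input and output never alias. `scale_to_8bit` is a hypothetical helper, not code from camera_aravis, and `__restrict__` is a GCC/Clang extension rather than standard C++. Whether the note actually disappears depends on the gcc version and the loop.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of camera_aravis.
// __restrict__ promises that src and dst never overlap, so the compiler
// should not need a runtime aliasing check in front of the vector loop.
void scale_to_8bit(const uint16_t* __restrict__ src, uint8_t* __restrict__ dst,
                   std::size_t length, unsigned shift)
{
  for (std::size_t i = 0; i < length; ++i)
    dst[i] = static_cast<uint8_t>(src[i] >> shift);
}
```

Calling such a function with overlapping buffers is undefined behavior, so this only fits code paths where the caller owns two separate buffers.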

bmegli commented 1 year ago

This yields:

2-3 ms conversion in powersave mode
compare with:
- https://github.com/Extend-Robotics/camera_aravis/issues/13#issuecomment-1656759114
- up to 30 ms original
- https://github.com/Extend-Robotics/camera_aravis/issues/13#issuecomment-1656759631
- up to 11 ms optimized

bmegli commented 1 year ago

In performance mode this yields

around 1 ms
compare with:
- 7 ms for original, 3 ms for optimized
- https://github.com/Extend-Robotics/camera_aravis/issues/13#issuecomment-1658243084
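The thread does not show how these timings were taken. One way to get comparable numbers, sketched here under the assumption that a single conversion call is timed with a monotonic wall clock; `elapsed_ms` is a hypothetical helper:

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs f once and returns the elapsed
// wall-clock time in milliseconds, measured with a monotonic clock.
template <typename F>
double elapsed_ms(F&& f)
{
  const auto start = std::chrono::steady_clock::now();
  std::forward<F>(f)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In powersave mode the CPU frequency varies, so reporting the worst case over many frames ("up to N ms") is more meaningful than a single sample.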

bmegli commented 1 year ago

The reason our function was not vectorized is Photoneo deviation from the standard

zeroing out pixels if Y is 0

which breaks control flow of loop

If the Photoneo deviation from the standard is necessary:

we need to restructure the code to avoid breaking loop control flow

bmegli commented 1 year ago

Putting Y check as last and without control flow change (return)

  //Photoneo specific:
  //Black pixels are treated specially in order to prevent
  //artifacts in images containing valid pixels only in a subregion

  //Has to be written this way to enable gcc autovectorization
  //Do not change or move to beginning with return statement
  if(!y)
    bgra[0] = 0, bgra[1] = 0, bgra[2] = 0, bgra[3] = 0;    
}

We can again generate AVX code

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:583:30: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:583:30: optimized:  loop versioned for vectorization because of possible aliasing
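To make the pattern concrete, here is a self-contained sketch of the "check last, no return" shape. `gray_to_bgra8` and `convert_row` are hypothetical stand-ins, not the real conversion. Only the placement of the y == 0 check mirrors the snippet above, and this sketch has not been checked against gcc's vectorization report.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the real per-pixel conversion:
// writes a 16-bit luma sample as a gray, opaque BGRA pixel.
inline void gray_to_bgra8(uint8_t* bgra, uint16_t y)
{
  const uint8_t v = static_cast<uint8_t>(y > 255 ? 255 : y);
  bgra[0] = v;
  bgra[1] = v;
  bgra[2] = v;
  bgra[3] = 255;

  // Check placed last and without return: every pixel runs the same
  // stores, and the zeroing is only a conditional overwrite.
  if (!y)
    bgra[0] = 0, bgra[1] = 0, bgra[2] = 0, bgra[3] = 0;
}

// The loop body has a single exit path, which is what the vectorizer needs.
void convert_row(const uint16_t* luma, uint8_t* bgra, std::size_t pixels)
{
  for (std::size_t i = 0; i < pixels; ++i)
    gray_to_bgra8(bgra + 4 * i, luma[i]);
}
```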

bmegli commented 1 year ago

Notably, the code with the Y check performs a lot worse

up to 7-8 ms in powersave vs 2-3 ms without the check

And in performance mode

2 ms vs 1 ms
but we will not be running in performance mode

bmegli commented 1 year ago

After rewriting all int as int16_t (this is the real integer range in the algorithm)

in powersave this yields up to 7-8 ms (not much change)
- although I remember from my tests that the non-vectorized version is just a bit slower this way
we are going to keep this version
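One common reason to try narrower integers is lane count: a vector register of fixed width holds twice as many int16_t as int32_t elements. The measurement above shows this did not translate into a speedup here. A small sketch of the arithmetic, where `lanes` is a hypothetical helper:

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one vector register
// of the given width in bytes.
template <typename T>
constexpr std::size_t lanes(std::size_t vector_bytes)
{
  return vector_bytes / sizeof(T);
}

static_assert(lanes<int16_t>(32) == 16, "16 int16_t lanes per 32 byte vector");
static_assert(lanes<int32_t>(32) == 8, "8 int32_t lanes per 32 byte vector");
```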

bmegli commented 1 year ago

Summary

Photoneo deviation from the standard, zeroing the pixel if y==0,

prevents autovectorization

After rewriting the loop in such a way that the control flow is not modified by this check:

autovectorization happens
but is a lot less efficient compared to the version that follows the standard

Powersave

| No optimization | Optimized | Autovectorized without y==0 check | Autovectorized with y==0 check |
| --- | --- | --- | --- |
| up to 30 ms | up to 11 ms | 2-3 ms | up to 7-8 ms |

Performance

The code will not be running in performance mode, this is for reference

| No optimization | Optimized | Autovectorized without y==0 check | Autovectorized with y==0 check |
| --- | --- | --- | --- |
| 7 ms | 3 ms | 1 ms | 2 ms |

Last

We are keenly interested in Photoneo's answer to

https://github.com/photoneo-3d/photoneo-cpp-examples/issues/4

(question about the y==0 check)

bmegli commented 1 year ago

We got confirmation from Photoneo

https://github.com/photoneo-3d/photoneo-cpp-examples/issues/4#issuecomment-1660301722
that y==0 checks are strictly necessary to protect black pixels in 4:2:0 subsampling

In that case we need to rework the implementation a bit so that it:

autovectorizes
keeps the performance of the implementation without the y==0 check

bmegli commented 1 year ago

In that case we need to rework the implementation

We can restate the y==0 check as SIMD friendly multiplication while transferring pixels from ycocg-r to rgb

        // transfer YCoCg-R to BGRA8

        ////Photoneo specific:
        ////Black pixels are treated specially in order to prevent
        ////artifacts in images containing valid pixels only in a subregion
        ////See: https://github.com/photoneo-3d/photoneo-cpp-examples/issues/4#issuecomment-1660578655

        ////Our implementation specific:
        ////(yij != 0) multiplications zero out Co and Cg if the y sample is 0, protecting black pixels
        ////while keeping high performance SIMD autovectorization
        ////See: https://github.com/Extend-Robotics/camera_aravis/issues/15
        ycocgr_to_bgra8(bgra, y00, (y00 != 0) * csc_co, (y00 != 0) * csc_cg);
        ycocgr_to_bgra8(bgra + RGB_PIXEL_OFFSET, y01, (y01 != 0) * csc_co, (y01 != 0) * csc_cg);
        ycocgr_to_bgra8(bgra + RGB_STRIDE, y10, (y10 != 0) * csc_co, (y10 != 0) * csc_cg);
        ycocgr_to_bgra8(bgra + RGB_STRIDE + RGB_PIXEL_OFFSET, y11, (y11 != 0) * csc_co, (y11 != 0) * csc_cg);
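Why the multiplication works can be checked on a single pixel. The sketch below uses the standard YCoCg-R inverse lifting steps. `ycocgr_to_bgra8_sketch`, `clamp8` and `protected_pixel` are hypothetical names, and the real ycocgr_to_bgra8 in camera_aravis may differ in bit depth, integer types and alpha handling (opaque alpha is an assumption here).

```cpp
#include <cstdint>

// Clamp an intermediate value to the 8-bit channel range.
inline uint8_t clamp8(int v)
{
  return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Sketch of YCoCg-R to BGRA8 using the standard inverse lifting steps.
// Relies on >> being an arithmetic shift for negative values
// (true for gcc, guaranteed since C++20).
inline void ycocgr_to_bgra8_sketch(uint8_t* bgra, int y, int co, int cg)
{
  const int t = y - (cg >> 1);
  const int g = cg + t;
  const int b = t - (co >> 1);
  const int r = b + co;
  bgra[0] = clamp8(b);
  bgra[1] = clamp8(g);
  bgra[2] = clamp8(r);
  bgra[3] = 255; // assumption: opaque alpha
}

// Branch-free black pixel protection: (y != 0) is 0 or 1, so the
// products zero the chroma exactly when the luma sample is 0.
inline void protected_pixel(uint8_t* bgra, int y, int co, int cg)
{
  const int keep = (y != 0);
  ycocgr_to_bgra8_sketch(bgra, y, keep * co, keep * cg);
}
```

With y = 0, co = 100, cg = -50 the unprotected conversion produces a non-black pixel (R = 75), while the protected one yields B = G = R = 0. For y != 0 both give the same result, and neither path contains a branch that changes the loop's control flow.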

Powersave

back to 2-3 ms

Performance

back to ~ 1 ms

PhoXiControl

Artifacts in pencil texture projection are not due to our implementation but come from device computed texture

bmegli commented 1 year ago

Made sure autovectorization also happens with default release flags

In that case it is vectorized with

16 byte vectors (vs 32 byte vectors)
which is the 128-bit SSE register width (SSE2 is the x86-64 baseline), not AVX

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:582:30: optimized: loop vectorized using 16 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:582:30: optimized:  loop versioned for vectorization because of possible aliasing

It is reasonably efficient.

32 byte vectors are used with -march=native
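Which vector width a given set of flags allows can be read from the compiler's predefined macros. A minimal sketch for x86 targets, where `target_vector_bytes` is a hypothetical helper and the macros are the ones gcc and clang define for the corresponding instruction sets. The value is only an upper bound, since gcc may still choose narrower vectors for a particular loop.

```cpp
#include <cstddef>

// Widest x86 SIMD register width, in bytes, that the compiler was allowed
// to target, based on predefined macros. Returns 0 when none of these
// macros is defined (for example on non-x86 targets).
constexpr std::size_t target_vector_bytes()
{
#if defined(__AVX512F__)
  return 64;
#elif defined(__AVX__)
  return 32;
#elif defined(__SSE2__)
  return 16;
#else
  return 0;
#endif
}
```

The -fopt-info-vec-optimized report remains the authoritative source for what was actually generated.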

Extend-Robotics / camera_aravis

gcc autovectorization for YCoCb-R pixel format conversion #15
