Extend-Robotics / camera_aravis

A ROS1 driver for GenICam based GigE and USB3 cameras.

gcc autovectorization for YCoCb-R pixel format conversion #15

Closed bmegli closed 1 year ago

bmegli commented 1 year ago

As mentioned in:

Our code is now SIMD friendly

Some docs:

Quick intro:

Adding to CMake

 set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -fopt-info-vec-optimized -mavx")

We can enable autovectorization and see what was vectorized

bmegli commented 1 year ago

We need to filter out error output to see the gcc notes

catkin_make -DCMAKE_BUILD_TYPE=Release 2> >(grep camera_aravis)

With the code as is this gives:

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:48:21: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:351:42: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:353:1: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:1279:3: optimized: basic block part vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/camera_aravis_nodelet.cpp:2035:28: optimized: basic block part vectorized using 16 byte vectors
bmegli commented 1 year ago

The only loop vectorized was

void shift(uint16_t* data, const size_t length, const size_t digits) {
  for (size_t i=0; i<length; ++i) {
    data[i] <<= digits;
  }
}
bmegli commented 1 year ago

The reason our function was not vectorized is the Photoneo deviation from the standard

  //Photoneo specific:
  //Black pixels are treated specially in order to prevent
  //artifacts in images containing valid pixels only in a subregion

  if(!y)
  {
    bgra[0] = bgra[1] = bgra[2] = bgra[3] = 0;
    return;
  }

After commenting out this block of code

catkin_make -DCMAKE_BUILD_TYPE=Release 2> >(grep camera_aravis)

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:584:30: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:584:30: optimized:  loop versioned for vectorization because of possible aliasing

Which corresponds to our photoneoYCoCgR420 vectorized with 32 byte vectors (256-bit, AVX/AVX2 width; full AVX-512 registers would be 64 byte)
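
The second gcc note ("loop versioned for vectorization because of possible aliasing") means the compiler could not prove that the input and output buffers are distinct, so it keeps a vectorized and a scalar version of the loop and picks one with a runtime overlap check. A minimal sketch of stating the no-overlap guarantee explicitly, assuming the buffers really never overlap. `halve_samples` is a hypothetical helper and not part of camera_aravis; `__restrict__` is a GCC/Clang extension, not standard C++.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only, not taken from camera_aravis.
// __restrict__ promises the compiler that src and dst never overlap,
// which may let it skip the runtime aliasing check before the vector loop.
void halve_samples(const uint16_t* __restrict__ src,
                   uint16_t* __restrict__ dst,
                   const std::size_t length)
{
  for (std::size_t i = 0; i < length; ++i)
    dst[i] = static_cast<uint16_t>(src[i] >> 1);
}
```

Whether the versioning note disappears for a given loop depends on the gcc version and flags; the -fopt-info-vec-optimized output is the way to check.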

bmegli commented 1 year ago

This yields:

image

bmegli commented 1 year ago

In performance mode this yields

image

bmegli commented 1 year ago

The reason our function was not vectorized is the Photoneo deviation from the standard

  • zeroing out pixels if Y is 0
  • done with an early return, which breaks the control flow of the loop

If the Photoneo deviation from the standard is necessary:

bmegli commented 1 year ago

Putting the Y check last, without a control flow change (no return)

  //Photoneo specific:
  //Black pixels are treated specially in order to prevent
  //artifacts in images containing valid pixels only in a subregion

  //Has to be written this way to enable gcc autovectorization
  //Do not change or move to beginning with return statement
  if(!y)
    bgra[0] = 0, bgra[1] = 0, bgra[2] = 0, bgra[3] = 0;    
}

We can again generate AVX code

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:583:30: optimized: loop vectorized using 32 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:583:30: optimized:  loop versioned for vectorization because of possible aliasing
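
The two shapes of the check can be compared side by side in a self-contained sketch. Assumptions: 8-bit samples, the standard YCoCg-R inverse lifting steps, alpha fixed at 255 for valid pixels, and hypothetical function names. The real conversion in conversion_utils.cpp also handles 4:2:0 chroma subsampling and is not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Clamp an intermediate value to the 8-bit output range.
static inline uint8_t clamp_u8(const int v)
{
  return static_cast<uint8_t>(std::min(255, std::max(0, v)));
}

// Shape 1: early return. The function exits before the stores,
// which is the control flow change discussed in this issue.
static inline void pixel_early_return(uint8_t* bgra, int y, int co, int cg)
{
  if (!y)
  {
    bgra[0] = bgra[1] = bgra[2] = bgra[3] = 0;
    return;
  }
  const int t = y - (cg >> 1);  // >> on negative int is arithmetic on gcc
  const int g = cg + t;
  const int b = t - (co >> 1);
  const int r = b + co;
  bgra[0] = clamp_u8(b);
  bgra[1] = clamp_u8(g);
  bgra[2] = clamp_u8(r);
  bgra[3] = 255;
}

// Shape 2: always compute and store, then overwrite with zeros if y == 0.
// Every pixel follows the same path, so the check can become a select.
static inline void pixel_check_last(uint8_t* bgra, int y, int co, int cg)
{
  const int t = y - (cg >> 1);
  const int g = cg + t;
  const int b = t - (co >> 1);
  const int r = b + co;
  bgra[0] = clamp_u8(b);
  bgra[1] = clamp_u8(g);
  bgra[2] = clamp_u8(r);
  bgra[3] = 255;
  if (!y)
    bgra[0] = 0, bgra[1] = 0, bgra[2] = 0, bgra[3] = 0;
}

// Hypothetical driver loop over n pixels stored as separate planes.
void convert_check_last(const int16_t* y, const int16_t* co, const int16_t* cg,
                        uint8_t* bgra, const std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    pixel_check_last(bgra + 4 * i, y[i], co[i], cg[i]);
}
```

Both shapes produce the same bytes for every input; only the control flow differs. Whether gcc vectorizes this particular sketch is not claimed here and should be confirmed with -fopt-info-vec-optimized.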
bmegli commented 1 year ago

Notably, the code with the Y check performs a lot worse

image

And in performance mode

image

bmegli commented 1 year ago

After rewriting all int as int16_t (the values in the algorithm fit in this range)
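
One plausible reason narrower types help (the issue itself does not spell it out): a vector of fixed size holds twice as many 16-bit lanes as 32-bit lanes, so each vector instruction processes twice as many samples. A minimal sketch of that arithmetic; `lanes_per_vector` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one SIMD vector of vector_bytes.
template <typename T>
constexpr std::size_t lanes_per_vector(const std::size_t vector_bytes)
{
  return vector_bytes / sizeof(T);
}

// 32 byte vectors, as reported by gcc with -march=native in this issue.
static_assert(lanes_per_vector<int32_t>(32) == 8, "8 lanes of 32-bit values");
static_assert(lanes_per_vector<int16_t>(32) == 16, "16 lanes of 16-bit values");
```

The same ratio holds for the 16 byte vectors: 4 lanes of int32_t versus 8 lanes of int16_t.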

image

bmegli commented 1 year ago

Summary

The Photoneo deviation from the standard zeroes the pixel if y==0

After rewriting the loop so that the control flow is not modified by this check:

Powersave

No optimization | Optimized   | Autovectorized without y==0 check | Autovectorized with y==0 check
up to 30 ms     | up to 11 ms | 2-3 ms                            | up to 7-8 ms

Performance

The code will not be running in performance mode; this is for reference only

No optimization | Optimized | Autovectorized without y==0 check | Autovectorized with y==0 check
7 ms            | 3 ms      | 1 ms                              | 2 ms

Last

We are keenly interested in Photoneo answering

Question about y==0 check

bmegli commented 1 year ago

We got confirmation from Photoneo

In that case we need to rework the implementation a bit so that it:

bmegli commented 1 year ago

In that case we need to rework the implementation

We can restate the y==0 check as a SIMD-friendly multiplication while converting pixels from YCoCg-R to RGB (BGRA8)

        // transfer YCoCg-R to BGRA8

        ////Photoneo specific:
        ////Black pixels are treated specially in order to prevent
        ////artifacts in images containing valid pixels only in a subregion
        ////See: https://github.com/photoneo-3d/photoneo-cpp-examples/issues/4#issuecomment-1660578655

        ////Our implementation specific:
        ////(yij != 0) multiplications zero out YCoCg-R if y sample is 0 protecting black pixels
        ////at the same time and keeping high performance SIMD autovectorization
        ////See: https://github.com/Extend-Robotics/camera_aravis/issues/15
        ycocgr_to_bgra8(bgra, y00, (y00 != 0) * csc_co, (y00 != 0) * csc_cg);
        ycocgr_to_bgra8(bgra + RGB_PIXEL_OFFSET, y01, (y01 != 0) * csc_co, (y01 != 0) * csc_cg);
        ycocgr_to_bgra8(bgra + RGB_STRIDE, y10, (y10 != 0) * csc_co, (y10 != 0) * csc_cg);
        ycocgr_to_bgra8(bgra + RGB_STRIDE + RGB_PIXEL_OFFSET, y11, (y11 != 0) * csc_co, (y11 != 0) * csc_cg);
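
Why the multiplication works can be seen in a self-contained sketch: `(y != 0)` evaluates to 1 or 0, so black pixels get Co = Cg = 0, and with Y = 0 the YCoCg-R inverse then yields B = G = R = 0 without any branch. Assumptions: `ycocgr_to_bgra8_sketch` is a hypothetical stand-in for `ycocgr_to_bgra8` (whose body is not shown in this issue), samples are 8-bit, there is no chroma subsampling, and alpha is always 255, which may differ from the real code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Clamp an intermediate value to the 8-bit output range.
static inline uint8_t clamp_u8(const int v)
{
  return static_cast<uint8_t>(std::min(255, std::max(0, v)));
}

// Hypothetical stand-in for ycocgr_to_bgra8: standard YCoCg-R inverse lifting.
static inline void ycocgr_to_bgra8_sketch(uint8_t* bgra, int y, int co, int cg)
{
  const int t = y - (cg >> 1);  // >> on negative int is arithmetic on gcc
  const int g = cg + t;
  const int b = t - (co >> 1);
  const int r = b + co;
  bgra[0] = clamp_u8(b);
  bgra[1] = clamp_u8(g);
  bgra[2] = clamp_u8(r);
  bgra[3] = 255;  // assumption: alpha handling of the real code is not shown
}

// Branch-free black pixel protection: chroma is multiplied by 0 when y == 0.
void convert_masked(const int16_t* y, const int16_t* co, const int16_t* cg,
                    uint8_t* bgra, const std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
  {
    const int keep = (y[i] != 0);  // 1 for valid pixels, 0 for black pixels
    ycocgr_to_bgra8_sketch(bgra + 4 * i, y[i], keep * co[i], keep * cg[i]);
  }
}
```

The design choice is to turn the special case into arithmetic that every lane executes identically, instead of a branch that only some pixels take.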

Powersave

image

Performance

image

PhoXiControl

Artifacts in the pencil texture projection are not due to our implementation; they come from the device-computed texture

image

bmegli commented 1 year ago

Made sure autovectorization also happens with default release flags

In that case it is vectorized with

/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:582:30: optimized: loop vectorized using 16 byte vectors
/home/meglickib/catkin_ws/src/camera_aravis/src/conversion_utils.cpp:582:30: optimized:  loop versioned for vectorization because of possible aliasing

It is reasonably efficient.

32 byte vectors are used with -march=native
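
The difference between 16 and 32 byte vectors follows from which instruction sets the flags enable: the x86-64 baseline includes SSE2 (16 byte registers), while -march=native enables whatever the build machine supports. A minimal sketch that reports this at compile time through the predefined macros `__SSE2__`, `__AVX2__` and `__AVX512F__`; `widest_enabled_vector_bytes` is a hypothetical helper, and gcc may still choose narrower vectors than the widest enabled ones.

```cpp
#include <cstddef>

// Widest integer SIMD register size, in bytes, enabled by the compiler flags.
// Returns 0 when none of the x86 macros below is defined (e.g. on other CPUs).
constexpr std::size_t widest_enabled_vector_bytes()
{
#if defined(__AVX512F__)
  return 64;  // 512-bit registers
#elif defined(__AVX2__)
  return 32;  // 256-bit integer operations
#elif defined(__SSE2__)
  return 16;  // 128-bit registers, part of the x86-64 baseline
#else
  return 0;
#endif
}
```

Printing this value from a node built with each set of flags is a quick way to confirm what the build actually enabled.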