DanielChappuis / reactphysics3d

Open source C++ physics engine library in 3D
http://www.reactphysics3d.com
zlib License

Adding SIMD support for math calculations #164

Open codetiger opened 4 years ago

codetiger commented 4 years ago

Thanks for the wonderful project.

I was analyzing the code out of curiosity and found that the underlying math does not take advantage of the SIMD units and vector co-processors available on most hardware. I would like to know if there are any plans to add this support; it would definitely make the library much faster. Though multi-threading and GPU support are in the pipeline, SIMD can give up to a 4x speed boost without much effort.

I would like to contribute to this if it is within the scope of your project roadmap. Let me know your thoughts.

DanielChappuis commented 4 years ago

Of course, I would like to implement this in the future.

However, I am not sure how to do it properly. I would like to use a SIMD library that is cross-platform and well maintained. Do you have any ideas about this?

codetiger commented 4 years ago

Yes, I've implemented this from scratch for my own game engine in the past. I have also contributed a little to adding ARM NEON support in Bullet Physics, which was later replaced by Apple's contribution targeting more recent co-processors. In my game engine, I added support for ARM NEON and SSE3, as the target platforms were iOS and Mac.

In this case, we need to cover a lot more platforms and, of course, provide a fallback to a native scalar implementation. As you have a beautiful project without any dependencies, I am sure you will not want to go with external SIMD libraries.

About the approach: we need to add compiler flags for each implementation and add platform-specific code paths in the Vector, Matrix and Quaternion classes. Take a look at this GLKMath code from Apple for example: https://github.com/codetiger/Iyan3d/blob/master/SGEngine2/Core/common/GLKMath/GLKVector4.h
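As a rough illustration of that approach (a sketch only: the macro and type names here are hypothetical, not reactphysics3d code), a vector addition might select a backend at compile time and fall back to scalar code elsewhere:

```cpp
#include <cassert>

// Hypothetical compile-time backend selection; macro names are illustrative.
#if defined(__SSE__) || defined(_M_X64)
  #include <xmmintrin.h>
  #define USE_SSE 1
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  #define USE_NEON 1
#endif

struct Vector4 {
  float x, y, z, w;
};

inline Vector4 add(const Vector4& a, const Vector4& b) {
#if defined(USE_SSE)
  // _mm_set_ps takes lanes in reverse order (w, z, y, x).
  __m128 r = _mm_add_ps(_mm_set_ps(a.w, a.z, a.y, a.x),
                        _mm_set_ps(b.w, b.z, b.y, b.x));
  float out[4];
  _mm_storeu_ps(out, r);  // out[0] holds the lowest lane (x)
  return {out[0], out[1], out[2], out[3]};
#elif defined(USE_NEON)
  // The four floats of Vector4 are contiguous, so they load directly.
  float32x4_t r = vaddq_f32(vld1q_f32(&a.x), vld1q_f32(&b.x));
  Vector4 v;
  vst1q_f32(&v.x, r);
  return v;
#else
  // Portable scalar fallback for all other targets.
  return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
#endif
}
```

Each class would need an equivalent set of guarded paths, which is where most of the maintenance cost of this approach lives.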

codetiger commented 4 years ago

Forgot to mention: Eigen is a beautiful high-level header-only C++ template library for linear algebra, matrix and vector operations. If you are open to adding such external libraries, we could evaluate a few.

DanielChappuis commented 4 years ago

Currently, I am trying to reduce cache misses and to improve algorithms. After this, the execution will be more compute bound and therefore SIMD will be the next step to improve the speed of the library.

Of course, if you want to take some time to try using SIMD in the library, do not hesitate to test it on your side. I am really interested to see what the improvement could be.

mrakh commented 4 years ago

For SIMD support, I would highly recommend just using what the compiler already provides. GCC, Clang and MSVC all have built-in support for AMD64's SSE/AVX intrinsics and ARM's NEON/SVE intrinsics with a simple #include. Any other architecture that might have SIMD is too exotic to be worth the effort anyway.
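For example (a sketch, not library code), a 4-float dot product can be written directly against the compiler-shipped SSE header, with a plain scalar fallback for other targets:

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64)
  #include <xmmintrin.h>  // SSE intrinsics, shipped with GCC/Clang/MSVC
#endif

float dot4(const float* a, const float* b) {
#if defined(__SSE__) || defined(_M_X64)
  __m128 va = _mm_loadu_ps(a);
  __m128 vb = _mm_loadu_ps(b);
  __m128 m  = _mm_mul_ps(va, vb);
  // Horizontal sum of the four products using SSE1-only shuffles.
  __m128 shuf = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
  __m128 sums = _mm_add_ps(m, shuf);           // [m0+m1, m0+m1, m2+m3, m2+m3]
  shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));
  sums = _mm_add_ps(sums, shuf);               // total in every lane
  return _mm_cvtss_f32(sums);
#else
  // Scalar fallback for non-x86 targets.
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

No external dependency is involved; the headers come with the toolchain itself.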

tay10r commented 2 years ago

I'd just like to chime in here and share my experience with SIMD.

> SIMD will give upto 4x speed boost without much effort

If you're talking about SSE/NEON, it's significant effort. It's a huge pain adding compiler checks and handling the different ways compilers support each instruction set. For example:

  /* Not all compilers will translate the + operator to _mm_add_ps;
     the portable form is to call _mm_add_ps(a, b) explicitly. */
  __m128 a = _mm_set1_ps(1.0f) + _mm_set1_ps(2.0f);

And if you're talking about trying to do something like this:

struct Vec3 final
{
  union {
    __m128 data;
    struct { float x, y, z, unused; };
  };
};

Don't: it doesn't carry over well to other instruction sets (AVX2, AVX-512).

Fast-BVH used to do something like this; removing it actually improved performance because it led to better compiler optimizations.
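A sketch of the alternative (illustrative code, not Fast-BVH or reactphysics3d): keep the math types plain and let the optimizer choose the vector width when looping over arrays.

```cpp
#include <cassert>

// Plain scalar type: no union with __m128, no hard-coded register width.
struct Vec3 {
  float x, y, z;
};

// Operating over whole arrays leaves the optimizer free to use SSE,
// AVX or NEON of whatever width the target supports, which a fixed
// __m128 member would prevent.
void addArrays(const Vec3* a, const Vec3* b, Vec3* out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i].x = a[i].x + b[i].x;
    out[i].y = a[i].y + b[i].y;
    out[i].z = a[i].z + b[i].z;
  }
}
```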

> For SIMD support, I would highly recommend just using what the compiler already provides. GCC, Clang and MSVC all have built-in support for AMD64's SSE/AVX intrinsics and ARM's NEON/SVE intrinsics with a simple #include. Any other architecture that might have SIMD is too exotic to be worth the effort anyway.

I agree that it's practical to rely on compiler optimizations. But choosing to include the intrinsic headers and write your own SIMD code is not relying on the compiler; it's doing the work yourself. Merely including the header buys you nothing.

For any algorithm: if you've made every effort to optimize the algorithmic logic and cache usage (based on data collected by a profiler), then try writing an optimized function using either a SIMD library or ISPC. I've used ISPC extensively, and it generates SIMD code far better than any C++ compiler thanks to its explicitly parallel syntax.

Also: be extra wary of any "SIMD" code generated by the compiler. If you're looking at x86 SIMD code, make sure it's not just generating *_ss instructions (scalar single-precision) as opposed to *_ps instructions (packed single-precision). GCC and Clang (and MSVC, I think) will fall back to these scalar instructions when they can't safely convert your code into data-parallel code.
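One way to make packed codegen more likely (a sketch; nothing here is from the library) is to write loops whose iterations are clearly independent, then inspect the assembly (e.g. `g++ -O2 -S`) for addps/mulps versus scalar addss/mulss:

```cpp
#include <cassert>

// A loop shape compilers can typically turn into packed (*_ps) code.
// `__restrict` promises the arrays don't overlap, removing one common
// reason for the compiler to fall back to scalar instructions.
void scale(float* __restrict out, const float* __restrict in, float s, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * s;  // independent iterations: safe to vectorize
}
```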