bangerth / spec-cpuv8-sampleflow

A benchmark for the SPEC CPUv8 test suite based on the deal.II and SampleFlow libraries
GNU Lesser General Public License v2.1

Double check that we don't use platform-specific intrinsics. #21

Closed bangerth closed 1 year ago

bangerth commented 1 year ago

The stock version of deal.II uses intrinsics for vectorization where available. The benchmark should have this disabled. Double check that this is so.

bangerth commented 1 year ago

What I find: Depending on the platform, processor intrinsics are made available via one of the following #include files: <intrin.h>, <altivec.h>, or <x86intrin.h>. The only place in deal.II where any of these are included is the file dealii/include/deal.II/base/vectorization.h:

> egrep -r 'intrin.h|altivec.h|x86intrin.h' dealii/
dealii/include/deal.II/base/vectorization.h:// #    include <intrin.h>
dealii/include/deal.II/base/vectorization.h:// #    include <altivec.h>
dealii/include/deal.II/base/vectorization.h:// // altivec.h defines vector, pixel, bool, but we do not use them, so undefine
dealii/include/deal.II/base/vectorization.h:// #    include <x86intrin.h>

These are all commented out -- the code block looks like this:

#if DEAL_II_VECTORIZATION_WIDTH_IN_BITS > 0

// These error messages try to detect the case that deal.II was compiled with
// a wider instruction set extension as the current compilation unit, for
// example because deal.II was compiled with AVX, but a user project does not
// add -march=native or similar flags, making it fall to SSE2. This leads to
// very strange errors as the size of data structures differs between the
// compiled deal.II code sitting in libdeal_II.so and the user code if not
// detected.
#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 256 && !defined(__AVX__)
#    error \
      "Mismatch in vectorization capabilities: AVX was detected during configuration of deal.II and switched on, but it is apparently not available for the file you are trying to compile at the moment. Check compilation flags controlling the instruction set, such as -march=native."
#  endif
#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 512 && !defined(__AVX512F__)
#    error \
      "Mismatch in vectorization capabilities: AVX-512F was detected during configuration of deal.II and switched on, but it is apparently not available for the file you are trying to compile at the moment. Check compilation flags controlling the instruction set, such as -march=native."
#  endif

// The SPEC benchmark does not use vectorization intrinsics
// #  ifdef _MSC_VER
// #    include <intrin.h>
// #  elif defined(__ALTIVEC__)
// #    include <altivec.h>

// // altivec.h defines vector, pixel, bool, but we do not use them, so undefine
// // them before they make trouble
// #    undef vector
// #    undef pixel
// #    undef bool
// #  else
// #    include <x86intrin.h>
// #  endif

#endif

Moreover, in dealii/include/deal.II/base/config.h, we have

#define DEAL_II_VECTORIZATION_WIDTH_IN_BITS 0

So not only are the includes commented out, the whole block is actually removed by the preprocessor.
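The effect of the width setting can be reproduced in isolation. The following is a minimal sketch, not code from the benchmark: the macro SKETCH_VECTORIZATION_WIDTH_IN_BITS and the function intrinsics_block_active() are hypothetical stand-ins. It shows that nothing inside an `#if ... > 0` group is processed when the macro is 0, so even an `#error` placed there never fires.

```cpp
#include <cassert>

// Hypothetical stand-in for DEAL_II_VECTORIZATION_WIDTH_IN_BITS in config.h.
#define SKETCH_VECTORIZATION_WIDTH_IN_BITS 0

#if SKETCH_VECTORIZATION_WIDTH_IN_BITS > 0
// The preprocessor skips this whole group when the width is 0, so the
// #error below never fires and no #include placed here would be read.
#  error "intrinsics block is active"
#  define SKETCH_INTRINSICS_BLOCK_ACTIVE 1
#else
#  define SKETCH_INTRINSICS_BLOCK_ACTIVE 0
#endif

// Reports whether the guarded block made it past the preprocessor.
constexpr bool intrinsics_block_active()
{
  return SKETCH_INTRINSICS_BLOCK_ACTIVE != 0;
}

// Fails at compile time if the block were ever switched on.
static_assert(!intrinsics_block_active(), "guarded block should be inactive");
```

Changing the stand-in macro to a positive value makes compilation stop at the `#error`, which is the behavior one would want from a guard of this kind.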

bangerth commented 1 year ago

The situation with BOOST is slightly more complicated, principally because the benchmark contains a large chunk of the BOOST source code but actually only uses a small part of it. (It is quite difficult to disentangle BOOST and only include a smaller piece of it because headers depend on each other.) Intrinsics are #included here:

> egrep -r 'intrin.h|altivec.h|x86intrin.h' dealii/bundled/
dealii/bundled/boost-1.70.0/include/boost/atomic/detail/interlocked.hpp:#include <intrin.h>
dealii/bundled/boost-1.70.0/include/boost/atomic/detail/ops_msvc_arm.hpp:#include <intrin.h>
dealii/bundled/boost-1.70.0/include/boost/detail/interlocked.hpp:// VC9 has intrin.h, but it collides with <utility>
dealii/bundled/boost-1.70.0/include/boost/detail/interlocked.hpp:// MinGW-w64 provides intrin.h for both 32 and 64-bit targets.
dealii/bundled/boost-1.70.0/include/boost/detail/interlocked.hpp:// We have to use intrin.h on Cygwin 64
dealii/bundled/boost-1.70.0/include/boost/detail/interlocked.hpp:#include <intrin.h>
dealii/bundled/boost-1.70.0/include/boost/integer/common_factor_rt.hpp:#include <intrin.h>
dealii/bundled/boost-1.70.0/include/boost/math/special_functions/detail/lanczos_sse2.hpp:#include <emmintrin.h>
dealii/bundled/boost-1.70.0/include/boost/math/special_functions/next.hpp:#include "xmmintrin.h"
dealii/bundled/boost-1.70.0/include/boost/multiprecision/detail/bitscan.hpp:#include <intrin.h>
dealii/bundled/boost-1.70.0/include/boost/smart_ptr/detail/atomic_count_sync.hpp:# include <ia64intrin.h>
dealii/bundled/boost-1.70.0/include/boost/smart_ptr/detail/sp_counted_base_sync.hpp:# include <ia64intrin.h>
dealii/bundled/boost-1.70.0/include/boost/smart_ptr/detail/sp_interlocked.hpp:// VC9 has intrin.h, but it collides with <utility>
dealii/bundled/boost-1.70.0/include/boost/smart_ptr/detail/sp_interlocked.hpp:// MinGW-w64 provides intrin.h for both 32 and 64-bit targets.
dealii/bundled/boost-1.70.0/include/boost/smart_ptr/detail/sp_interlocked.hpp:// We have to use intrin.h on Cygwin 64
dealii/bundled/boost-1.70.0/include/boost/smart_ptr/detail/sp_interlocked.hpp:#include <intrin.h>
dealii/bundled/boost-1.70.0/include/boost/smart_ptr/detail/spinlock_sync.hpp:# include <ia64intrin.h>
dealii/bundled/boost-1.70.0/include/boost/thread/win32/interlocked_read.hpp:#include <intrin.h>

Some of these files really make use of intrinsics, like this one in interlocked_read.hpp:

#elif defined(_MSC_VER) && _MSC_VER >= 1700 && (defined(_M_ARM) || defined(_M_ARM64))

#include <intrin.h>

namespace boost
{
    namespace detail
    {
        inline long interlocked_read_acquire(long volatile* x) BOOST_NOEXCEPT
        {
            long const res=__iso_volatile_load32((const volatile __int32*)x);
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
            __dmb(0xB); // _ARM_BARRIER_ISH, see armintr.h from MSVC 11 and later
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
            return res;
        }
        inline void* interlocked_read_acquire(void* volatile* x) BOOST_NOEXCEPT
        {
            void* const res=
#if defined(_M_ARM64)
                (void*)__iso_volatile_load64((const volatile __int64*)x);
#else
                (void*)__iso_volatile_load32((const volatile __int32*)x);
#endif
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
            __dmb(0xB); // _ARM_BARRIER_ISH, see armintr.h from MSVC 11 and later
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
            return res;
        }

        inline void interlocked_write_release(long volatile* x,long value) BOOST_NOEXCEPT
        {
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
            __dmb(0xB); // _ARM_BARRIER_ISH, see armintr.h from MSVC 11 and later
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
            __iso_volatile_store32((volatile __int32*)x, (__int32)value);
        }
        inline void interlocked_write_release(void* volatile* x,void* value) BOOST_NOEXCEPT
        {
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
            __dmb(0xB); // _ARM_BARRIER_ISH, see armintr.h from MSVC 11 and later
            BOOST_THREAD_DETAIL_COMPILER_BARRIER();
#if defined(_M_ARM64)
            __iso_volatile_store64((volatile __int64*)x, (__int64)value);
#else
            __iso_volatile_store32((volatile __int32*)x, (__int32)value);
#endif
        }
    }
}

There are also places such as atomic_count_sync.hpp:

#if defined( __ia64__ ) && defined( __INTEL_COMPILER )
# include <ia64intrin.h>
#endif

namespace boost
{

namespace detail
{

class atomic_count
{
public:

    explicit atomic_count( long v ) : value_( v ) {}

    long operator++()
    {
        return __sync_add_and_fetch( &value_, 1 );
    }
[...]

And bitscan.hpp:

#if (defined(BOOST_MSVC) || (defined(__clang__) && defined(__c2__)) || (defined(BOOST_INTEL) && defined(_MSC_VER))) && (defined(_M_IX86) || defined(_M_X64))

#pragma intrinsic(_BitScanForward,_BitScanReverse)

BOOST_FORCEINLINE unsigned find_lsb(unsigned long mask, const mpl::int_<1>&)
{
   unsigned long result;
   _BitScanForward(&result, mask);
   return result;
}

Similar things can be found in the other files listed above.

I don't quite know how to approach verifying that the benchmark doesn't use any intrinsics, in particular because there is no canonical list of intrinsics that are declared in these header files and because the inclusion of these header files is only guarded by checks for specific compilers, but not by specific preprocessor defines that could be disabled. My best guess is that the benchmark actually runs into one or the other of these places, but I don't know how to check.
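One possible way to narrow this down, offered as a sketch rather than a verified procedure: run each benchmark source file through the compiler's preprocessor only (for GCC and Clang, the `-E` option) and look at the linemarkers of the form `# 12 "/path/to/header.h" 1 3` that record which headers were actually entered. The helper below scans such output for headers whose names end in intrin.h or altivec.h. The function names are hypothetical, and the linemarker format is assumed to be the GCC/Clang one; other compilers use different conventions.

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Returns true if `path` ends with `suffix`.
inline bool ends_with(const std::string &path, const std::string &suffix)
{
  return path.size() >= suffix.size() &&
         path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Collects the headers named in GCC/Clang-style linemarkers whose file
// names end in intrin.h or altivec.h. `preprocessed` is assumed to hold
// the output of a preprocess-only compiler run.
inline std::set<std::string> intrinsic_headers_seen(std::istream &preprocessed)
{
  std::set<std::string> hits;
  std::string           line;
  while (std::getline(preprocessed, line))
    {
      // Linemarkers start with "# " followed by a line number.
      if (line.size() < 2 || line[0] != '#' || line[1] != ' ')
        continue;
      const std::size_t open = line.find('"');
      if (open == std::string::npos)
        continue;
      const std::size_t close = line.find('"', open + 1);
      if (close == std::string::npos)
        continue;
      const std::string path = line.substr(open + 1, close - open - 1);
      if (ends_with(path, "intrin.h") || ends_with(path, "altivec.h"))
        hits.insert(path);
    }
  return hits;
}
```

A hit only shows that the header was pulled into a translation unit, possibly by the standard library rather than by BOOST, and not that any intrinsic is actually called. An empty result, on the other hand, would rule out use of the intrinsics declared in these headers for that compiler and platform.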

What I can say is that as far as I know, BOOST uses vectorization/floating point intrinsics (rather than the bitcount/atomic/... integer intrinsics shown above) in only one place: lanczos_sse2.hpp:

#include <emmintrin.h>

#if defined(__GNUC__) || defined(__PGI) || defined(__SUNPRO_CC)
#define ALIGN16 __attribute__((__aligned__(16)))
#else
#define ALIGN16 __declspec(align(16))
#endif

namespace boost{ namespace math{ namespace lanczos{

template <>
inline double lanczos13m53::lanczos_sum<double>(const double& x)
{
   static const ALIGN16 double coeff[26] = {
      static_cast<double>(2.506628274631000270164908177133837338626L),
      static_cast<double>(1u),
      static_cast<double>(210.8242777515793458725097339207133627117L),
      static_cast<double>(66u),
      static_cast<double>(8071.672002365816210638002902272250613822L),
      static_cast<double>(1925u),
      static_cast<double>(186056.2653952234950402949897160456992822L),
      static_cast<double>(32670u),
      static_cast<double>(2876370.628935372441225409051620849613599L),
      static_cast<double>(357423u),
      static_cast<double>(31426415.58540019438061423162831820536287L),
      static_cast<double>(2637558u),
      static_cast<double>(248874557.8620541565114603864132294232163L),
      static_cast<double>(13339535u),
      static_cast<double>(1439720407.311721673663223072794912393972L),
      static_cast<double>(45995730u),
      static_cast<double>(6039542586.35202800506429164430729792107L),
      static_cast<double>(105258076u),
      static_cast<double>(17921034426.03720969991975575445893111267L),
      static_cast<double>(150917976u),
      static_cast<double>(35711959237.35566804944018545154716670596L),
      static_cast<double>(120543840u),
      static_cast<double>(42919803642.64909876895789904700198885093L),
      static_cast<double>(39916800u),
      static_cast<double>(23531376880.41075968857200767445163675473L),
      static_cast<double>(0u)
   };
   __m128d vx = _mm_load1_pd(&x);
   __m128d sum_even = _mm_load_pd(coeff);
   __m128d sum_odd = _mm_load_pd(coeff+2);
  ...

This file is #included in only one place:

> grep -r lanczos_sse2.hpp dealii/
dealii/bundled/boost-1.70.0/include/boost/math/special_functions/lanczos.hpp:#include <boost/math/special_functions/detail/lanczos_sse2.hpp>

lanczos.hpp itself is included from a number of other places, and so it is difficult to see whether its contents are used in the benchmark. That said, the classes declared in it are all called lanczos_something and I have verified that we don't use any of these in the deal.II code base.

bangerth commented 1 year ago

In short: deal.II itself does not use intrinsics or assembler inlines. I cannot say for sure whether BOOST does, but I am confident that BOOST doesn't use vectorization intrinsics that make it into the benchmark.

bangerth commented 1 year ago

There are also some places where deal.II explicitly references AVX instructions. These are:

$ grep -rl AVX *
dealii/source/matrix_free/mapping_info.cc
dealii/source/base/utilities.cc
dealii/bundled/boost-1.70.0/include/boost/predef/hardware/simd/x86/versions.h
dealii/bundled/boost-1.70.0/include/boost/predef/hardware/simd/x86.h
dealii/bundled/boost-1.70.0/include/boost/atomic/detail/ops_msvc_x86.hpp
dealii/bundled/boost-1.70.0/include/boost/atomic/detail/ops_gcc_x86_dcas.hpp
dealii/include/deal.II/fe/mapping_q.h
dealii/include/deal.II/base/vectorization.h
dealii/include/deal.II/base/numbers.h
dealii/include/deal.II/base/config.h
dealii/include/deal.II/base/utilities.h
dealii/include/deal.II/base/config.h.in

The guarding preprocessor #define is actually called __AVX__ (and __AVX512F__). Here are the places where these are used:

dealii/include/deal.II/base/vectorization.h:// #ifdef __AVX512F__
dealii/include/deal.II/base/vectorization.h:// #elif defined (__AVX__)
dealii/include/deal.II/base/vectorization.h:// In addition to checking the flags __AVX512F__, __AVX__ and __SSE2__, a CMake
dealii/include/deal.II/base/vectorization.h:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 256 && !defined(__AVX__)
dealii/include/deal.II/base/vectorization.h:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 512 && !defined(__AVX512F__)
dealii/include/deal.II/base/vectorization.h:// for safety, also check that __AVX512F__ is defined in case the user manually
dealii/include/deal.II/base/vectorization.h:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 512 && defined(__AVX512F__)
dealii/include/deal.II/base/vectorization.h:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 256 && defined(__AVX__)
dealii/include/deal.II/base/vectorization.h:#    ifdef __AVX2__
dealii/include/deal.II/base/vectorization.h:#    ifdef __AVX2__
dealii/include/deal.II/base/vectorization.h:#if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 256 && defined(__AVX__)
dealii/include/deal.II/base/vectorization.h:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 512 && defined(__AVX512F__)
dealii/include/deal.II/base/vectorization.h:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 256 && defined(__AVX__)
dealii/include/deal.II/base/numbers.h:#elif DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 512 && defined(__AVX512F__)
dealii/include/deal.II/base/numbers.h:#elif DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 256 && defined(__AVX__)
dealii/source/matrix_free/mapping_info.cc:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 256 && defined(__AVX__)
dealii/source/matrix_free/mapping_info.cc:#  if DEAL_II_VECTORIZATION_WIDTH_IN_BITS >= 512 && defined(__AVX512F__)

As mentioned above, DEAL_II_VECTORIZATION_WIDTH_IN_BITS == 0 for this benchmark. So the only places that need to be checked are the ones that do not reference DEAL_II_VECTORIZATION_WIDTH_IN_BITS and are not already commented out, which leaves only the two #ifdef __AVX2__ lines in vectorization.h. I've confirmed that both sit inside blocks that are ultimately guarded by DEAL_II_VECTORIZATION_WIDTH_IN_BITS and so are inactive as well.
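The nesting argument can be illustrated with a small sketch that mimics the shape of the guards listed above. It is not taken from deal.II: SKETCH_VECTORIZATION_WIDTH_IN_BITS and selected_code_path() are hypothetical names. Because the outer conditions require a width of at least 256 or 512 bits, the nested `#ifdef __AVX2__` is never evaluated when the width is 0, whatever instruction-set flags the compiler is given.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the value set in config.h for this benchmark.
#define SKETCH_VECTORIZATION_WIDTH_IN_BITS 0

// Returns the name of the code path the preprocessor selected.
inline std::string selected_code_path()
{
#if SKETCH_VECTORIZATION_WIDTH_IN_BITS >= 512 && defined(__AVX512F__)
  return "avx512";
#elif SKETCH_VECTORIZATION_WIDTH_IN_BITS >= 256 && defined(__AVX__)
#  ifdef __AVX2__
  // Nested check: only evaluated if the enclosing #elif is taken.
  return "avx2";
#  else
  return "avx";
#  endif
#else
  // With a width of 0, this is the only branch that survives.
  return "scalar";
#endif
}
```

Compiling this with or without flags such as -march=native yields the same result, since the width test fails before any of the instruction-set macros are consulted.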