error256 commented 3 years ago

113 reminded me about this thing... The target instruction set or architecture isn't specified now, so AVX intrinsics don't work in C++ and C:

error: always_inline function '_mm256_add_pd' requires target feature 'avx', but would be inlined into function 'add' that is compiled without support for 'avx'

I think it would be reasonable to use -march=native, which allows using the available intrinsics and also optimizes for the current CPU architecture.

kazk commented 3 years ago

I can add -march=x86-64 -mtune=generic. -march=native will be too specific to the host VM, and I don't want that.

See https://stackoverflow.com/a/54163496 and https://stackoverflow.com/a/10134204

error256 commented 3 years ago

-march=x86-64 -mtune=generic doesn't really change anything here; I think it's the default. OK, I've solved my original problem with intrinsics using __attribute__((__target__("avx"))), so it doesn't matter that much now, but anyway... Solutions are compiled to be executed just once on the same type of system. So what exactly is the issue here? Is it that there can possibly be significantly different CPUs, so that the difference in performance with native will be greater than without it? Technically, JIT compilers produce native code (at least by default, I think), so if this is a problem, I think it already exists for certain other languages, and that's why C# or Java code can sometimes, at least in theory, be faster than C. Or am I missing something else here?
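
For readers who have not used the attribute, here is a minimal sketch of the __target__ approach mentioned above. It is not the code from the kata: the function names add4, add4_avx and add4_scalar are made up for illustration, and the run-time check with __builtin_cpu_supports is an extra safety net that was not part of the original problem. It assumes an x86 target and a reasonably recent GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>

/* AVX is enabled for this one function only; the rest of the file is
   still compiled for the default target (no -mavx, no -march=native). */
__attribute__((__target__("avx")))
static void add4_avx(const double *a, const double *b, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
}

/* Plain C fallback for CPUs (or VMs) that do not expose AVX. */
static void add4_scalar(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* Hypothetical entry point: takes the AVX path only when the CPU the
   program is running on reports AVX. */
void add4(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    if (__builtin_cpu_supports("avx"))
        add4_avx(a, b, out);
    else
        add4_scalar(a, b, out);
}
```

Because only add4_avx carries the attribute, the "requires target feature 'avx'" error from the first comment should not appear even though no -m flag is passed for the whole program.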

kazk commented 3 years ago

If -march=x86-64 doesn't do anything, forget it. I thought it still enabled what you wanted.

> is it that there can possibly be significantly different CPUs that the difference in performance with native will be greater than without it?

No, I wasn't worried about performance much. I'll try explaining my thoughts based on my limited knowledge around this area. Please bear with me and let me know if I'm misunderstanding.

Assumptions:

-march=native adds compiler options very specific to the CPU
there are features only available to a specific CPU
the same code may not compile depending on that (not just unoptimized)
~~-march=x86-64 covers what you want~~

I guess my third assumption was incorrect?

Concern: Having code that depends on the current host VM CPU type. I don't change VM types often, but it's another variable that needs to be kept in mind if it was depended on, so I wanted to avoid it if possible (remember, I thought -march=x86-64 covered what you wanted). I also thought this might make testing the change against published kata more difficult.

Just as an example, on Google Cloud there are VM machine types that trade control for lower cost, for example E2, which doesn't guarantee the processor type. If we used this machine type with -march=native, submissions could be executed on different processor types (any of Intel Skylake, Broadwell, Haswell, and AMD EPYC Rome processors). Can this cause the same code to fail to compile depending on the VM it happened to be compiled on? Don't worry about performance differences. (I'm not planning to do this.)

I guess my point is that -march=native expands to many compiler options that are difficult to keep track of.

gcc v4.6.3 in 64-bit Ubuntu 12.04 which was running as a VMware Player guest. The VMware VM was running in Windows 7 on a desktop using an Intel Pentium Dual-Core E6500 CPU

StackOverflow answer

-march=native expanded to:

-march=core2 -mtune=core2 -mcx16 
-mno-abm -mno-aes -mno-avx -mno-bmi -mno-fma -mno-fma4 -mno-lwp 
-mno-movbe -mno-pclmul -mno-popcnt -mno-sse4.1 -mno-sse4.2 
-mno-tbm -mno-xop -msahf --param l1-cache-line-size=64 
--param l1-cache-size=32 --param l2-cache-size=2048

Instead of -march=native, is it possible to add some flags explicitly to get what you want?

kazk commented 3 years ago

To be clear, if you think -march=native is a safe default for our use case, I'm not against it. I trust you more on this.

error256 commented 3 years ago

> To be clear, if you think -march=native is a safe default for our use case, I'm not against it. I trust you more on this.

I've never used it for anything serious, I've never even used C++ for anything serious, so don't just trust me.

Yes, code compiled with -march=native is supposed to be executed on CPUs with the same architecture as where it's compiled. I didn't know that it added the cache options, so it looks like it's supposed to be the same system, but I don't see how they can affect anything apart from optimization.

> Instead of -march=native, is it possible to add some flags explicitly to get what you want?

Partially... Now that I've found the __target__ function attribute, it's not too important. Instruction sets can be enabled separately for the whole program, but the compatibility questions are mostly the same as with -march. If there's a known lower bound on the CPUs, instruction sets can be enabled with flags like -mavx or -mavx2.

> the same code may not compile depending on that (not just unoptimized)

> Can this cause the same code to fail to compile depending on the VM it happened to be compiled on?

Yes, if it uses functions that rely on instruction sets that may or may not be available, but...

> I also thought this might make testing the change against published kata more difficult.

Things that work or don't work depending on the CPU model are already possible. Even without the __target__ attribute there's inline asm; it's just much less convenient. So not much would change here.

uniapi commented 3 years ago

There should be no problem with adding the -mavx -mavx2 flags if the machine is no older than 2015 Q3, for both AMD and Intel. Adding -march=native, on the other hand, may fail to enable AVX and AVX2 support, and in that case compilation stops with errors like these:

error: inlining failed in call to ‘always_inline’ ‘_mm256_add_epi64’: target specific option mismatch
  126 | _mm256_add_epi64 (__m256i __A, __m256i __B)

error: inlining failed in call to ‘always_inline’ ‘_mm256_and_si256’: target specific option mismatch
  179 | _mm256_and_si256 (__m256i __A, __m256i __B)

error: inlining failed in call to ‘always_inline’ ‘_mm256_cmpeq_epi64’: target specific option mismatch
  252 | _mm256_cmpeq_epi64 (__m256i __A, __m256i __B)

Just add: -mavx -mavx2

error256 commented 3 years ago

@uniapi I don't see how any other -m* option can be safer than native. native should always match the current platform, so it should be the safest of all the -m* options as long as the program is compiled for one-time usage on the same machine.

> Because adding -march=native may not succeed to enable AVX and AVX2 support

What circumstances are you talking about? It will only fail when there really is no AVX/AVX2 support. Do I misunderstand something?

uniapi commented 3 years ago

@error256 OK! One of my machines is a Skylake and gcc is 11.1.0. You surely know that Skylake supports AVX! Yet when I compile with the flag -march=native I get the following errors:

error: inlining failed in call to ‘always_inline’ ‘_mm256_add_epi64’: target specific option mismatch
  126 | _mm256_add_epi64 (__m256i __A, __m256i __B)
      | ^~~~~~~~~~~~~~~~
error: inlining failed in call to ‘always_inline’ ‘_mm256_and_si256’: target specific option mismatch
  179 | _mm256_and_si256 (__m256i __A, __m256i __B)
      | ^~~~~~~~~~~~~~~~
error: inlining failed in call to ‘always_inline’ ‘_mm256_cmpeq_epi64’: target specific option mismatch
  252 | _mm256_cmpeq_epi64 (__m256i __A, __m256i __B)
      | ^~~~~~~~~~~~~~~~~~

But everything is OK when I compile with: -mavx -mavx2

Yes, -march=native is completely safe, but it may fail to enable AVX and AVX2, as I've just explained above. And it seems that the Intel Pentium Dual-Core E6500 CPU does not support AVX, given the -mno-avx in the expansion quoted earlier.

error256 commented 3 years ago

@uniapi First, it looks like you're using gcc. But the behavior should be the same: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

> Using -march=native enables all instruction subsets supported by the local machine...

So I don't know how you got that error. Are you using a VM? Does lscpu report avx in the flags? What march does gcc -v -E - -march=native </dev/null 2>&1 | grep march show?

uniapi commented 3 years ago

@error256

> Using -march=native enables all instruction subsets supported by the local machine...

This is not true!

Here are the options for -march=native on x86-64 Skylake:

gcc -Q -march=native --help=target
The following options are target specific:
  -m128bit-long-double              [enabled]
  -m64                              [enabled]
  -m80387                           [enabled]
  -mabi=                            sysv
  -mabm                             [enabled]
  -maddress-mode=                   long
  -maes                             [enabled]
  -malign-data=                     compat
  -malign-functions=                0
  -malign-jumps=                    0
  -malign-loops=                    0
  -malign-stringops                 [enabled]
  -march=                           skylake
  -masm=                            att
  -mavx                             [enabled]
  -mbranch-cost=<0,5>               3
  -mclflushopt                      [enabled]
  -mcmodel=                         small
  -mcpu=
  -mcx16                            [enabled]
  -mfancy-math-387                  [enabled]
  -mfentry-name=
  -mfentry-section=
  -mfp-ret-in-387                   [enabled]
  -mfpmath=                         sse
  -mfunction-return=                keep
  -mfused-madd                      -ffp-contract=fast
  -mfxsr                            [enabled]
  -mhard-float                      [enabled]
  -mhle                             [enabled]
  -mieee-fp                         [enabled]
  -mincoming-stack-boundary=        0
  -mindirect-branch=                keep
  -minstrument-return=              none
  -mintel-syntax                    -masm=intel
  -mlarge-data-threshold=<number>   65536
  -mlong-double-80                  [enabled]
  -mlzcnt                           [enabled]
  -mmemcpy-strategy=
  -mmemset-strategy=
  -mmmx                             [enabled]
  -mmovbe                           [enabled]
  -mpclmul                          [enabled]
  -mpopcnt                          [enabled]
  -mprefer-avx128                   -mprefer-vector-width=128
  -mprefer-vector-width=            none
  -mpreferred-stack-boundary=       0
  -mprfchw                          [enabled]
  -mpush-args                       [enabled]
  -mrdrnd                           [enabled]
  -mrdseed                          [enabled]
  -mrecip=
  -mred-zone                        [enabled]
  -mregparm=                        6
  -msahf                            [enabled]
  -msse                             [enabled]
  -msse2                            [enabled]
  -msse3                            [enabled]
  -msse4                            [enabled]
  -msse4.1                          [enabled]
  -msse4.2                          [enabled]
  -msse5                            -mavx
  -mssse3                           [enabled]
  -mstack-protector-guard-offset=
  -mstack-protector-guard-reg=
  -mstack-protector-guard-symbol=
  -mstack-protector-guard=          global
  -mstringop-strategy=              [default]
  -mstv                             [enabled]
  -mtls-dialect=                    gnu
  -mtune-ctrl=
  -mtune=                           skylake
  -mveclibabi=                      [default]
  -mvzeroupper                      [enabled]
  -mxsave                           [enabled]

And this one for -march=skylake on the same platform:

gcc -Q -march=skylake --help=target
The following options are target specific:
  -m128bit-long-double              [enabled]
  -m64                              [enabled]
  -m80387                           [enabled]
  -mabi=                            sysv
  -maddress-mode=                   long
  -madx                             [enabled]
  -maes                             [enabled]
  -malign-data=                     compat
  -malign-functions=                0
  -malign-jumps=                    0
  -malign-loops=                    0
  -malign-stringops                 [enabled]
  -march=                           skylake
  -masm=                            att
  -mavx                             [enabled]
  -mavx2                            [enabled]
  -mbmi                             [enabled]
  -mbmi2                            [enabled]
  -mbranch-cost=<0,5>               3
  -mclflushopt                      [enabled]
  -mcmodel=                         small
  -mcpu=
  -mcx16                            [enabled]
  -mf16c                            [enabled]
  -mfancy-math-387                  [enabled]
  -mfentry-name=
  -mfentry-section=
  -mfma                             [enabled]
  -mfp-ret-in-387                   [enabled]
  -mfpmath=                         sse
  -mfsgsbase                        [enabled]
  -mfunction-return=                keep
  -mfused-madd                      -ffp-contract=fast
  -mfxsr                            [enabled]
  -mhard-float                      [enabled]
  -mhle                             [enabled]
  -mieee-fp                         [enabled]
  -mincoming-stack-boundary=        0
  -mindirect-branch=                keep
  -minstrument-return=              none
  -mintel-syntax                    -masm=intel
  -mlarge-data-threshold=<number>   65536
  -mlong-double-80                  [enabled]
  -mlzcnt                           [enabled]
  -mmemcpy-strategy=
  -mmemset-strategy=
  -mmmx                             [enabled]
  -mmovbe                           [enabled]
  -mpclmul                          [enabled]
  -mpopcnt                          [enabled]
  -mprefer-avx128                   -mprefer-vector-width=128
  -mprefer-vector-width=            none
  -mpreferred-stack-boundary=       0
  -mprfchw                          [enabled]
  -mpush-args                       [enabled]
  -mrdrnd                           [enabled]
  -mrdseed                          [enabled]
  -mrecip=
  -mred-zone                        [enabled]
  -mregparm=                        6
  -msahf                            [enabled]
  -msgx                             [enabled]
  -msse                             [enabled]
  -msse2                            [enabled]
  -msse3                            [enabled]
  -msse4                            [enabled]
  -msse4.1                          [enabled]
  -msse4.2                          [enabled]
  -msse5                            -mavx
  -mssse3                           [enabled]
  -mstack-protector-guard-offset=
  -mstack-protector-guard-reg=
  -mstack-protector-guard-symbol=
  -mstack-protector-guard=          global
  -mstringop-strategy=              [default]
  -mstv                             [enabled]
  -mtls-dialect=                    gnu
  -mtune-ctrl=
  -mtune=                           skylake
  -mveclibabi=                      [default]
  -mvzeroupper                      [enabled]
  -mxsave                           [enabled]
  -mxsavec                          [enabled]
  -mxsaveopt                        [enabled]
  -mxsaves                          [enabled]

So the native list is 10 lines shorter, because not all instruction sets are enabled! Notice that avx2 is not enabled with native but is turned on with skylake. That is, the quoted statement is not true!
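
Instead of diffing the two option dumps, the same difference can be observed from inside a program through the predefined macros __AVX__ and __AVX2__, which GCC and Clang define when the corresponding instruction set is enabled for the compilation. The following is only a sketch; the names enabled_isa_mask, print_enabled_isa, ISA_AVX and ISA_AVX2 are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

enum { ISA_AVX = 1, ISA_AVX2 = 2 };

/* Returns a bit mask of the instruction sets the compiler was told to
   use for this translation unit (a compile-time property, not a check
   of the CPU the program later runs on). */
unsigned enabled_isa_mask(void)
{
    unsigned mask = 0;
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __AVX2__
    mask |= ISA_AVX2;
#endif
    /* With GCC and Clang, enabling AVX2 also enables AVX. */
    assert(!(mask & ISA_AVX2) || (mask & ISA_AVX));
    return mask;
}

void print_enabled_isa(void)
{
    unsigned mask = enabled_isa_mask();
    printf("AVX:  %s\n", (mask & ISA_AVX) ? "enabled" : "not enabled");
    printf("AVX2: %s\n", (mask & ISA_AVX2) ? "enabled" : "not enabled");
}
```

Building this once with -march=native and once with -march=skylake and calling print_enabled_isa() from main should show the same AVX2 difference as the two listings above.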

uniapi commented 3 years ago

The OS has to have AVX support enabled. That is also why gdb does not show the ymm registers while debugging, even though you may successfully run AVX code, as I do!

error256 commented 3 years ago

Interesting observation, but it doesn't change the fact that, according to the documentation, native should enable all supported instruction sets, so it's probably a bug. I don't know your OS, but I've found this bug for gcc in macOS. But that's gcc again. What about clang?

uniapi commented 3 years ago

It's not a bug! The same thing happens with clang: 'native' throws an error and avx2 does not! And gdb is version 10.2, but it does not recognize the ymm registers to print them.

It's because the OS support is not enabled for AVX2. My OS is SunOS 11.3 or x86_64-pc-solaris2.11.

And here is the output you asked for from gcc -v -E - -march=native </dev/null 2>&1 | grep march:

Configured with: ../gcc/configure --prefix=/usr/gcc/11.1 --bindir=/usr/gcc/11.1/bin --sbindir=/usr/gcc/11.1/bin --libdir=/usr/gcc/11.1/lib --libexecdir=/usr/gcc/11.1/lib --mandir=/usr/gcc/11.1/share/man --infodir=/usr/gcc/11.1/share/info --with-gmp-lib=/usr/lib/amd64 --with-gmp-include=/usr/include --with-mpfr-lib=/usr/lib/amd64 --with-mpfr-include=/usr/include --with-mpc-lib=/usr/lib/amd64 --with-mpc-include=/usr/include --with-isl-lib=/usr/lib/amd64 --with-isl-include=/usr/include --without-gnu-ld --with-ld=/usr/bin/amd64/ld --with-gnu-as --with-as=/usr/binutils/2.35/bin/as --with-system-zlib --disable-werror --enable-multilib --enable-languages=c,c++,fortran,go,objc,obj-c++ --build=x86_64-pc-solaris2.11 CFLAGS='-O2 -march=skylake' CXXFLAGS='-O2 -march=skylake'
COLLECT_GCC_OPTIONS='-v' '-E' '-march=native'
 /usr/gcc/11.1-skylake/bin/../lib/gcc/x86_64-pc-solaris2.11/11.1.0/cc1 -E -quiet -v -iprefix /usr/gcc/11.1-skylake/bin/../lib/gcc/x86_64-pc-solaris2.11/11.1.0/ - -march=skylake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mno-avx2 -mno-sse4a -mno-fma4 -mno-xop -mno-fma -mno-avx512f -mno-bmi -mno-bmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -mno-adx -mabm -mno-cldemote -mclflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mno-f16c -mno-fsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mno-xsavec -mno-xsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=3072 -mtune=skylake -dumpbase -
COLLECT_GCC_OPTIONS='-v' '-E' '-march=native'

error256 commented 3 years ago

It means the behavior of -march=native is actually correct in this case and the actual problem is that "the OS support is not enabled for AVX2", which would need to be dealt with first in any serious scenario.

uniapi commented 3 years ago

Exactly! So the first option is to check the OS support and add -march=native if it is enabled. The second option, if the OS support is disabled, is to check the CPU info (probably hardcoded) for AVX2 and add -mavx -mavx2 if it is present.
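
To make the two checks concrete, here is a rough sketch of how a program can query them separately on x86 with GCC or Clang. On x86, the CPU advertises AVX2 through CPUID, while the OS signals that it saves and restores the YMM registers through the OSXSAVE bit and XCR0. The function names are hypothetical, and this shows only the generic mechanism; it may not be the whole story for the Solaris machine discussed above.

```c
#include <assert.h>
#include <cpuid.h>

/* CPUID leaf 1, ECX bit 27 (OSXSAVE): the OS has turned on XSAVE/XGETBV. */
static int os_uses_xsave(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 27) & 1u;
}

/* Reads the low 32 bits of XCR0. Must only be called when OSXSAVE is
   set, because XGETBV faults otherwise. */
static unsigned read_xcr0_low(void)
{
    unsigned lo, hi;
    assert(os_uses_xsave());
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    (void)hi;
    return lo;
}

/* XCR0 bits 1 and 2: the OS saves XMM and YMM state on context switches. */
int os_enables_ymm(void)
{
    if (!os_uses_xsave())
        return 0;
    return (read_xcr0_low() & 0x6u) == 0x6u;
}

/* CPUID leaf 7, sub-leaf 0, EBX bit 5: the CPU implements AVX2. */
int cpu_reports_avx2(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 5) & 1u;
}

/* AVX2 code is only safe to run when both conditions hold. */
int avx2_usable(void)
{
    return cpu_reports_avx2() && os_enables_ymm();
}
```

A build script could run a tiny program like this on the host and choose the flags from its result, which is roughly the "check first, then add flags" idea described in the two options.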

uniapi commented 3 years ago

Here is a hack for old Ubuntu; I'm trying to find a similar hack for my OS.

Urfoex commented 3 years ago

Interesting topic and interesting find. But isn't this for a coding challenge website? I wouldn't make it all science and hacks just to get things running in weird scenarios. It should work in the general case without a huge support and maintenance cost. And to point out what @kazk said at https://github.com/codewars/runner/issues/118#issuecomment-829587954: this thing runs in the cloud without a fixed CPU. If you go specific, special and hacky, your code will work or not from one run to the next.

Phoronix sometimes does benchmarks on compiler optimizations:
https://www.phoronix.com/scan.php?page=article&item=gcc-10900k-compiler&num=1
https://www.phoronix.com/scan.php?page=article&item=clang-12-opt&num=1
https://www.phoronix.com/scan.php?page=article&item=amd-znver3-gcc11&num=1
https://www.phoronix.com/scan.php?page=article&item=gcc10-gcc11-5950x&num=1
https://www.phoronix.com/scan.php?page=article&item=clang-12-5950x&num=1

C++ is a beast (not just) when it comes to optimizations. Without any change to the code, the right combination of compiler, flags and CPU can make it run way faster, or way slower. (Not to mention profile-guided optimization.)

It is already annoying when challenge code passes or fails because of a time constraint and running on a faster or slower machine. But having it pass or fail because the machine does or doesn't support some CPU feature is even more annoying, I'd say.

uniapi commented 3 years ago

@Urfoex I just shared the link for those who may have similar problems... But the Codewars solution is much easier, as I've described in the two scenarios.

error256 commented 3 years ago

My suggestion is about enabling whatever the compiler thinks is available for the current system, not specifically about AVX/AVX2, which was supposed to be enabled automatically by that for any modern CPU. But it turns out that isn't necessarily the case.

> My OS is SunOS 11.3

Does it even happen on Linux? (=> Is this observation even important in this context?) What exactly is the OS support for AVX2, and why can it even be turned off? Surely there must be a valid reason...

uniapi commented 3 years ago

This scenario is also observed when running on virtual machines, even if you have a cool CPU.

But I share the opinion that all instruction sets should be enabled, or at least AVX/AVX2.

So then (according to you) the best scenario is to parse the target architecture and inject it into the compiler's -march flag:
If on skylake then -march=skylake
If on skylake-avx512 then -march=skylake-avx512
If on pentium4m then -march=pentium4m
and so on...

And it seems this is the best that we could get from the running CPU!

error256 commented 3 years ago

OS, VM, whatever, the question still holds. Why doesn't the virtual machine report AVX2? Is there a reason? Is it an option?

uniapi commented 3 years ago

@error256 I'm not 100% sure about that. I do know that you won't be able to use XSAVE when using Hyper-V, and I also know that it's possible (at least it seems possible) to turn on AVX2 support with VBoxManage. So I still don't have the answer! Probably the developers of those operating systems would know))

And yet! How did you manage to use AVX2 intrinsics on Codewars? What constructions or attributes did you use?

error256 commented 3 years ago

> How did you succeed to use AVX2 intrinsics on Codewars? What constructions or attributes did you use?

https://github.com/codewars/runner/issues/118#issuecomment-829545937 https://www.codewars.com/kumite/608aea3fa35c7c003251bb42?sel=608b0afb4865f700290a610f (AVX, AVX2 - no difference here.)

uniapi commented 3 years ago

@error256 thanks for your link! And yeah, you are right! It's not appropriate to hardcode -mavx2, because it's C and it should be portable! If running on ARM we could #include <arm_neon.h>. So I vote for adding -march=native.
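
As an illustration of the portability point, the sketch below selects the intrinsics header from the macros the compiler predefines for the chosen target, and falls back to plain C otherwise. The function name add8f is hypothetical, and this is only one possible layout, assuming GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#elif defined(__AVX__)
#include <immintrin.h>
#endif

/* Adds two arrays of 8 floats using whatever instruction set the
   compiler was told to target (for example via -march=native). */
void add8f(const float *a, const float *b, float *out)
{
    assert(a && b && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* NEON registers hold 4 floats, so the work is done in two halves. */
    vst1q_f32(out,     vaddq_f32(vld1q_f32(a),     vld1q_f32(b)));
    vst1q_f32(out + 4, vaddq_f32(vld1q_f32(a + 4), vld1q_f32(b + 4)));
#elif defined(__AVX__)
    /* One 256-bit AVX register holds all 8 floats. */
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
#else
    /* Scalar fallback when no vector extension is enabled. */
    for (size_t i = 0; i < 8; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

With this layout the same source builds with or without -march=native; the flag only decides which branch the preprocessor keeps.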

codewars / runner

Compile C/C++ with march=native #118