error256 opened 3 years ago
I can add `-march=x86-64 -mtune=generic`. `-march=native` will be too specific to the host VM, and I don't want that.
See https://stackoverflow.com/a/54163496 and https://stackoverflow.com/a/10134204
`-march=x86-64 -mtune=generic` doesn't really change anything here; I think it's the default.
OK, I've solved my original problem with intrinsics with `__attribute__((__target__("avx")))`, so it doesn't matter that much now, but anyway...
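For reference, a minimal sketch of that workaround (the kernel below is a made-up example, not the original solution):

```c
#include <immintrin.h>

/* The target attribute enables AVX for this one function even when
   the translation unit is compiled without -mavx. */
__attribute__((__target__("avx")))
void add4(const double *a, const double *b, double *out) {
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
}
```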
Solutions are compiled to be executed just once on the same type of system. So what exactly is the issue here? Is it that the CPUs can differ significantly enough that the difference in performance with `native` would be greater than without it? Technically, JIT compilers produce native code (at least by default, I think), so if this is a problem, it already exists with certain other languages, and that's why C# or Java code can sometimes, at least in theory, be faster than C. Or am I missing something else here?
If `-march=x86-64` doesn't do anything, forget it. I thought it still enabled what you wanted.
> Is it that the CPUs can differ significantly enough that the difference in performance with `native` would be greater than without it?
No, I wasn't worried about performance much. I'll try explaining my thoughts based on my limited knowledge around this area. Please bear with me and let me know if I'm misunderstanding.
Assumptions:

- `-march=native` adds compiler options very specific to the CPU
- `-march=x86-64` covers what you want

I guess my third assumption was incorrect?
Concern: having code that depends on the current host VM's CPU type. I don't change VM types often, but it's another variable that needs to be kept in mind if we depended on it, so I wanted to avoid it if possible (remember, I thought `-march=x86-64` covered what you wanted). I also thought this might make testing the change against published kata more difficult.
Just as an example, on Google Cloud there are VM machine types that trade cost for less control, for example E2, which doesn't guarantee the processor type. If we used this machine type with `-march=native`, submissions could be executed on different processor types (any of Intel Skylake, Broadwell, Haswell, and AMD EPYC Rome processors). Can this cause the same code to fail to compile depending on the VM it happened to be compiled on? Don't worry about performance differences. (I'm not planning to do this.)
I guess my point is that `-march=native` expands to many compiler options that are difficult to keep track of.
With gcc v4.6.3 on 64-bit Ubuntu 12.04, running as a VMware Player guest (the VM was running in Windows 7 on a desktop with an Intel Pentium Dual-Core E6500 CPU), `-march=native` expanded to:

```
-march=core2 -mtune=core2 -mcx16
-mno-abm -mno-aes -mno-avx -mno-bmi -mno-fma -mno-fma4 -mno-lwp
-mno-movbe -mno-pclmul -mno-popcnt -mno-sse4.1 -mno-sse4.2
-mno-tbm -mno-xop -msahf --param l1-cache-line-size=64
--param l1-cache-size=32 --param l2-cache-size=2048
```
Instead of `-march=native`, is it possible to add some flags explicitly to get what you want?
To be clear, if you think `-march=native` is a safe default for our use case, I'm not against it. I trust you more on this.
> To be clear, if you think `-march=native` is a safe default for our use case, I'm not against it. I trust you more on this.
I've never used it for anything serious, I've never even used C++ for anything serious, so don't just trust me.
Yes, code compiled with `-march=native` is supposed to be executed on CPUs with the same architecture as where it was compiled. I didn't know that it added the cache options, so it looks like it's supposed to be the same system, but I don't see how those could affect anything apart from optimization.
> Instead of `-march=native`, is it possible to add some flags explicitly to get what you want?
Partially... Now that I've found the `__target__` function attribute, it's not too important. For the whole program, instruction sets can be enabled separately, but all the questions of compatibility will be mostly the same as with `march`. If there's a known lower bound on the CPUs, though, instruction sets can be enabled with flags like `-mavx2` or `-mavx`.
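For example (my own sketch, not kata code), with a known lower bound the attribute can be dropped entirely and a global flag covers the whole file:

```c
/* Compile the whole file with: gcc -O2 -mavx2 sum.c
   No target attribute needed; every function may use AVX2. */
#include <immintrin.h>
#include <stdint.h>

void add_u64x4(const uint64_t *a, const uint64_t *b, uint64_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi64(va, vb));
}
```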
> the same code may not compile depending on that (not just unoptimized)
>
> Can this cause the same code to fail to compile depending on the VM it happened to be compiled on?
Yes, if it uses functions that use instruction sets that may or may not be available, but...
> I also thought this might make testing the change against published kata more difficult.
Things that work or don't work depending on the CPU model are already possible; even without the `__target__` attribute, there's inline asm, it's just much less convenient. So there isn't much that would change here.
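For example, something like this (a made-up sketch) assembles without any `-m` flags at all and simply dies with an illegal instruction on a CPU without AVX2:

```c
#include <stdint.h>

/* GNU inline asm bypasses the -m checks: the assembler emits vpaddq
   regardless of compiler flags, so whether this works is decided
   purely by the CPU it happens to run on. */
void add_u64x4_asm(uint64_t *a, const uint64_t *b) {
    __asm__ volatile(
        "vmovdqu (%0), %%ymm0\n\t"
        "vpaddq  (%1), %%ymm0, %%ymm0\n\t"
        "vmovdqu %%ymm0, (%0)\n\t"
        "vzeroupper"
        : /* no outputs */
        : "r"(a), "r"(b)
        : "ymm0", "memory");
}
```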
There should be no worry about adding the flags `-mavx -mavx2` if you use a machine no older than 2015 Q3, for both AMD and Intel.
Because adding `-march=native` may fail to enable AVX and AVX2 support, and in that case the compilation stops with errors like these:
```
error: inlining failed in call to ‘always_inline’ ‘_mm256_add_epi64’: target specific option mismatch
  126 | _mm256_add_epi64 (__m256i __A, __m256i __B)
error: inlining failed in call to ‘always_inline’ ‘_mm256_and_si256’: target specific option mismatch
  179 | _mm256_and_si256 (__m256i __A, __m256i __B)
error: inlining failed in call to ‘always_inline’ ‘_mm256_cmpeq_epi64’: target specific option mismatch
  252 | _mm256_cmpeq_epi64 (__m256i __A, __m256i __B)
```
@uniapi I don't see how any other `-m*` can be safer than `native`. `native` should always match the current platform, so it should be the safest of all the `-m`s, as long as the program is compiled for one-time use on the same machine.
> Because adding `-march=native` may fail to enable AVX and AVX2 support
What circumstances are you talking about? It should only fail when there really is no AVX/AVX2 support. Am I misunderstanding something?
@error256

OK! One of my machines is `skylake` and gcc is `11.1.0`. Surely you know Skylake has support for AVX! So when I compile with the flag `-march=native`, I get the following errors:
```
error: inlining failed in call to ‘always_inline’ ‘_mm256_add_epi64’: target specific option mismatch
  126 | _mm256_add_epi64 (__m256i __A, __m256i __B)
      | ^~~~~~~~~~~~~~~~
error: inlining failed in call to ‘always_inline’ ‘_mm256_and_si256’: target specific option mismatch
  179 | _mm256_and_si256 (__m256i __A, __m256i __B)
      | ^~~~~~~~~~~~~~~~
error: inlining failed in call to ‘always_inline’ ‘_mm256_cmpeq_epi64’: target specific option mismatch
  252 | _mm256_cmpeq_epi64 (__m256i __A, __m256i __B)
      | ^~~~~~~~~~~~~~~~~~
```
But everything is OK when I compile with `-mavx -mavx2`.
Yes, `-march=native` is completely safe, but it may fail to enable the AVX extensions, as I've just explained above.
But it seems that the Intel Pentium Dual-Core E6500 CPU does support AVX.
@uniapi First, it looks like you're using gcc. But the behavior should be the same: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
> Using `-march=native` enables all instruction subsets supported by the local machine...
So I don't know how you got that error. Are you using a VM? Does `lscpu` report `avx` in the flags? What march does `gcc -v -E - -march=native </dev/null 2>&1 | grep march` show?
@error256

> Using `-march=native` enables all instruction subsets supported by the local machine...

This is not true!
Here are the options for `-march=native` on x86-64 `skylake`:

```
gcc -Q -march=native --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m64 [enabled]
-m80387 [enabled]
-mabi= sysv
-mabm [enabled]
-maddress-mode= long
-maes [enabled]
-malign-data= compat
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-march= skylake
-masm= att
-mavx [enabled]
-mbranch-cost=<0,5> 3
-mclflushopt [enabled]
-mcmodel= small
-mcpu=
-mcx16 [enabled]
-mfancy-math-387 [enabled]
-mfentry-name=
-mfentry-section=
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mhard-float [enabled]
-mhle [enabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch= keep
-minstrument-return= none
-mintel-syntax -masm=intel
-mlarge-data-threshold=<number> 65536
-mlong-double-80 [enabled]
-mlzcnt [enabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmmx [enabled]
-mmovbe [enabled]
-mpclmul [enabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= none
-mpreferred-stack-boundary= 0
-mprfchw [enabled]
-mpush-args [enabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip=
-mred-zone [enabled]
-mregparm= 6
-msahf [enabled]
-msse [enabled]
-msse2 [enabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse5 -mavx
-mssse3 [enabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= global
-mstringop-strategy= [default]
-mstv [enabled]
-mtls-dialect= gnu
-mtune-ctrl=
-mtune= skylake
-mveclibabi= [default]
-mvzeroupper [enabled]
-mxsave [enabled]
```
And this one is for `-march=skylake` on the same platform:

```
gcc -Q -march=skylake --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m64 [enabled]
-m80387 [enabled]
-mabi= sysv
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-march= skylake
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mclflushopt [enabled]
-mcmodel= small
-mcpu=
-mcx16 [enabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry-name=
-mfentry-section=
-mfma [enabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mhard-float [enabled]
-mhle [enabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch= keep
-minstrument-return= none
-mintel-syntax -masm=intel
-mlarge-data-threshold=<number> 65536
-mlong-double-80 [enabled]
-mlzcnt [enabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmmx [enabled]
-mmovbe [enabled]
-mpclmul [enabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= none
-mpreferred-stack-boundary= 0
-mprfchw [enabled]
-mpush-args [enabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip=
-mred-zone [enabled]
-mregparm= 6
-msahf [enabled]
-msgx [enabled]
-msse [enabled]
-msse2 [enabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse5 -mavx
-mssse3 [enabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= global
-mstringop-strategy= [default]
-mstv [enabled]
-mtls-dialect= gnu
-mtune-ctrl=
-mtune= skylake
-mveclibabi= [default]
-mvzeroupper [enabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
```
So `native` is 10 lines shorter, because not all instructions are enabled!
Notice that `avx2` is not enabled with `native` but is turned on with `skylake`. That is, the quoted statement is not true!
The OS must also have support for AVX enabled. And that is the same reason why `gdb` does not show `ymm` registers while debugging, even though you may successfully run AVX code, as I do!
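Here is a minimal sketch (my illustration, assuming x86-64 and GNU-style inline asm) of how that OS-level support can be checked from user space:

```c
#include <stdio.h>

/* CPUID leaf 1: ECX bit 27 = OSXSAVE (the OS uses XGETBV/XSETBV),
   ECX bit 28 = AVX supported by the CPU itself. */
static int os_enabled_avx(void) {
    unsigned int eax, ebx, ecx, edx;
    __asm__("cpuid"
            : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
            : "a"(1), "c"(0));
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
        return 0;
    /* XGETBV with ECX = 0 reads XCR0; the OS must have set bit 1
       (SSE state) and bit 2 (YMM state) for AVX to be usable. */
    unsigned int lo, hi;
    __asm__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    (void)hi;
    return (lo & 0x6) == 0x6;
}

int main(void) {
    printf("OS-enabled AVX: %s\n", os_enabled_avx() ? "yes" : "no");
    return 0;
}
```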
Interesting observation, but it doesn't change the fact that, according to the documentation, `native` should enable all supported instruction sets, so it's probably a bug. I don't know your OS, but I've found this bug for gcc on macOS.
But that's gcc again. What about clang?
It's not a bug!
The same thing happens with clang: 'native' throws an error and `avx2` does not!
And the gdb version is 10.2, but it does not recognize ymm registers to print.
It's because the OS support for AVX2 is not enabled.
My OS is SunOS 11.3, i.e. `x86_64-pc-solaris2.11`.
And here is the output you asked for, from `gcc -v -E - -march=native </dev/null 2>&1 | grep march`:

```
Configured with: ../gcc/configure --prefix=/usr/gcc/11.1 --bindir=/usr/gcc/11.1/bin --sbindir=/usr/gcc/11.1/bin --libdir=/usr/gcc/11.1/lib --libexecdir=/usr/gcc/11.1/lib --mandir=/usr/gcc/11.1/share/man --infodir=/usr/gcc/11.1/share/info --with-gmp-lib=/usr/lib/amd64 --with-gmp-include=/usr/include --with-mpfr-lib=/usr/lib/amd64 --with-mpfr-include=/usr/include --with-mpc-lib=/usr/lib/amd64 --with-mpc-include=/usr/include --with-isl-lib=/usr/lib/amd64 --with-isl-include=/usr/include --without-gnu-ld --with-ld=/usr/bin/amd64/ld --with-gnu-as --with-as=/usr/binutils/2.35/bin/as --with-system-zlib --disable-werror --enable-multilib --enable-languages=c,c++,fortran,go,objc,obj-c++ --build=x86_64-pc-solaris2.11 CFLAGS='-O2 -march=skylake' CXXFLAGS='-O2 -march=skylake'
COLLECT_GCC_OPTIONS='-v' '-E' '-march=native'
/usr/gcc/11.1-skylake/bin/../lib/gcc/x86_64-pc-solaris2.11/11.1.0/cc1 -E -quiet -v -iprefix /usr/gcc/11.1-skylake/bin/../lib/gcc/x86_64-pc-solaris2.11/11.1.0/ - -march=skylake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mno-avx2 -mno-sse4a -mno-fma4 -mno-xop -mno-fma -mno-avx512f -mno-bmi -mno-bmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -mno-adx -mabm -mno-cldemote -mclflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mno-f16c -mno-fsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mno-xsavec -mno-xsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=3072 -mtune=skylake -dumpbase -
COLLECT_GCC_OPTIONS='-v' '-E' '-march=native'
```
It means the behavior of `-march=native` is actually correct in this case, and the actual problem is that "the OS support is not enabled for AVX2", which would need to be dealt with first in any serious scenario.
Exactly!
So the first option is to check the OS support and add `-march=native` if it's enabled. And the second option, if the OS support is disabled, is to check the CPU info (probably hardcoded) for AVX2 presence and add `-mavx -mavx2` if present.
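The check itself could be as simple as this sketch using GCC's `__builtin_cpu_supports` (available since GCC 4.8 and also in clang; as far as I know, for the AVX features it also takes the XCR0/OS state into account in recent versions):

```c
#include <stdio.h>

/* Prints the flags to pass to the compiler, based on what the
   current CPU (and OS) actually support. */
int main(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        puts("-mavx -mavx2");
    else if (__builtin_cpu_supports("avx"))
        puts("-mavx");
    else
        puts("");  /* nothing beyond the baseline */
    return 0;
}
```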
Here was the hack for old Ubuntu, and I'm trying to find a similar hack for my OS.
Interesting topic and an interesting find. But isn't this for a coding-challenge website? I wouldn't make it all scientific and hacky just to get things running in weird scenarios. It should work in the general case without a huge support and maintenance cost. And to point out what @kazk said at https://github.com/codewars/runner/issues/118#issuecomment-829587954: this thing runs in the cloud without a fixed CPU. If you get too specific, special, and hacky, your code will work or not from one run to the next.
Phoronix sometimes benchmarks compiler optimizations:
https://www.phoronix.com/scan.php?page=article&item=gcc-10900k-compiler&num=1
https://www.phoronix.com/scan.php?page=article&item=clang-12-opt&num=1
https://www.phoronix.com/scan.php?page=article&item=amd-znver3-gcc11&num=1
https://www.phoronix.com/scan.php?page=article&item=gcc10-gcc11-5950x&num=1
https://www.phoronix.com/scan.php?page=article&item=clang-12-5950x&num=1
C++ is a beast (and not just C++) when it comes to optimizations. Without changing the code, just with the right compiler/flag/CPU combination, it can run way faster, or way slower. (Not to mention profile-guided optimizations.)
It is already annoying when challenge code passes or fails because of a time constraint and the luck of being on a faster or slower machine. But having it pass or fail because the machine does or doesn't support some CPU feature would be even more annoying, I'd say.
@Urfoex I just shared the link for those who may have similar problems... But the Codewars solution is much easier, as I've described in the two scenarios above.
My suggestion is about enabling whatever the compiler thinks is available on the current system, not specifically about AVX/AVX2, which was supposed to be enabled automatically by that on any modern CPU. But it turns out it isn't necessarily.
> My OS is SunOS 11.3

Does this even happen on Linux? (That is, is this observation even relevant in this context?) What exactly is the OS support for AVX2, and why can it even be turned off? Surely there must be a valid reason...
Also, this scenario is observed when running on virtual machines, even if you have a capable CPU. But I am of the same opinion that all instructions should be enabled, or at least AVX/AVX2.
So then (according to you) the best scenario is to detect the target architecture and inject it into the compiler's `-march` flag:
- If on skylake, then `-march=skylake`
- If on skylake-avx512, then `-march=skylake-avx512`
- If on pentium4m, then `-march=pentium4m`
- and so on...
And it seems this is the best we could get from the running CPU!
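If you wanted to script that mapping, GCC's `__builtin_cpu_is` could do the probing; a sketch (the table here is obviously incomplete, and the CPU names are the ones listed in the GCC docs):

```c
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_is("skylake-avx512"))
        puts("-march=skylake-avx512");
    else if (__builtin_cpu_is("skylake"))
        puts("-march=skylake");
    else
        puts("-march=x86-64");  /* conservative fallback */
    return 0;
}
```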
OS, VM, whatever; the question still holds. Why doesn't the virtual machine report AVX2? Is there a reason? Is it an option?
@error256 I'm not sure about that 100%. I do know that you won't be able to use XSAVE if you're using Hyper-V, and I also know that it's possible (at least it seems possible) to turn on AVX2 support with VBoxManage. So I still don't have the answer! Probably the developers of those operating systems would know ))
And yet! How did you manage to use AVX2 intrinsics on Codewars? What constructs or attributes did you use?
> How did you manage to use AVX2 intrinsics on Codewars? What constructs or attributes did you use?
https://github.com/codewars/runner/issues/118#issuecomment-829545937
https://www.codewars.com/kumite/608aea3fa35c7c003251bb42?sel=608b0afb4865f700290a610f
(AVX, AVX2: no difference here.)
@error256 thanks for your link!
And yeah, you are right! It's not appropriate to use `-march=avx2`, because it's C and it should be portable! So if running on ARM, we could `#include <arm_neon.h>` instead.
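For example, a minimal sketch of what the NEON counterpart might look like (assuming an AArch64 target, where NEON is baseline and needs no extra flags):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Roughly analogous to _mm256_add_epi64, just two lanes instead of four. */
uint64x2_t add_u64x2(uint64x2_t a, uint64x2_t b) {
    return vaddq_u64(a, b);
}
```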
So I vote for adding `-march=native`!
113 reminded me about this thing... The target instruction set or architecture isn't specified now, so AVX intrinsics don't work in C++ and C. I think it would be reasonable to use `-march=native`, which allows using the available intrinsics, and optimizing for the current CPU architecture as well.