Open oscarbg opened 7 years ago
This will happen eventually. But not right now as there are some compatibility issues with VS2017 in my other projects and I'm not inclined to maintain two concurrent VS installations.
This is currently blocked by: https://developercommunity.visualstudio.com/content/problem/107204/msvc-fails-to-use-all-32-avx512-registers.html
FLOPs will not migrate AVX512 support from ICC to MSVC until that is fixed.
There are also tons of other non-blocking issues with AVX512 on MSVC that I'd like to see fixed before I make this move. As of now, Visual Studio's support for AVX512 is unusably buggy.
Hi, bugs were reported almost half a year ago,right? note there have been some Visual Studio 2017 point releases (from 15.1/15.2 at the time to 15.6 preview right now) so perhaps Microsoft have fixed major bugs.. no pressure(!) but have you reevaluated if bugs still persist? just interested to know how robust AVX512 codegen is right now in a code base like yours.. thanks..
I've taken a quick look at 15.5. And as far as I can tell, they have fixed nothing. Not even the ones they claim to have fixed.
TBH, AVX512 is extremely low on their priorities. My connections tell me that they really don't care nor do they have the man power to do it. And given that I've reported nearly all the AVX512 bugs, very few people are attempting to use it on Visual Studio.
So I'd say give them a few more years to get it together. A few more years should also be enough to know whether AVX512 will be become standard, or just "Intel's thing". I'm sure Microsoft is hoping it doesn't get traction since they really don't seem to want to care about AVX512.
thanks..
and sad to hear no much progress on this.. I'm hoping AMD gets interested on AVX512 now that ships a HEDT CPU in form of Threadripper and also we should get broad Intel AVX512 support from Icelake in 2019 at least..
just a question: have you tried (or expect to) Mingw64 Flops compilation to expose AVX512 some issues as shown on SWR AVX512 code (https://bugs.freedesktop.org/show_bug.cgi?id=101614)..
anyway seems the best way right now should be using Intel Windows compiler for compiling Mesa SWR AVX512 right?
I have not looked at MinGW in years because of the stack alignment bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412
I don't know if MinGW64 has a work-around. I've never bothered to look since the MSVC+ICC combination has always worked well enough that I haven't had to look elsewhere.
The GCC developers are insistent that the stack alignment problem cannot be fixed on Windows. Yet they seem to be completely unaware that both Visual Studio and the Intel Compiler have a very simple solution - introduce a second stack-pointer that's aligned.
I think part of the reason why this has been allowed to stay broken for 6 years is because anybody who seriously intends to use AVX on Windows is already using MSVC or the Intel Compiler. So there hasn't been a need to fix something that nobody uses even if it's broken.
Ok sorry for bothering you again but seems a lot of these bugs are fixed either on 15.6 final relased few days ago or first 15.7 betas which should come next week: I just keep here for easier searching by me in the future.. would be nice if you comment how hard remaining open "post 15.7" issues affect flops.. thanks.. 15.5: https://developercommunity.visualstudio.com/content/problem/114182/zmmintrinh-declaration-error.html 15.6: https://developercommunity.visualstudio.com/content/problem/107143/avx512-types-are-not-aligned-to-64-bytes.html 15.6:https://developercommunity.visualstudio.com/content/problem/107147/avx512-permute-macros-are-missing.html 15.6:https://developercommunity.visualstudio.com/content/problem/107192/avx512-masked-moves-get-miscompiled.html 15.7: https://developercommunity.visualstudio.com/content/problem/107204/msvc-fails-to-use-all-32-avx512-registers.html open :https://developercommunity.visualstudio.com/content/problem/175737/missing-zero-extension-avx-and-avx512-intrinsics.html open:https://developercommunity.visualstudio.com/content/problem/178393/avx512-compiler-doesnt-use-more-then-one-mask-regi.html open:https://developercommunity.visualstudio.com/content/problem/208512/illegal-avx512-instruction-for-zero-masked-move.html
15.6 is much better now. At least it's somewhat usable.
But the 32 registers bug is the only blocker left for FLOPs that remains open. Technically, the AVX512 kernels in FLOPs can be rewritten to fit under 16 registers and still get near peak performance. But that's a very crude work-around for a compiler bug that really should never have been there at all.
Good news I tested dayly build "VisualCppTools.Community.Daily.VS2017Layout-14.14.26329-Pre" and all your bugs are fixed!: I have cl versión 19.14.26329 (presumably should be in 2017 15.7 preview3 coming this week?) also added in zimmintrin are IFMA & VBMI intrinsics! hope you can share how good is the compiler (i.e. perf. of Flops app compiled via MSVC vs Intel Compiler) 15.7: https://developercommunity.visualstudio.com/content/problem/107204/msvc-fails-to-use-all-32-avx512-registers.html OK, making a dumpin /disasm of generated obj:
0000000000000482: 62 F1 FD 48 10 65 vmovupd zmm4,zmmword ptr [rbp+40h]
01
0000000000000489: 62 F1 FD 48 10 7D vmovupd zmm7,zmmword ptr [rbp+80h]
02
0000000000000490: 62 F1 FD 48 10 6D vmovupd zmm5,zmmword ptr [rbp+0C0h]
03
0000000000000497: 62 71 FD 48 10 45 vmovupd zmm8,zmmword ptr [rbp+100h]
04
000000000000049E: 62 71 FD 48 10 4D vmovupd zmm9,zmmword ptr [rbp+140h]
05
00000000000004A5: 62 F1 FD 48 10 5D vmovupd zmm3,zmmword ptr [rbp+180h]
06
00000000000004AC: 62 71 FD 48 10 55 vmovupd zmm10,zmmword ptr [rbp+1C0h]
07
00000000000004B3: 62 71 FD 48 10 5D vmovupd zmm11,zmmword ptr [rbp+200h]
08
00000000000004BA: 62 F1 FD 48 10 75 vmovupd zmm6,zmmword ptr [rbp+240h]
09
00000000000004C1: 62 71 FD 48 10 65 vmovupd zmm12,zmmword ptr [rbp+280h]
0A
00000000000004C8: 62 71 FD 48 10 6D vmovupd zmm13,zmmword ptr [rbp+2C0h]
0B
00000000000004CF: 62 71 FD 48 10 75 vmovupd zmm14,zmmword ptr [rbp+300h]
0C
00000000000004D6: 62 71 FD 48 10 7D vmovupd zmm15,zmmword ptr [rbp+340h]
0D
00000000000004DD: 62 E1 FD 48 10 45 vmovupd zmm16,zmmword ptr [rbp+380h]
0E
00000000000004E4: 62 E1 FD 48 10 4D vmovupd zmm17,zmmword ptr [rbp+3C0h]
0F
00000000000004EB: 62 E1 FD 48 10 55 vmovupd zmm18,zmmword ptr [rbp+400h]
10
00000000000004F2: 62 E1 FD 48 10 5D vmovupd zmm19,zmmword ptr [rbp+440h]
11
00000000000004F9: 62 E1 FD 48 10 65 vmovupd zmm20,zmmword ptr [rbp+480h]
12
0000000000000500: 62 E1 FD 48 10 6D vmovupd zmm21,zmmword ptr [rbp+4C0h]
13
0000000000000507: 62 E1 FD 48 10 75 vmovupd zmm22,zmmword ptr [rbp+500h]
14
000000000000050E: 62 E1 FD 48 10 7D vmovupd zmm23,zmmword ptr [rbp+540h]
15
0000000000000515: 62 61 FD 48 10 45 vmovupd zmm24,zmmword ptr [rbp+580h]
16
000000000000051C: 62 61 FD 48 10 4D vmovupd zmm25,zmmword ptr [rbp]
00
0000000000000523: C5 F8 57 C0 vxorps xmm0,xmm0,xmm0
0000000000000527: C5 FB 2A C0 vcvtsi2sd xmm0,xmm0,eax
000000000000052B: 62 F2 FD 48 19 D0 vbroadcastsd zmm2,xmm0
0000000000000531: 62 F2 FD 48 19 05 vbroadcastsd zmm0,mmword ptr [__real@3ff6a09e667f3bcd]
00 00 00 00
000000000000053B: 0F 1F 44 00 00 nop dword ptr [rax+rax]
0000000000000540: 62 62 FD 48 B8 C9 vfmadd231pd zmm25,zmm0,zmm1
0000000000000546: 62 62 FD 48 B8 C1 vfmadd231pd zmm24,zmm0,zmm1
000000000000054C: 62 E2 FD 48 B8 F9 vfmadd231pd zmm23,zmm0,zmm1
0000000000000552: 62 E2 FD 48 B8 F1 vfmadd231pd zmm22,zmm0,zmm1
0000000000000558: 62 E2 FD 48 B8 E9 vfmadd231pd zmm21,zmm0,zmm1
000000000000055E: 62 E2 FD 48 B8 E1 vfmadd231pd zmm20,zmm0,zmm1
0000000000000564: 62 E2 FD 48 B8 D9 vfmadd231pd zmm19,zmm0,zmm1
000000000000056A: 62 E2 FD 48 B8 D1 vfmadd231pd zmm18,zmm0,zmm1
0000000000000570: 62 E2 FD 48 B8 C9 vfmadd231pd zmm17,zmm0,zmm1
0000000000000576: 62 E2 FD 48 B8 C1 vfmadd231pd zmm16,zmm0,zmm1
000000000000057C: 62 72 FD 48 B8 F9 vfmadd231pd zmm15,zmm0,zmm1
0000000000000582: 62 72 FD 48 B8 F1 vfmadd231pd zmm14,zmm0,zmm1
0000000000000588: 62 72 FD 48 B8 E9 vfmadd231pd zmm13,zmm0,zmm1
000000000000058E: 62 72 FD 48 B8 E1 vfmadd231pd zmm12,zmm0,zmm1
0000000000000594: 62 F2 FD 48 B8 F1 vfmadd231pd zmm6,zmm0,zmm1
000000000000059A: 62 72 FD 48 B8 D9 vfmadd231pd zmm11,zmm0,zmm1
00000000000005A0: 62 72 FD 48 B8 D1 vfmadd231pd zmm10,zmm0,zmm1
00000000000005A6: 62 F2 FD 48 B8 D9 vfmadd231pd zmm3,zmm0,zmm1
00000000000005AC: 62 72 FD 48 B8 C9 vfmadd231pd zmm9,zmm0,zmm1
00000000000005B2: 62 72 FD 48 B8 C1 vfmadd231pd zmm8,zmm0,zmm1
00000000000005B8: 62 F2 FD 48 B8 E9 vfmadd231pd zmm5,zmm0,zmm1
00000000000005BE: 62 F2 FD 48 B8 F9 vfmadd231pd zmm7,zmm0,zmm1
00000000000005C4: 62 F2 FD 48 B8 E1 vfmadd231pd zmm4,zmm0,zmm1
00000000000005CA: 62 F2 FD 48 B8 D1 vfmadd231pd zmm2,zmm0,zmm1
00000000000005D0: 48 83 EB 01 sub rbx,1
00000000000005D4: 0F 85 66 FF FF FF jne 0000000000000540
00000000000005DA: 62 F1 8D 48 58 CA vaddpd zmm1,zmm14,zmm2
00000000000005E0: 62 F1 DD 40 58 C3 vaddpd zmm0,zmm20,zmm3
00000000000005E6: 62 F1 FD 48 58 D9 vaddpd zmm3,zmm0,zmm1
00000000000005EC: 62 F1 C5 40 58 CE vaddpd zmm1,zmm23,zmm6
00000000000005F2: 62 F1 F5 40 58 D5 vaddpd zmm2,zmm17,zmm5
00000000000005F8: 62 F1 F5 48 58 C2 vaddpd zmm0,zmm1,zmm2
00000000000005FE: 62 F1 FD 48 58 F3 vaddpd zmm6,zmm0,zmm3
0000000000000604: 62 F1 85 48 58 DC vaddpd zmm3,zmm15,zmm4
000000000000060A: 62 D1 D5 40 58 CA vaddpd zmm1,zmm21,zmm10
0000000000000610: 62 F1 F5 48 58 E3 vaddpd zmm4,zmm1,zmm3
0000000000000616: 62 D1 ED 40 58 D0 vaddpd zmm2,zmm18,zmm8
000000000000061C: 62 D1 BD 40 58 C4 vaddpd zmm0,zmm24,zmm12
0000000000000622: 62 F1 FD 48 58 CA vaddpd zmm1,zmm0,zmm2
0000000000000628: 62 F1 F5 48 58 EC vaddpd zmm5,zmm1,zmm4
000000000000062E: 62 F1 FD 40 58 DF vaddpd zmm3,zmm16,zmm7
0000000000000634: 62 D1 CD 40 58 C3 vaddpd zmm0,zmm22,zmm11
000000000000063A: 62 F1 FD 48 58 E3 vaddpd zmm4,zmm0,zmm3
0000000000000640: 62 D1 B5 40 58 CD vaddpd zmm1,zmm25,zmm13
0000000000000646: 62 D1 E5 40 58 D1 vaddpd zmm2,zmm19,zmm9
000000000000064C: 62 F1 F5 48 58 C2 vaddpd zmm0,zmm1,zmm2
0000000000000652: 62 F1 FD 48 58 D4 vaddpd zmm2,zmm0,zmm4
0000000000000658: 62 F1 ED 48 58 DD vaddpd zmm3,zmm2,zmm5
000000000000065E: 62 F1 E5 48 58 CE vaddpd zmm1,zmm3,zmm6
https://developercommunity.visualstudio.com/content/problem/175737/missing-zero-extension-avx-and-avx512-intrinsics.html compiles OK! (also I see in zmmintrin added:
// Zero-extended cast functions
extern __m512d __cdecl _mm512_zextpd128_pd512(__m128d);
extern __m512d __cdecl _mm512_zextpd256_pd512(__m256d);
extern __m512 __cdecl _mm512_zextps128_ps512(__m128);
extern __m512 __cdecl _mm512_zextps256_ps512(__m256);
extern __m512i __cdecl _mm512_zextsi128_si512(__m128i);
extern __m512i __cdecl _mm512_zextsi256_si512(__m256i);
) https://developercommunity.visualstudio.com/content/problem/208512/illegal-avx512-instruction-for-zero-masked-move.html OK, I compared disassembled code and old generated: vmovdqa32 zmmword ptr [rbp+40h]{k1}{z},zmm0 new: vmovdqa32 zmm16{k1}{z},zmm0
Nice! Can't wait for when these go live. If and when I get the time, I might try out the preview on one of my sandboxes whenever they're available again.
Realistically speaking, this FLOPs benchmark is not a great compiler benchmark. The operations are too trivial. And any compiler that does the basics will be able to reach the theoretical limit. IOW, failing to achieve the theoretical limit (or failing to compile at all) would be an indicator that the compiler isn't ready.
Just by eyeballing the inner loop of FMAs, its looks like it will finally achieve theoretical limit. So in this sense, MSVC has finally done the basics. The real test will be how it handles more complicated (real-life) intrinsic code that involves memory accesses and that actually puts pressure on the register allocator.
Hi, thanks for info.. FYI just installed preview 3 released today and ships with same exact compiler version: 19.14.26329 as tested before.. I almost forgot you are the author of y-cruncher and now seen: https://github.com/Mysticial/DigitViewer mentions in version 2: "Intel Compiler 2018 is required to build 17-Skylake. AVX512 support in Visual Studio is currently too buggy to use." is this an overall better AVX512 test (more "real life") than FLOPS? also don't know if y-cruncher is ready for MSVC compilation but if yes are you interested in trying to compile whole y-cruncher project with MSVC and provide a test build on your site as this project is tested on lots of hardware (even overclocked) and could help get a better overview of perf/stability of generated code under extreme OC vs current executable.. of course just my two cents.. feel no pressure to do so.. :-)
is this an overall better AVX512 test (more "real life") than FLOPS?
The Digit Viewer is definitely more realistic than FLOPs. But since there are no unit tests for it, it isn't great for testing the compiler. Sure it'll compile, but you won't know if it compiles correctly.
The reason why it lacks unit tests is because y-cruncher itself links it in and pretty much hits every corner of it under its integration tests. So I never bothered to give the Digit Viewer any dedicated tests.
also don't know if y-cruncher is ready for MSVC compilation but if yes are you interested in trying to compile whole y-cruncher project with MSVC
I maintain support for 3 compilers: MSVC, Intel Compiler (Windows), and GCC (Linux). So as of this writing, every single binary in y-cruncher will compile under:
With the following exceptions:
16-KNL
and 17-SKX
get miscompiled under MSVC due to this bug.18-CNL
won't compile under MSVC 15.6 due to it missing the AVX512-IFMA and AVX512-VBMI intrinsics.Intel Compiler 2018 and GCC 7.0 successfully and correctly compile everything. (Excluding a number of bugs in both compilers which have been successfully worked-around.)
provide a test build on your site as this project is tested on lots of hardware
Probably not the answer you're looking for, but 3 of the publicly released Windows binaries are compiled with MSVC: 04-P4P
, 05-A64
, and 11-BD1
I haven't released MSVC-compiled AVX binaries in years since it has fallen too far behind the Intel Compiler in that area.
Hi, nice for all info.. sorry I wasn't very informed about y-cruncher after all..
The AVX512 binaries, 16-KNL and 17-SKX get miscompiled under MSVC due to this bug.
that is fixed now as said but only in preview (15.7p3) right now.. EDIT:
The unreleased Cannonlake binary, 18-CNL won't compile under MSVC 15.6 due to it missing the AVX512-IFMA and AVX512-VBMI intrinsics.
this IFMA and VBMI is also on 15.7p3.. just a question isn't this new extensions expected to be on Icelake now?
I haven't released MSVC-compiled AVX binaries in years since it has fallen too far behind the Intel Compiler in that area.
ok nice to know that even MSVC generated AVX wasn't too performant.. then we can't hope AVX512 generated code will magically be much better..
nice for all info.. sorry I wasn't very informed about y-cruncher after all..
Nah, nothing complicated here. y-cruncher has about a dozen different binaries with names of the form, "13-HSW". The 3 letters is an abbreviation of the processor architecture and the number is the year that it retailed.
this IFMA and VBMI is also on 15.7p3.. just a question isn't this new extensions expected to be on Icelake now?
IFMA and VBMI is Cannonlake. Ice Lake has both of those + a bunch more stuff.
ok nice to know that even MSVC generated AVX wasn't too performant.. then we can't hope AVX512 generated code will magically be much better..
It's worth mentioning that GCC is fine. It isn't quite as good as the Intel Compiler, but it's consistently only a tiny bit worse. Whereas MSVC seems to fall further and further behind with each release. There was even a regression of about ~2% going from MSVC 2012 -> 2013. That was the last time I released an MSVC-compiled AVX binary for y-cruncher.
I'm testing out 15.7p3 now:
The illegal mask-move bug is not fixed. Fixing the 32 registers bug breaks the repro example since it no longer register spills. But once I increase the size to 32 to force a register spill again, the bad instruction comes back and crashes.
It ICE's when I try to compile y-cruncher with /arch:AVX512
. No ICE in 15.6.0.
So looks like it's gonna be a longer wait...
Bad news but thanks for joining the testing.. to be honest they didn't say anything about the mask-move bug being fixed.. just curiosity are you willing to report in some way the ICEs you are getting so they at least are aware of it? thanks..
Took me some time to isolate it: https://developercommunity.visualstudio.com/content/problem/232533/ice-with-archavx512-and-no-whole-program-optimizat.html
@Mystical just gave a try with VS 15.8 Preview 3 and seems this ICE is fixed now.. also I notice they say "illegal mask-move bug" is fixed.. will be nice to know how further you can get..
@oscarbg The ICE does look to be fixed in preview 3. But the miscompilation that I mentioned in the comments on that bug report isn't. I was hoping they were the same bug, but apparently that's not the case.
I'm not looking forward to tracking down and isolating this new miscompilation.
FFS, why does it feel like I'm single-handedly finding all the AVX512 bugs in MSVC?
The 2nd one is the cause of the new miscompilations. The 1st one I ran into by chance while trying to track down the 1st one.
@Mysticial thanks for your efforts.. I understand your pain, you are a good bug reporter, hope you don't lost all motivation to keep MSVC AVX512 compiler sane.. and they should start paying something for your bug reports seriously.. also I'm starting to believe AVX512 VC++ bugs while take a lot of time to fix.. and you were right that seems Intel Compiler is the unique serious compiler for Windows for compiling vectorized codes..
Hi @Mysticial , seems Visual Studio 2019 preview 2 ships and it's a huge compiler update with some improvements also in AVX codegen: https://blogs.msdn.microsoft.com/vcblog/2019/01/24/msvc-backend-updates-in-visual-studio-2019-preview-2/
also seems all your related bug reports like last time bugs are fixed:
would love to hear your findings on how buggy,mature and performant AVX512 codegen is this days on VC2019 preview 2 vs Intel Compiler..
thanks..
I don't have VS2019 yet. But I just tested VS2017 (15.9.4) and it looks all good for this benchmark. Not that I expected otherwise since I had it working with y-cruncher since VS 15.9.0. Performance is within a couple % of the Intel Compiler. The assembly looks reasonable. So good enough.
So I'll push an update to flip the "17-SkylakePurley" mode to VS. But "16-KnightsLanding" needs to stay with the Intel Compiler since VS doesn't support AVX512 without the Skylake flavors.
Good to hear
Hi, seems VS2017 adds AVX512 support.. can add support for it without requiring Intel COmpiler?