Add support for AVX512 with Visual Studio 2017 without Intel Compiler

oscarbg commented 7 years ago

Hi, seems VS2017 adds AVX512 support.. can add support for it without requiring Intel COmpiler?

Mysticial commented 7 years ago

This will happen eventually. But not right now as there are some compatibility issues with VS2017 in my other projects and I'm not inclined to maintain two concurrent VS installations.

Mysticial commented 7 years ago

This is currently blocked by: https://developercommunity.visualstudio.com/content/problem/107204/msvc-fails-to-use-all-32-avx512-registers.html

FLOPs will not migrate AVX512 support from ICC to MSVC until that is fixed.

There are also tons of other non-blocking issues with AVX512 on MSVC that I'd like to see fixed before I make this move. As of now, Visual Studio's support for AVX512 is unusably buggy.

oscarbg commented 6 years ago

Hi, bugs were reported almost half a year ago,right? note there have been some Visual Studio 2017 point releases (from 15.1/15.2 at the time to 15.6 preview right now) so perhaps Microsoft have fixed major bugs.. no pressure(!) but have you reevaluated if bugs still persist? just interested to know how robust AVX512 codegen is right now in a code base like yours.. thanks..

Mysticial commented 6 years ago

I've taken a quick look at 15.5. And as far as I can tell, they have fixed nothing. Not even the ones they claim to have fixed.

TBH, AVX512 is extremely low on their priorities. My connections tell me that they really don't care nor do they have the man power to do it. And given that I've reported nearly all the AVX512 bugs, very few people are attempting to use it on Visual Studio.

So I'd say give them a few more years to get it together. A few more years should also be enough to know whether AVX512 will be become standard, or just "Intel's thing". I'm sure Microsoft is hoping it doesn't get traction since they really don't seem to want to care about AVX512.

oscarbg commented 6 years ago

thanks..

and sad to hear no much progress on this.. I'm hoping AMD gets interested on AVX512 now that ships a HEDT CPU in form of Threadripper and also we should get broad Intel AVX512 support from Icelake in 2019 at least..

just a question: have you tried (or expect to) Mingw64 Flops compilation to expose AVX512 some issues as shown on SWR AVX512 code (https://bugs.freedesktop.org/show_bug.cgi?id=101614)..

anyway seems the best way right now should be using Intel Windows compiler for compiling Mesa SWR AVX512 right?

Mysticial commented 6 years ago

I have not looked at MinGW in years because of the stack alignment bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

I don't know if MinGW64 has a work-around. I've never bothered to look since the MSVC+ICC combination has always worked well enough that I haven't had to look elsewhere.

The GCC developers are insistent that the stack alignment problem cannot be fixed on Windows. Yet they seem to be completely unaware that both Visual Studio and the Intel Compiler have a very simple solution - introduce a second stack-pointer that's aligned.

I think part of the reason why this has been allowed to stay broken for 6 years is because anybody who seriously intends to use AVX on Windows is already using MSVC or the Intel Compiler. So there hasn't been a need to fix something that nobody uses even if it's broken.

oscarbg commented 6 years ago

Mysticial commented 6 years ago

15.6 is much better now. At least it's somewhat usable.

But the 32 registers bug is the only blocker left for FLOPs that remains open. Technically, the AVX512 kernels in FLOPs can be rewritten to fit under 16 registers and still get near peak performance. But that's a very crude work-around for a compiler bug that really should never have been there at all.

oscarbg commented 6 years ago

Good news I tested dayly build "VisualCppTools.Community.Daily.VS2017Layout-14.14.26329-Pre" and all your bugs are fixed!: I have cl versión 19.14.26329 (presumably should be in 2017 15.7 preview3 coming this week?) also added in zimmintrin are IFMA & VBMI intrinsics! hope you can share how good is the compiler (i.e. perf. of Flops app compiled via MSVC vs Intel Compiler) 15.7: https://developercommunity.visualstudio.com/content/problem/107204/msvc-fails-to-use-all-32-avx512-registers.html OK, making a dumpin /disasm of generated obj:

0000000000000482: 62 F1 FD 48 10 65  vmovupd     zmm4,zmmword ptr [rbp+40h]
                    01
  0000000000000489: 62 F1 FD 48 10 7D  vmovupd     zmm7,zmmword ptr [rbp+80h]
                    02
  0000000000000490: 62 F1 FD 48 10 6D  vmovupd     zmm5,zmmword ptr [rbp+0C0h]
                    03
  0000000000000497: 62 71 FD 48 10 45  vmovupd     zmm8,zmmword ptr [rbp+100h]
                    04
  000000000000049E: 62 71 FD 48 10 4D  vmovupd     zmm9,zmmword ptr [rbp+140h]
                    05
  00000000000004A5: 62 F1 FD 48 10 5D  vmovupd     zmm3,zmmword ptr [rbp+180h]
                    06
  00000000000004AC: 62 71 FD 48 10 55  vmovupd     zmm10,zmmword ptr [rbp+1C0h]
                    07
  00000000000004B3: 62 71 FD 48 10 5D  vmovupd     zmm11,zmmword ptr [rbp+200h]
                    08
  00000000000004BA: 62 F1 FD 48 10 75  vmovupd     zmm6,zmmword ptr [rbp+240h]
                    09
  00000000000004C1: 62 71 FD 48 10 65  vmovupd     zmm12,zmmword ptr [rbp+280h]
                    0A
  00000000000004C8: 62 71 FD 48 10 6D  vmovupd     zmm13,zmmword ptr [rbp+2C0h]
                    0B
  00000000000004CF: 62 71 FD 48 10 75  vmovupd     zmm14,zmmword ptr [rbp+300h]
                    0C
  00000000000004D6: 62 71 FD 48 10 7D  vmovupd     zmm15,zmmword ptr [rbp+340h]
                    0D
  00000000000004DD: 62 E1 FD 48 10 45  vmovupd     zmm16,zmmword ptr [rbp+380h]
                    0E
  00000000000004E4: 62 E1 FD 48 10 4D  vmovupd     zmm17,zmmword ptr [rbp+3C0h]
                    0F
  00000000000004EB: 62 E1 FD 48 10 55  vmovupd     zmm18,zmmword ptr [rbp+400h]
                    10
  00000000000004F2: 62 E1 FD 48 10 5D  vmovupd     zmm19,zmmword ptr [rbp+440h]
                    11
  00000000000004F9: 62 E1 FD 48 10 65  vmovupd     zmm20,zmmword ptr [rbp+480h]
                    12
  0000000000000500: 62 E1 FD 48 10 6D  vmovupd     zmm21,zmmword ptr [rbp+4C0h]
                    13
  0000000000000507: 62 E1 FD 48 10 75  vmovupd     zmm22,zmmword ptr [rbp+500h]
                    14
  000000000000050E: 62 E1 FD 48 10 7D  vmovupd     zmm23,zmmword ptr [rbp+540h]
                    15
  0000000000000515: 62 61 FD 48 10 45  vmovupd     zmm24,zmmword ptr [rbp+580h]
                    16
  000000000000051C: 62 61 FD 48 10 4D  vmovupd     zmm25,zmmword ptr [rbp]
                    00
  0000000000000523: C5 F8 57 C0        vxorps      xmm0,xmm0,xmm0
  0000000000000527: C5 FB 2A C0        vcvtsi2sd   xmm0,xmm0,eax
  000000000000052B: 62 F2 FD 48 19 D0  vbroadcastsd zmm2,xmm0
  0000000000000531: 62 F2 FD 48 19 05  vbroadcastsd zmm0,mmword ptr [__real@3ff6a09e667f3bcd]
                    00 00 00 00
  000000000000053B: 0F 1F 44 00 00     nop         dword ptr [rax+rax]
  0000000000000540: 62 62 FD 48 B8 C9  vfmadd231pd zmm25,zmm0,zmm1
  0000000000000546: 62 62 FD 48 B8 C1  vfmadd231pd zmm24,zmm0,zmm1
  000000000000054C: 62 E2 FD 48 B8 F9  vfmadd231pd zmm23,zmm0,zmm1
  0000000000000552: 62 E2 FD 48 B8 F1  vfmadd231pd zmm22,zmm0,zmm1
  0000000000000558: 62 E2 FD 48 B8 E9  vfmadd231pd zmm21,zmm0,zmm1
  000000000000055E: 62 E2 FD 48 B8 E1  vfmadd231pd zmm20,zmm0,zmm1
  0000000000000564: 62 E2 FD 48 B8 D9  vfmadd231pd zmm19,zmm0,zmm1
  000000000000056A: 62 E2 FD 48 B8 D1  vfmadd231pd zmm18,zmm0,zmm1
  0000000000000570: 62 E2 FD 48 B8 C9  vfmadd231pd zmm17,zmm0,zmm1
  0000000000000576: 62 E2 FD 48 B8 C1  vfmadd231pd zmm16,zmm0,zmm1
  000000000000057C: 62 72 FD 48 B8 F9  vfmadd231pd zmm15,zmm0,zmm1
  0000000000000582: 62 72 FD 48 B8 F1  vfmadd231pd zmm14,zmm0,zmm1
  0000000000000588: 62 72 FD 48 B8 E9  vfmadd231pd zmm13,zmm0,zmm1
  000000000000058E: 62 72 FD 48 B8 E1  vfmadd231pd zmm12,zmm0,zmm1
  0000000000000594: 62 F2 FD 48 B8 F1  vfmadd231pd zmm6,zmm0,zmm1
  000000000000059A: 62 72 FD 48 B8 D9  vfmadd231pd zmm11,zmm0,zmm1
  00000000000005A0: 62 72 FD 48 B8 D1  vfmadd231pd zmm10,zmm0,zmm1
  00000000000005A6: 62 F2 FD 48 B8 D9  vfmadd231pd zmm3,zmm0,zmm1
  00000000000005AC: 62 72 FD 48 B8 C9  vfmadd231pd zmm9,zmm0,zmm1
  00000000000005B2: 62 72 FD 48 B8 C1  vfmadd231pd zmm8,zmm0,zmm1
  00000000000005B8: 62 F2 FD 48 B8 E9  vfmadd231pd zmm5,zmm0,zmm1
  00000000000005BE: 62 F2 FD 48 B8 F9  vfmadd231pd zmm7,zmm0,zmm1
  00000000000005C4: 62 F2 FD 48 B8 E1  vfmadd231pd zmm4,zmm0,zmm1
  00000000000005CA: 62 F2 FD 48 B8 D1  vfmadd231pd zmm2,zmm0,zmm1
  00000000000005D0: 48 83 EB 01        sub         rbx,1
  00000000000005D4: 0F 85 66 FF FF FF  jne         0000000000000540
  00000000000005DA: 62 F1 8D 48 58 CA  vaddpd      zmm1,zmm14,zmm2
  00000000000005E0: 62 F1 DD 40 58 C3  vaddpd      zmm0,zmm20,zmm3
  00000000000005E6: 62 F1 FD 48 58 D9  vaddpd      zmm3,zmm0,zmm1
  00000000000005EC: 62 F1 C5 40 58 CE  vaddpd      zmm1,zmm23,zmm6
  00000000000005F2: 62 F1 F5 40 58 D5  vaddpd      zmm2,zmm17,zmm5
  00000000000005F8: 62 F1 F5 48 58 C2  vaddpd      zmm0,zmm1,zmm2
  00000000000005FE: 62 F1 FD 48 58 F3  vaddpd      zmm6,zmm0,zmm3
  0000000000000604: 62 F1 85 48 58 DC  vaddpd      zmm3,zmm15,zmm4
  000000000000060A: 62 D1 D5 40 58 CA  vaddpd      zmm1,zmm21,zmm10
  0000000000000610: 62 F1 F5 48 58 E3  vaddpd      zmm4,zmm1,zmm3
  0000000000000616: 62 D1 ED 40 58 D0  vaddpd      zmm2,zmm18,zmm8
  000000000000061C: 62 D1 BD 40 58 C4  vaddpd      zmm0,zmm24,zmm12
  0000000000000622: 62 F1 FD 48 58 CA  vaddpd      zmm1,zmm0,zmm2
  0000000000000628: 62 F1 F5 48 58 EC  vaddpd      zmm5,zmm1,zmm4
  000000000000062E: 62 F1 FD 40 58 DF  vaddpd      zmm3,zmm16,zmm7
  0000000000000634: 62 D1 CD 40 58 C3  vaddpd      zmm0,zmm22,zmm11
  000000000000063A: 62 F1 FD 48 58 E3  vaddpd      zmm4,zmm0,zmm3
  0000000000000640: 62 D1 B5 40 58 CD  vaddpd      zmm1,zmm25,zmm13
  0000000000000646: 62 D1 E5 40 58 D1  vaddpd      zmm2,zmm19,zmm9
  000000000000064C: 62 F1 F5 48 58 C2  vaddpd      zmm0,zmm1,zmm2
  0000000000000652: 62 F1 FD 48 58 D4  vaddpd      zmm2,zmm0,zmm4
  0000000000000658: 62 F1 ED 48 58 DD  vaddpd      zmm3,zmm2,zmm5
  000000000000065E: 62 F1 E5 48 58 CE  vaddpd      zmm1,zmm3,zmm6

https://developercommunity.visualstudio.com/content/problem/175737/missing-zero-extension-avx-and-avx512-intrinsics.html compiles OK! (also I see in zmmintrin added:

// Zero-extended cast functions
extern __m512d   __cdecl _mm512_zextpd128_pd512(__m128d);
extern __m512d   __cdecl _mm512_zextpd256_pd512(__m256d);
extern __m512    __cdecl _mm512_zextps128_ps512(__m128);
extern __m512    __cdecl _mm512_zextps256_ps512(__m256);
extern __m512i   __cdecl _mm512_zextsi128_si512(__m128i);
extern __m512i   __cdecl _mm512_zextsi256_si512(__m256i);

) https://developercommunity.visualstudio.com/content/problem/208512/illegal-avx512-instruction-for-zero-masked-move.html OK, I compared disassembled code and old generated: vmovdqa32 zmmword ptr [rbp+40h]{k1}{z},zmm0 new: vmovdqa32 zmm16{k1}{z},zmm0

Mysticial commented 6 years ago

Nice! Can't wait for when these go live. If and when I get the time, I might try out the preview on one of my sandboxes whenever they're available again.

Realistically speaking, this FLOPs benchmark is not a great compiler benchmark. The operations are too trivial. And any compiler that does the basics will be able to reach the theoretical limit. IOW, failing to achieve the theoretical limit (or failing to compile at all) would be an indicator that the compiler isn't ready.

Just by eyeballing the inner loop of FMAs, its looks like it will finally achieve theoretical limit. So in this sense, MSVC has finally done the basics. The real test will be how it handles more complicated (real-life) intrinsic code that involves memory accesses and that actually puts pressure on the register allocator.

oscarbg commented 6 years ago

Hi, thanks for info.. FYI just installed preview 3 released today and ships with same exact compiler version: 19.14.26329 as tested before.. I almost forgot you are the author of y-cruncher and now seen: https://github.com/Mysticial/DigitViewer mentions in version 2: "Intel Compiler 2018 is required to build 17-Skylake. AVX512 support in Visual Studio is currently too buggy to use." is this an overall better AVX512 test (more "real life") than FLOPS? also don't know if y-cruncher is ready for MSVC compilation but if yes are you interested in trying to compile whole y-cruncher project with MSVC and provide a test build on your site as this project is tested on lots of hardware (even overclocked) and could help get a better overview of perf/stability of generated code under extreme OC vs current executable.. of course just my two cents.. feel no pressure to do so.. :-)

Mysticial commented 6 years ago

is this an overall better AVX512 test (more "real life") than FLOPS?

The Digit Viewer is definitely more realistic than FLOPs. But since there are no unit tests for it, it isn't great for testing the compiler. Sure it'll compile, but you won't know if it compiles correctly.

The reason why it lacks unit tests is because y-cruncher itself links it in and pretty much hits every corner of it under its integration tests. So I never bothered to give the Digit Viewer any dedicated tests.

also don't know if y-cruncher is ready for MSVC compilation but if yes are you interested in trying to compile whole y-cruncher project with MSVC

I maintain support for 3 compilers: MSVC, Intel Compiler (Windows), and GCC (Linux). So as of this writing, every single binary in y-cruncher will compile under:

Windows - VS2017 15.6.0
Windows - Intel Compiler 2018
Ubuntu - GCC 7.1

With the following exceptions:

The AVX512 binaries, 16-KNL and 17-SKX get miscompiled under MSVC due to this bug.
The unreleased Cannonlake binary, 18-CNL won't compile under MSVC 15.6 due to it missing the AVX512-IFMA and AVX512-VBMI intrinsics.
Neither of the 32-bit binaries will compile in Linux since I don't support 32-bit there anyway.

Intel Compiler 2018 and GCC 7.0 successfully and correctly compile everything. (Excluding a number of bugs in both compilers which have been successfully worked-around.)

provide a test build on your site as this project is tested on lots of hardware

Probably not the answer you're looking for, but 3 of the publicly released Windows binaries are compiled with MSVC: 04-P4P, 05-A64, and 11-BD1

I haven't released MSVC-compiled AVX binaries in years since it has fallen too far behind the Intel Compiler in that area.

oscarbg commented 6 years ago

Hi, nice for all info.. sorry I wasn't very informed about y-cruncher after all..

The AVX512 binaries, 16-KNL and 17-SKX get miscompiled under MSVC due to this bug.

that is fixed now as said but only in preview (15.7p3) right now.. EDIT:

The unreleased Cannonlake binary, 18-CNL won't compile under MSVC 15.6 due to it missing the AVX512-IFMA and AVX512-VBMI intrinsics.

this IFMA and VBMI is also on 15.7p3.. just a question isn't this new extensions expected to be on Icelake now?

I haven't released MSVC-compiled AVX binaries in years since it has fallen too far behind the Intel Compiler in that area.

ok nice to know that even MSVC generated AVX wasn't too performant.. then we can't hope AVX512 generated code will magically be much better..

Mysticial commented 6 years ago

nice for all info.. sorry I wasn't very informed about y-cruncher after all..

Nah, nothing complicated here. y-cruncher has about a dozen different binaries with names of the form, "13-HSW". The 3 letters is an abbreviation of the processor architecture and the number is the year that it retailed.

this IFMA and VBMI is also on 15.7p3.. just a question isn't this new extensions expected to be on Icelake now?

IFMA and VBMI is Cannonlake. Ice Lake has both of those + a bunch more stuff.

ok nice to know that even MSVC generated AVX wasn't too performant.. then we can't hope AVX512 generated code will magically be much better..

It's worth mentioning that GCC is fine. It isn't quite as good as the Intel Compiler, but it's consistently only a tiny bit worse. Whereas MSVC seems to fall further and further behind with each release. There was even a regression of about ~2% going from MSVC 2012 -> 2013. That was the last time I released an MSVC-compiled AVX binary for y-cruncher.

Mysticial commented 6 years ago

I'm testing out 15.7p3 now:

The illegal mask-move bug is not fixed. Fixing the 32 registers bug breaks the repro example since it no longer register spills. But once I increase the size to 32 to force a register spill again, the bad instruction comes back and crashes.
It ICE's when I try to compile y-cruncher with /arch:AVX512. No ICE in 15.6.0.

So looks like it's gonna be a longer wait...

oscarbg commented 6 years ago

Bad news but thanks for joining the testing.. to be honest they didn't say anything about the mask-move bug being fixed.. just curiosity are you willing to report in some way the ICEs you are getting so they at least are aware of it? thanks..

Mysticial commented 6 years ago

Took me some time to isolate it: https://developercommunity.visualstudio.com/content/problem/232533/ice-with-archavx512-and-no-whole-program-optimizat.html

oscarbg commented 6 years ago

@Mystical just gave a try with VS 15.8 Preview 3 and seems this ICE is fixed now.. also I notice they say "illegal mask-move bug" is fixed.. will be nice to know how further you can get..

Mysticial commented 6 years ago

@oscarbg The ICE does look to be fixed in preview 3. But the miscompilation that I mentioned in the comments on that bug report isn't. I was hoping they were the same bug, but apparently that's not the case.

I'm not looking forward to tracking down and isolating this new miscompilation.

Mysticial commented 6 years ago

FFS, why does it feel like I'm single-handedly finding all the AVX512 bugs in MSVC?

The 2nd one is the cause of the new miscompilations. The 1st one I ran into by chance while trying to track down the 1st one.

oscarbg commented 6 years ago

@Mysticial thanks for your efforts.. I understand your pain, you are a good bug reporter, hope you don't lost all motivation to keep MSVC AVX512 compiler sane.. and they should start paying something for your bug reports seriously.. also I'm starting to believe AVX512 VC++ bugs while take a lot of time to fix.. and you were right that seems Intel Compiler is the unique serious compiler for Windows for compiling vectorized codes..

oscarbg commented 5 years ago

Hi @Mysticial , seems Visual Studio 2019 preview 2 ships and it's a huge compiler update with some improvements also in AVX codegen: https://blogs.msdn.microsoft.com/vcblog/2019/01/24/msvc-backend-updates-in-visual-studio-2019-preview-2/

also seems all your related bug reports like last time bugs are fixed:

would love to hear your findings on how buggy,mature and performant AVX512 codegen is this days on VC2019 preview 2 vs Intel Compiler..

thanks..

Mysticial commented 5 years ago

I don't have VS2019 yet. But I just tested VS2017 (15.9.4) and it looks all good for this benchmark. Not that I expected otherwise since I had it working with y-cruncher since VS 15.9.0. Performance is within a couple % of the Intel Compiler. The assembly looks reasonable. So good enough.

So I'll push an update to flip the "17-SkylakePurley" mode to VS. But "16-KnightsLanding" needs to stay with the Intel Compiler since VS doesn't support AVX512 without the Skylake flavors.

jarodrig commented 5 years ago

Good to hear

Mysticial / Flops

Add support for AVX512 with Visual Studio 2017 without Intel Compiler #16