Open fengyuleidian0615 opened 7 years ago
Hi, thank you for the report. I obviously made a mistake in naming things, it's not an AVX512BW code. For now, the only thing you can do is simply comment out that procedure.
On the other hand, it would be interesting to see how these 16-bit shuffles from ABV512BW can help in base64 algorithms.
@WojciechMula: Your http://0x80.pl/notesen/2016-04-03-avx512-base64.html write-up still says AVX512BW, not AVX512VBMI.
(Nice write up, BTW. I had the same idea for vpermb
/ vpmultishiftqb
/ vpermb
when discussing Base64 encoding in asm on a recent Stack Overflow question. I googled for vpmultishiftqb base64
and found your writeup which made it easy to follow your implementation and see that someone had already written up the code for this implementation.)
VPMULTISHIFTQB
also requires AVX512VBMI. The xmm/ymm versions also require AVX512VL (as usual), while the ZMM version only requires AVX512VBMI. Your writeup says it only requires AVX512VL.
I'm really curious how vpermb
and vpermi2b
will perform on Cannonlake (which will introduce AVX512VBMI). I expect it will be at least as slow as vpermw
or vpermi/t2w
are on Skylake-AVX512, where they decode to 2 or 3 shuffle uops respectively. But if they're only 2 or 3 uops, that's still fantastic. (I wouldn't be surprised if even vpermb
is 3 uops in the first-gen CPU to have it, though, before AVX512-accelerated software is widespread, but probably not so slow that it's not worth using for a lot of cases. Building very wide many-lane MUXers is expensive)
But if it's only 2 uops, then assume encode bottlenecks on shuffle throughput, we can probably produce 64 bytes of results per 4 clocks. Or per 6 clocks if it's 3 uops. That's pretty fantastic, and is approaching L2 bandwidth. I wonder if Cannonlake (or some future generation) will speed up word-element lane-crossing shuffles vs. Skylake-X.
I'm not sure how slow vpermi2b
would have to be before we'd want to avoid it for decode, though. A 7-bit table is very nice.
You might be able to use merge-masking into an existing mask for something, though. e.g. _mm512_movepi8_mask(input)
, and then some other mask-generating instruction can write that with merge-masking? Or hopefully a compiler could use kortest
with two separate operands... 2x VPMOVB2M
, one of them with merge-masking, isn't obviously better than VPORD
+ VPMOVB2M
, though, so I don't think there's anything to gain over the current vpermi2b
version if you're going to keep using vpermi2b
for decode.
@pcordes Hi, thank you for such a great comment. Right, I didn't update the www.
It's difficult to speculate about performance, especially when you remember what happened to AVX2 - due to overheating, CPU decreases the clock. You still get the result after X cycles, but the wall clock would say it's was slower. If Intel keep using high frequency rates, heating problem remain.
I would love to check the implementation against any real hardware, but it's quite difficult. :)
@pcordes you perhaps know the numbers, but it's worth to cite anyway https://twitter.com/InstLatX64/status/1054655575680827392:
The real #CannonLake implementation is 3|1 for VPERMB; 5|2 for VPERMI2B and VPERMT2B1
So, it's really, really fast. There's no info on uops count.
3 cycle latency and 1c throughput implies that it's a single uop. If there were any more uops it would be at least 4 cycle latency. Yes, I had seen that and it's surprisingly great, better than I thought we could hope for. But it's probably something that's worth throwing transistors at, because efficient shuffling makes it possible to do so much stuff that's otherwise not efficiently possible.
5|2
might be 3 uops, 2 of them for the shuffle port, with no ILP between them.
Note that's it's not only naming that's incorrect.
encode.avx512vl.cpp
uses AVX512VBMI (vpermb/vpmultishiftqb
) yet not all CPUs with AVX512VL have AVX512VBMI (chart). It actually uses no AVX512VL instructions at all (according to https://software.intel.com/sites/landingpage/IntrinsicsGuide).
Also encode.avx512vbmi.cpp
doesn't use vpmultishiftqb
to rearrange 6-bit indices, an AVX512VBMI instruction.
Note that's it's not only naming that's incorrect.
encode.avx512vl.cpp
uses AVX512VBMI (vpermb/vpmultishiftqb
) yet not all CPUs with AVX512VL have AVX512VBMI (chart). It actually uses no AVX512VL instructions at all (according to https://software.intel.com/sites/landingpage/IntrinsicsGuide).Also
encode.avx512vbmi.cpp
doesn't usevpmultishiftqb
to rearrange 6-bit indices, an AVX512VBMI instruction.
Thank you, will fix it. I AM confused with all these AVX512 extensions. :)
Hi
I'm running avx512bw test on my SKL which has avx512bw supported, while I got illegal instruction traps, and after some investigation, it seems vpermb/vpermi2b belongs to avx512vbmi instead, the CPU supported for avx512vbmi seems not officially released yet.
So does the code need a littler tweak to use avx512bw instruction for test?
]# gdb /tmp/check_avx512bw ./core.103927 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /tmp/check_avx512bw...done. [New LWP 103927] Core was generated by `/tmp/check_avx512bw'. Program terminated with signal 4, Illegal instruction.
0 0x00000000004082a5 in _mm512_permutex2var_epi8 (B=..., I=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h:107
107 /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h: No such file or directory.
[1] https://software.intel.com/en-us/node/534480
[2] https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf