Closed: olaayeko closed this issue 4 years ago
Inline ASM error? This one might be for @chriselrod or @YingboMa. Can you share versioninfo()?
Is this what you mean by version info?
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)
There seems to be a mismatch:
CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
This is a very new Ice Lake CPU, featuring FMA3 and AVX512F instruction sets among others, but:
LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)
LLVM treats this as a CPU without any modern instruction sets.
CpuId.jl is probably correctly reporting that the CPU is capable of executing certain assembly instructions, causing some code in LoopVectorization to emit them.
But because LLVM believes a Goldmont CPU is incapable of executing those instructions, it complains. I'm not sure why that line doesn't read:
LLVM: libLLVM-8.0.1 (ORCJIT, icelake)
LLVM 8 is new enough to support icelake. Having your CPU properly recognized as icelake should make a lot of code run much faster.
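One way to check which target LLVM picked, from inside Julia itself (a sketch; `Sys.CPU_NAME` is a constant provided by Base):

```julia
# Print the CPU name LLVM detected at startup. On this machine it should
# read "icelake-client" once detection works; currently it shows the
# misdetected target instead.
println(Sys.CPU_NAME)

# versioninfo() reports the same target on its LLVM line.
using InteractiveUtils
versioninfo()
```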
Interesting, is there a workaround?
How did you install Julia?
I think I used Homebrew. The differential equations package was working fine until today.
Could you try installing an official binary instead? Package managers have been known to cause all sorts of problems with dependencies, like LLVM.
Something else you could try is starting Julia with
julia -Cicelake-client
If you have both the Homebrew and official binaries installed, may I suggest you try:
using BenchmarkTools, StaticArrays
A = @SMatrix rand(8,8);
B = @SMatrix rand(8,8);
@benchmark $A * $B
on both versions, and report their respective timings.
Assuming that the official binary works correctly, it should be several times faster than the Homebrew version, because LLVM does a much better job optimizing code when it actually knows which CPU that code is running on (inline ASM issues aside).
I uninstalled Julia with Homebrew and installed the official binary, but I am still getting the error. The benchmark test produced this:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     32.553 ns (0.00% GC)
  median time:      32.592 ns (0.00% GC)
  mean time:        33.066 ns (0.00% GC)
  maximum time:     198.355 ns (0.00% GC)
  samples:          10000
  evals/sample:     993
Could you share your new version info?
As well as @code_native A * B?
33 ns doesn't sound bad.
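For scale, a back-of-the-envelope FLOP-rate estimate from the reported minimum time (a sketch; the 33 ns figure is from the benchmark above):

```julia
# An 8×8 dense matmul does n^3 multiplies and n^3 adds.
n = 8
flops = 2 * n^3        # 1024 floating point operations
t_min = 32.553e-9      # reported minimum time, in seconds
gflops = flops / t_min / 1e9
println(round(gflops, digits = 1))  # roughly 31 GFLOPS
```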
Version Information:
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)
Still goldmont, that is bizarre.
Have you tried starting Julia with julia -Cicelake-client?
This is the error I got:
olaayeko@Olas-MBP ~ % julia -Cicelake-client
ERROR: Your CPU does not support the CX16 instruction, which is required by this version of Julia! This is often due to running inside of a virtualized environment. Please read https://docs.julialang.org/en/stable/devdocs/sysimg/ for more.
Could you show me the @code_native from the StaticArrays example? As well as julia -Chelp?
You could also try julia -Cskylake-avx512.
julia -Chelp
Available CPUs for this target:
amdfam10 - Select the amdfam10 processor. athlon - Select the athlon processor. athlon-4 - Select the athlon-4 processor. athlon-fx - Select the athlon-fx processor. athlon-mp - Select the athlon-mp processor. athlon-tbird - Select the athlon-tbird processor. athlon-xp - Select the athlon-xp processor. athlon64 - Select the athlon64 processor. athlon64-sse3 - Select the athlon64-sse3 processor. atom - Select the atom processor. barcelona - Select the barcelona processor. bdver1 - Select the bdver1 processor. bdver2 - Select the bdver2 processor. bdver3 - Select the bdver3 processor. bdver4 - Select the bdver4 processor. bonnell - Select the bonnell processor. broadwell - Select the broadwell processor. btver1 - Select the btver1 processor. btver2 - Select the btver2 processor. c3 - Select the c3 processor. c3-2 - Select the c3-2 processor. cannonlake - Select the cannonlake processor. cascadelake - Select the cascadelake processor. core-avx-i - Select the core-avx-i processor. core-avx2 - Select the core-avx2 processor. core2 - Select the core2 processor. corei7 - Select the corei7 processor. corei7-avx - Select the corei7-avx processor. generic - Select the generic processor. geode - Select the geode processor. goldmont - Select the goldmont processor. goldmont-plus - Select the goldmont-plus processor. haswell - Select the haswell processor. i386 - Select the i386 processor. i486 - Select the i486 processor. i586 - Select the i586 processor. i686 - Select the i686 processor. icelake-client - Select the icelake-client processor. icelake-server - Select the icelake-server processor. ivybridge - Select the ivybridge processor. k6 - Select the k6 processor. k6-2 - Select the k6-2 processor. k6-3 - Select the k6-3 processor. k8 - Select the k8 processor. k8-sse3 - Select the k8-sse3 processor. knl - Select the knl processor. knm - Select the knm processor. lakemont - Select the lakemont processor. nehalem - Select the nehalem processor. 
nocona - Select the nocona processor. opteron - Select the opteron processor. opteron-sse3 - Select the opteron-sse3 processor. penryn - Select the penryn processor. pentium - Select the pentium processor. pentium-m - Select the pentium-m processor. pentium-mmx - Select the pentium-mmx processor. pentium2 - Select the pentium2 processor. pentium3 - Select the pentium3 processor. pentium3m - Select the pentium3m processor. pentium4 - Select the pentium4 processor. pentium4m - Select the pentium4m processor. pentiumpro - Select the pentiumpro processor. prescott - Select the prescott processor. sandybridge - Select the sandybridge processor. silvermont - Select the silvermont processor. skx - Select the skx processor. skylake - Select the skylake processor. skylake-avx512 - Select the skylake-avx512 processor. slm - Select the slm processor. tremont - Select the tremont processor. westmere - Select the westmere processor. winchip-c6 - Select the winchip-c6 processor. winchip2 - Select the winchip2 processor. x86-64 - Select the x86-64 processor. yonah - Select the yonah processor. znver1 - Select the znver1 processor.
Available features for this target:
16bit-mode - 16-bit mode (i8086). 32bit-mode - 32-bit mode (80386). 3dnow - Enable 3DNow! instructions. 3dnowa - Enable 3DNow! Athlon instructions. 64bit - Support 64-bit instructions. 64bit-mode - 64-bit mode (x86_64). adx - Support ADX instructions. aes - Enable AES instructions. atom - Intel Atom processors. avx - Enable AVX instructions. avx2 - Enable AVX2 instructions. avx512bitalg - Enable AVX-512 Bit Algorithms. avx512bw - Enable AVX-512 Byte and Word Instructions. avx512cd - Enable AVX-512 Conflict Detection Instructions. avx512dq - Enable AVX-512 Doubleword and Quadword Instructions. avx512er - Enable AVX-512 Exponential and Reciprocal Instructions. avx512f - Enable AVX-512 instructions. avx512ifma - Enable AVX-512 Integer Fused Multiple-Add. avx512pf - Enable AVX-512 PreFetch Instructions. avx512vbmi - Enable AVX-512 Vector Byte Manipulation Instructions. avx512vbmi2 - Enable AVX-512 further Vector Byte Manipulation Instructions. avx512vl - Enable AVX-512 Vector Length eXtensions. avx512vnni - Enable AVX-512 Vector Neural Network Instructions. avx512vpopcntdq - Enable AVX-512 Population Count Instructions. bmi - Support BMI instructions. bmi2 - Support BMI2 instructions. cldemote - Enable Cache Demote. clflushopt - Flush A Cache Line Optimized. clwb - Cache Line Write Back. clzero - Enable Cache Line Zero. cmov - Enable conditional move instructions. cx16 - 64-bit with cmpxchg16b. ermsb - REP MOVS/STOS are fast. f16c - Support 16-bit floating point conversion instructions. false-deps-lzcnt-tzcnt - LZCNT/TZCNT have a false dependency on dest register. false-deps-popcnt - POPCNT has a false dependency on dest register. fast-11bytenop - Target can quickly decode up to 11 byte NOPs. fast-15bytenop - Target can quickly decode up to 15 byte NOPs. fast-bextr - Indicates that the BEXTR instruction is implemented as a single uop with good throughput.. fast-gather - Indicates if gather is reasonably fast.. 
fast-hops - Prefer horizontal vector math instructions (haddp, phsub, etc.) over normal vector instructions with shuffles. fast-lzcnt - LZCNT instructions are as fast as most simple integer ops. fast-partial-ymm-or-zmm-write - Partial writes to YMM/ZMM registers are fast. fast-scalar-fsqrt - Scalar SQRT is fast (disable Newton-Raphson). fast-shld-rotate - SHLD can be used as a faster rotate. fast-variable-shuffle - Shuffles with variable masks are fast. fast-vector-fsqrt - Vector SQRT is fast (disable Newton-Raphson). fma - Enable three-operand fused multiple-add. fma4 - Enable four-operand fused multiple-add. fsgsbase - Support FS/GS Base instructions. fxsr - Support fxsave/fxrestore instructions. gfni - Enable Galois Field Arithmetic Instructions. glm - Intel Goldmont processors. glp - Intel Goldmont Plus processors. idivl-to-divb - Use 8-bit divide for positive values less than 256. idivq-to-divl - Use 32-bit divide for positive values less than 2^32. invpcid - Invalidate Process-Context Identifier. lea-sp - Use LEA for adjusting the stack pointer. lea-uses-ag - LEA instruction needs inputs at AG stage. lwp - Enable LWP instructions. lzcnt - Support LZCNT instruction. macrofusion - Various instructions can be fused with conditional branches. merge-to-threeway-branch - Merge branches to a three-way conditional branch. mmx - Enable MMX instructions. movbe - Support MOVBE instruction. movdir64b - Support movdir64b instruction. movdiri - Support movdiri instruction. mpx - Support MPX instructions. mwaitx - Enable MONITORX/MWAITX timer functionality. nopl - Enable NOPL instruction. pad-short-functions - Pad short functions. pclmul - Enable packed carry-less multiplication instructions. pconfig - platform configuration instruction. pku - Enable protection keys. popcnt - Support POPCNT instruction. prefer-256-bit - Prefer 256-bit AVX instructions. prefetchwt1 - Prefetch with Intent to Write and T1 Hint. prfchw - Support PRFCHW instructions. 
ptwrite - Support ptwrite instruction. rdpid - Support RDPID instructions. rdrnd - Support RDRAND instruction. rdseed - Support RDSEED instruction. retpoline - Remove speculation of indirect branches from the generated code, either by avoiding them entirely or lowering them with a speculation blocking construct. retpoline-external-thunk - When lowering an indirect call or branch using a retpoline, rely on the specified user provided thunk rather than emitting one ourselves. Only has effect when combined with some other retpoline feature. retpoline-indirect-branches - Remove speculation of indirect branches from the generated code. retpoline-indirect-calls - Remove speculation of indirect calls from the generated code. rtm - Support RTM instructions. sahf - Support LAHF and SAHF instructions. sgx - Enable Software Guard Extensions. sha - Enable SHA instructions. shstk - Support CET Shadow-Stack instructions. slm - Intel Silvermont processors. slow-3ops-lea - LEA instruction with 3 ops or certain registers is slow. slow-incdec - INC and DEC instructions are slower than ADD and SUB. slow-lea - LEA instruction with certain arguments is slow. slow-pmaddwd - PMADDWD is slower than PMULLD. slow-pmulld - PMULLD instruction is slow. slow-shld - SHLD instruction is slow. slow-two-mem-ops - Two memory operand instructions are slow. slow-unaligned-mem-16 - Slow unaligned 16-byte memory access. slow-unaligned-mem-32 - Slow unaligned 32-byte memory access. soft-float - Use software floating point features. sse - Enable SSE instructions. sse-unaligned-mem - Allow unaligned memory operands with SSE instructions. sse2 - Enable SSE2 instructions. sse3 - Enable SSE3 instructions. sse4.1 - Enable SSE 4.1 instructions. sse4.2 - Enable SSE 4.2 instructions. sse4a - Support SSE 4a instructions. ssse3 - Enable SSSE3 instructions. tbm - Enable TBM instructions. tremont - Intel Tremont processors. vaes - Promote selected AES instructions to AVX512/AVX registers. vpclmulqdq - Enable vpclmulqdq instructions. waitpkg - Wait and pause enhancements. wbnoinvd - Write Back No Invalidate. x87 - Enable X87 float instructions. xop - Enable XOP instructions. xsave - Support xsave instructions. xsavec - Support xsavec instructions. xsaveopt - Support xsaveopt instructions. xsaves - Support xsaves instructions.
Sorry, I am not really sure what you mean by @code_native.
With julia -Cskylake-avx512, I still got the same error.
Could you run:
using BenchmarkTools, StaticArrays
A = @SMatrix rand(8,8);
B = @SMatrix rand(8,8);
@code_native debuginfo=:none A * B
@code_native debuginfo=:none A * B
.section TEXT,text,regular,pure_instructions subq $856, %rsp ## imm = 0x358 movq %rdi, %rax vbroadcastsd (%rdx), %ymm10 vmovupd (%rsi), %ymm0 vmovupd %ymm0, -64(%rsp) vmovupd 64(%rsi), %ymm1 vmovupd %ymm1, 64(%rsp) vmulpd %ymm10, %ymm0, %ymm3 vbroadcastsd 8(%rdx), %ymm11 vmulpd %ymm11, %ymm1, %ymm4 vaddpd %ymm4, %ymm3, %ymm4 vmovupd 128(%rsi), %ymm0 vmovupd %ymm0, -128(%rsp) vbroadcastsd 16(%rdx), %ymm12 vmulpd %ymm12, %ymm0, %ymm5 vaddpd %ymm5, %ymm4, %ymm5 vmovupd 192(%rsi), %ymm0 vmovupd %ymm0, -32(%rsp) vbroadcastsd 24(%rdx), %ymm13 vmulpd %ymm13, %ymm0, %ymm6 vaddpd %ymm6, %ymm5, %ymm6 vmovupd 256(%rsi), %ymm0 vmovupd %ymm0, 192(%rsp) vbroadcastsd 32(%rdx), %ymm14 vmulpd %ymm14, %ymm0, %ymm7 vaddpd %ymm7, %ymm6, %ymm7 vmovupd 320(%rsi), %ymm0 vmovupd %ymm0, 32(%rsp) vbroadcastsd 40(%rdx), %ymm15 vmulpd %ymm15, %ymm0, %ymm8 vaddpd %ymm8, %ymm7, %ymm8 vmovupd 384(%rsi), %ymm0 vmovupd %ymm0, 256(%rsp) vbroadcastsd 48(%rdx), %ymm1 vmulpd %ymm1, %ymm0, %ymm9 vaddpd %ymm9, %ymm8, %ymm9 vmovupd 448(%rsi), %ymm2 vmovupd %ymm2, 160(%rsp) vbroadcastsd 56(%rdx), %ymm0 vmulpd %ymm0, %ymm2, %ymm2 vaddpd %ymm2, %ymm9, %ymm2 vmovupd %ymm2, 800(%rsp) vmovupd 32(%rsi), %ymm2 vmovupd %ymm2, 288(%rsp) vmulpd %ymm10, %ymm2, %ymm2 vmovupd 96(%rsi), %ymm3 vmovupd %ymm3, (%rsp) vmulpd %ymm11, %ymm3, %ymm11 vaddpd %ymm11, %ymm2, %ymm2 vmovupd 160(%rsi), %ymm3 vmovupd %ymm3, 224(%rsp) vmulpd %ymm12, %ymm3, %ymm12 vaddpd %ymm12, %ymm2, %ymm2 vmovupd 224(%rsi), %ymm3 vmovupd %ymm3, -96(%rsp) vmulpd %ymm13, %ymm3, %ymm13 vaddpd %ymm13, %ymm2, %ymm2 vmovupd 288(%rsi), %ymm13 vmulpd %ymm14, %ymm13, %ymm14 vaddpd %ymm14, %ymm2, %ymm2 vmovupd 352(%rsi), %ymm3 vmovupd %ymm3, 320(%rsp) vmulpd %ymm15, %ymm3, %ymm15 vaddpd %ymm15, %ymm2, %ymm2 vmovupd 416(%rsi), %ymm3 vmovupd %ymm3, 96(%rsp) vmulpd %ymm1, %ymm3, %ymm1 vaddpd %ymm1, %ymm2, %ymm2 vmovupd 480(%rsi), %ymm1 vmovupd %ymm1, 128(%rsp) vmulpd %ymm0, %ymm1, %ymm0 vaddpd %ymm0, %ymm2, %ymm0 vmovupd %ymm0, 768(%rsp) vbroadcastsd 64(%rdx), 
%ymm1 vmovupd -64(%rsp), %ymm15 vmulpd %ymm1, %ymm15, %ymm2 vbroadcastsd 72(%rdx), %ymm0 vmulpd 64(%rsp), %ymm0, %ymm4 vaddpd %ymm4, %ymm2, %ymm2 vbroadcastsd 80(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm2, %ymm2 vbroadcastsd 88(%rdx), %ymm5 vmovupd -32(%rsp), %ymm14 vmulpd %ymm5, %ymm14, %ymm6 vaddpd %ymm6, %ymm2, %ymm2 vbroadcastsd 96(%rdx), %ymm6 vmovupd 192(%rsp), %ymm11 vmulpd %ymm6, %ymm11, %ymm7 vaddpd %ymm7, %ymm2, %ymm2 vbroadcastsd 104(%rdx), %ymm7 vmulpd 32(%rsp), %ymm7, %ymm8 vaddpd %ymm8, %ymm2, %ymm2 vbroadcastsd 112(%rdx), %ymm8 vmulpd 256(%rsp), %ymm8, %ymm9 vaddpd %ymm9, %ymm2, %ymm2 vbroadcastsd 120(%rdx), %ymm9 vmovupd 160(%rsp), %ymm12 vmulpd %ymm9, %ymm12, %ymm10 vaddpd %ymm10, %ymm2, %ymm2 vmovupd %ymm2, 736(%rsp) vmovupd 288(%rsp), %ymm3 vmulpd %ymm1, %ymm3, %ymm1 vmulpd (%rsp), %ymm0, %ymm0 vaddpd %ymm0, %ymm1, %ymm0 vmulpd 224(%rsp), %ymm4, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm13, 352(%rsp) vmulpd %ymm6, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 320(%rsp), %ymm7, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 96(%rsp), %ymm8, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 128(%rsp), %ymm9, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 704(%rsp) vbroadcastsd 128(%rdx), %ymm0 vmulpd %ymm0, %ymm15, %ymm1 vbroadcastsd 136(%rdx), %ymm2 vmovupd 64(%rsp), %ymm15 vmulpd %ymm2, %ymm15, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 144(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 152(%rdx), %ymm5 vmulpd %ymm5, %ymm14, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 160(%rdx), %ymm6 vmulpd %ymm6, %ymm11, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 168(%rdx), %ymm7 vmovupd 32(%rsp), %ymm11 vmulpd %ymm7, %ymm11, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 176(%rdx), %ymm8 vmulpd 256(%rsp), %ymm8, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 184(%rdx), %ymm9 vmulpd %ymm9, %ymm12, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 
672(%rsp) vmulpd %ymm0, %ymm3, %ymm0 vmovupd (%rsp), %ymm12 vmulpd %ymm2, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 224(%rsp), %ymm4, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm6, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 320(%rsp), %ymm13 vmulpd %ymm7, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 96(%rsp), %ymm14 vmulpd %ymm8, %ymm14, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 128(%rsp), %ymm9, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 640(%rsp) vbroadcastsd 192(%rdx), %ymm0 vmulpd -64(%rsp), %ymm0, %ymm1 vbroadcastsd 200(%rdx), %ymm2 vmulpd %ymm2, %ymm15, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 208(%rdx), %ymm4 vmovupd -128(%rsp), %ymm3 vmulpd %ymm4, %ymm3, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 216(%rdx), %ymm5 vmulpd -32(%rsp), %ymm5, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 224(%rdx), %ymm6 vmulpd 192(%rsp), %ymm6, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 232(%rdx), %ymm7 vmulpd %ymm7, %ymm11, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 240(%rdx), %ymm8 vmovupd 256(%rsp), %ymm11 vmulpd %ymm8, %ymm11, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 248(%rdx), %ymm9 vmulpd 160(%rsp), %ymm9, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 608(%rsp) vmulpd 288(%rsp), %ymm0, %ymm0 vmulpd %ymm2, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 224(%rsp), %ymm12 vmulpd %ymm4, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 352(%rsp), %ymm6, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm7, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm8, %ymm14, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 128(%rsp), %ymm15 vmulpd %ymm9, %ymm15, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 576(%rsp) vbroadcastsd 256(%rdx), %ymm0 vmovupd -64(%rsp), %ymm13 vmulpd %ymm0, %ymm13, %ymm1 vbroadcastsd 264(%rdx), %ymm2 vmovupd 64(%rsp), %ymm14 vmulpd %ymm2, %ymm14, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 272(%rdx), %ymm4 
vmulpd %ymm4, %ymm3, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 280(%rdx), %ymm5 vmovupd -32(%rsp), %ymm3 vmulpd %ymm5, %ymm3, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 288(%rdx), %ymm6 vmulpd 192(%rsp), %ymm6, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 296(%rdx), %ymm7 vmulpd 32(%rsp), %ymm7, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 304(%rdx), %ymm8 vmulpd %ymm8, %ymm11, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 312(%rdx), %ymm9 vmulpd 160(%rsp), %ymm9, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 544(%rsp) vmovupd 288(%rsp), %ymm11 vmulpd %ymm0, %ymm11, %ymm0 vmulpd (%rsp), %ymm2, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm4, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd -96(%rsp), %ymm12 vmulpd %ymm5, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 352(%rsp), %ymm6, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 320(%rsp), %ymm7, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 96(%rsp), %ymm8, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm9, %ymm15, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 512(%rsp) vbroadcastsd 320(%rdx), %ymm0 vmulpd %ymm0, %ymm13, %ymm1 vbroadcastsd 328(%rdx), %ymm2 vmulpd %ymm2, %ymm14, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 336(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 344(%rdx), %ymm5 vmulpd %ymm5, %ymm3, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 352(%rdx), %ymm6 vmovupd 192(%rsp), %ymm15 vmulpd %ymm6, %ymm15, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 360(%rdx), %ymm7 vmovupd 32(%rsp), %ymm14 vmulpd %ymm7, %ymm14, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 368(%rdx), %ymm8 vmulpd 256(%rsp), %ymm8, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 376(%rdx), %ymm9 vmovupd 160(%rsp), %ymm13 vmulpd %ymm9, %ymm13, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 480(%rsp) vmulpd %ymm0, %ymm11, %ymm0 vmovupd (%rsp), %ymm3 vmulpd %ymm2, %ymm3, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 224(%rsp), %ymm4, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm5, %ymm12, %ymm1 vaddpd 
%ymm1, %ymm0, %ymm0 vmovupd 352(%rsp), %ymm11 vmulpd %ymm6, %ymm11, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 320(%rsp), %ymm12 vmulpd %ymm7, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 96(%rsp), %ymm8, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 128(%rsp), %ymm9, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 448(%rsp) vbroadcastsd 384(%rdx), %ymm0 vmulpd -64(%rsp), %ymm0, %ymm1 vbroadcastsd 392(%rdx), %ymm2 vmulpd 64(%rsp), %ymm2, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 400(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 408(%rdx), %ymm5 vmulpd -32(%rsp), %ymm5, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 416(%rdx), %ymm6 vmulpd %ymm6, %ymm15, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 424(%rdx), %ymm7 vmulpd %ymm7, %ymm14, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 432(%rdx), %ymm8 vmovupd 256(%rsp), %ymm14 vmulpd %ymm8, %ymm14, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 440(%rdx), %ymm9 vmulpd %ymm9, %ymm13, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 416(%rsp) vmovupd 288(%rsp), %ymm15 vmulpd %ymm0, %ymm15, %ymm0 vmulpd %ymm2, %ymm3, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 224(%rsp), %ymm13 vmulpd %ymm4, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm6, %ymm11, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm7, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 96(%rsp), %ymm3 vmulpd %ymm8, %ymm3, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 128(%rsp), %ymm11 vmulpd %ymm9, %ymm11, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 384(%rsp) vbroadcastsd 448(%rdx), %ymm0 vmulpd -64(%rsp), %ymm0, %ymm1 vbroadcastsd 456(%rdx), %ymm2 vmulpd 64(%rsp), %ymm2, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 464(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 472(%rdx), %ymm5 vmulpd -32(%rsp), %ymm5, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 480(%rdx), %ymm6 vmulpd 192(%rsp), %ymm6, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 
vbroadcastsd 488(%rdx), %ymm7 vmulpd 32(%rsp), %ymm7, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 496(%rdx), %ymm8 vmulpd %ymm8, %ymm14, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 504(%rdx), %ymm9 vmulpd 160(%rsp), %ymm9, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmulpd %ymm0, %ymm15, %ymm0 vmulpd (%rsp), %ymm2, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm4, %ymm13, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd 352(%rsp), %ymm6, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm7, %ymm12, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm8, %ymm3, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm9, %ymm11, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmovups 800(%rsp), %ymm2 vmovups %ymm2, (%rdi) vmovups 768(%rsp), %ymm2 vmovups %ymm2, 32(%rdi) vmovups 736(%rsp), %ymm2 vmovups %ymm2, 64(%rdi) vmovups 704(%rsp), %ymm2 vmovups %ymm2, 96(%rdi) vmovups 672(%rsp), %ymm2 vmovups %ymm2, 128(%rdi) vmovups 640(%rsp), %ymm2 vmovups %ymm2, 160(%rdi) vmovups 608(%rsp), %ymm2 vmovups %ymm2, 192(%rdi) vmovups 576(%rsp), %ymm2 vmovups %ymm2, 224(%rdi) vmovups 544(%rsp), %ymm2 vmovups %ymm2, 256(%rdi) vmovups 512(%rsp), %ymm2 vmovups %ymm2, 288(%rdi) vmovups 480(%rsp), %ymm2 vmovups %ymm2, 320(%rdi) vmovups 448(%rsp), %ymm2 vmovups %ymm2, 352(%rdi) vmovups 416(%rsp), %ymm2 vmovups %ymm2, 384(%rdi) vmovups 384(%rsp), %ymm2 vmovups %ymm2, 416(%rdi) vmovupd %ymm1, 448(%rdi) vmovupd %ymm0, 480(%rdi) addq $856, %rsp ## imm = 0x358 vzeroupper retq nopw %cs:(%rax,%rax) nop
Interesting. A Goldmont CPU shouldn't be able to use ymm registers.
Could you try running that again, but this time after starting Julia with julia --math-mode=fast?
@code_native debuginfo=:none A * B
.section TEXT,text,regular,pure_instructions subq $152, %rsp vbroadcastsd (%rdx), %ymm0 vmovupd (%rsi), %ymm3 vmovupd 32(%rsi), %ymm5 vbroadcastsd 8(%rdx), %ymm6 vmovupd 64(%rsi), %ymm11 vmovupd 96(%rsi), %ymm12 vbroadcastsd 128(%rdx), %ymm1 vbroadcastsd 192(%rdx), %ymm9 vbroadcastsd 256(%rdx), %ymm10 vbroadcastsd 72(%rdx), %ymm13 vbroadcastsd 16(%rdx), %ymm7 movq %rdi, %rax vmulpd %ymm3, %ymm0, %ymm2 vmulpd %ymm5, %ymm0, %ymm4 vbroadcastsd 64(%rdx), %ymm0 vmulpd %ymm3, %ymm9, %ymm8 vmulpd %ymm5, %ymm9, %ymm14 vbroadcastsd 392(%rdx), %ymm9 vmulpd %ymm5, %ymm0, %ymm15 vfmadd231pd %ymm11, %ymm6, %ymm2 ## ymm2 = (ymm6 ymm11) + ymm2 vfmadd213pd %ymm4, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm4 vmulpd %ymm3, %ymm0, %ymm4 vbroadcastsd 200(%rdx), %ymm0 vmovupd %ymm2, -32(%rsp) vmovupd %ymm6, -96(%rsp) vmulpd %ymm3, %ymm1, %ymm6 vmulpd %ymm5, %ymm1, %ymm2 vbroadcastsd 136(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm0, %ymm8 ## ymm8 = (ymm0 ymm11) + ymm8 vfmadd231pd %ymm11, %ymm13, %ymm4 ## ymm4 = (ymm13 ymm11) + ymm4 vfmadd213pd %ymm15, %ymm12, %ymm13 ## ymm13 = (ymm12 ymm13) + ymm15 vfmadd231pd %ymm11, %ymm1, %ymm6 ## ymm6 = (ymm1 ymm11) + ymm6 vfmadd213pd %ymm2, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm2 vmulpd %ymm3, %ymm10, %ymm2 vmovupd %ymm13, -128(%rsp) vmovupd %ymm1, (%rsp) vmovapd %ymm0, %ymm1 vmulpd %ymm5, %ymm10, %ymm0 vmovupd %ymm6, -64(%rsp) vbroadcastsd 264(%rdx), %ymm6 vmovupd -32(%rsp), %ymm10 vfmadd213pd %ymm14, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm14 vmovupd %ymm1, 96(%rsp) vbroadcastsd 320(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm6, %ymm2 ## ymm2 = (ymm6 ymm11) + ymm2 vfmadd213pd %ymm0, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm0 vmovupd %ymm2, 32(%rsp) vmulpd %ymm3, %ymm1, %ymm14 vmulpd %ymm5, %ymm1, %ymm0 vmovupd %ymm6, 64(%rsp) vbroadcastsd 328(%rdx), %ymm6 vbroadcastsd 144(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm6, %ymm14 ## ymm14 = (ymm6 ymm11) + ymm14 vfmadd213pd %ymm0, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm0 vbroadcastsd 384(%rdx), %ymm0 vmulpd 
%ymm3, %ymm0, %ymm15 vmulpd %ymm5, %ymm0, %ymm0 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm0, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm0 vbroadcastsd 448(%rdx), %ymm0 vmulpd %ymm3, %ymm0, %ymm13 vmulpd %ymm5, %ymm0, %ymm2 vbroadcastsd 456(%rdx), %ymm0 vmovupd -64(%rsp), %ymm3 vmovupd 32(%rsp), %ymm5 vfmadd231pd %ymm11, %ymm0, %ymm13 ## ymm13 = (ymm0 ymm11) + ymm13 vmovupd 128(%rsi), %ymm11 vfmadd231pd %ymm12, %ymm0, %ymm2 ## ymm2 = (ymm0 ymm12) + ymm2 vbroadcastsd 80(%rdx), %ymm0 vmovupd 160(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm1, %ymm3 ## ymm3 = (ymm1 ymm11) + ymm3 vfmadd231pd %ymm11, %ymm7, %ymm10 ## ymm10 = (ymm7 ymm11) + ymm10 vfmadd213pd -96(%rsp), %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + mem vfmadd213pd (%rsp), %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + mem vmovupd %ymm3, -64(%rsp) vbroadcastsd 208(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm0, %ymm4 ## ymm4 = (ymm0 ymm11) + ymm4 vfmadd213pd -128(%rsp), %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + mem vfmadd231pd %ymm11, %ymm3, %ymm8 ## ymm8 = (ymm3 ymm11) + ymm8 vfmadd213pd 96(%rsp), %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + mem vmovupd %ymm8, -128(%rsp) vbroadcastsd 272(%rdx), %ymm8 vfmadd231pd %ymm11, %ymm8, %ymm5 ## ymm5 = (ymm8 ymm11) + ymm5 vfmadd213pd 64(%rsp), %ymm12, %ymm8 ## ymm8 = (ymm12 ymm8) + mem vmovupd %ymm5, 32(%rsp) vbroadcastsd 336(%rdx), %ymm5 vfmadd231pd %ymm11, %ymm5, %ymm14 ## ymm14 = (ymm5 ymm11) + ymm14 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 400(%rdx), %ymm6 vfmadd231pd %ymm11, %ymm6, %ymm15 ## ymm15 = (ymm6 ymm11) + ymm15 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 464(%rdx), %ymm9 vmovupd %ymm15, -96(%rsp) vfmadd231pd %ymm11, %ymm9, %ymm13 ## ymm13 = (ymm9 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm9, %ymm2 ## ymm2 = (ymm9 ymm12) + ymm2 vbroadcastsd 24(%rdx), %ymm9 vmovupd 192(%rsi), %ymm11 vmovupd 224(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm9, %ymm10 ## ymm10 = (ymm9 ymm11) + ymm10 vfmadd213pd %ymm7, 
%ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 88(%rdx), %ymm7 vmovapd %ymm10, %ymm15 vmovupd -128(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm7, %ymm4 ## ymm4 = (ymm7 ymm11) + ymm4 vfmadd213pd %ymm0, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm0 vbroadcastsd 152(%rdx), %ymm0 vmovupd %ymm4, (%rsp) vmovupd -64(%rsp), %ymm4 vfmadd231pd %ymm11, %ymm0, %ymm4 ## ymm4 = (ymm0 ymm11) + ymm4 vfmadd213pd %ymm1, %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + ymm1 vbroadcastsd 216(%rdx), %ymm1 vmovupd %ymm4, -64(%rsp) vmovupd 32(%rsp), %ymm4 vfmadd231pd %ymm11, %ymm1, %ymm10 ## ymm10 = (ymm1 ymm11) + ymm10 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 280(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm3, %ymm4 ## ymm4 = (ymm3 ymm11) + ymm4 vfmadd213pd %ymm8, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm8 vmovupd -96(%rsp), %ymm8 vmovupd %ymm4, 32(%rsp) vbroadcastsd 344(%rdx), %ymm4 vfmadd231pd %ymm11, %ymm4, %ymm14 ## ymm14 = (ymm4 ymm11) + ymm14 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 408(%rdx), %ymm5 vfmadd231pd %ymm11, %ymm5, %ymm8 ## ymm8 = (ymm5 ymm11) + ymm8 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 472(%rdx), %ymm6 vmovupd %ymm8, -96(%rsp) vmovupd -64(%rsp), %ymm8 vfmadd231pd %ymm11, %ymm6, %ymm13 ## ymm13 = (ymm6 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm6, %ymm2 ## ymm2 = (ymm6 ymm12) + ymm2 vbroadcastsd 32(%rdx), %ymm6 vmovupd 256(%rsi), %ymm11 vmovupd 288(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm6, %ymm15 ## ymm15 = (ymm6 ymm11) + ymm15 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 96(%rdx), %ymm9 vmovupd %ymm15, -32(%rsp) vmovupd (%rsp), %ymm15 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 160(%rdx), %ymm7 vfmadd231pd %ymm11, %ymm7, %ymm8 ## ymm8 = (ymm7 ymm11) + ymm8 vfmadd213pd %ymm0, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm0 vbroadcastsd 224(%rdx), %ymm0 vmovupd 
%ymm8, -64(%rsp) vmovupd 32(%rsp), %ymm8 vfmadd231pd %ymm11, %ymm0, %ymm10 ## ymm10 = (ymm0 ymm11) + ymm10 vfmadd213pd %ymm1, %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + ymm1 vbroadcastsd 288(%rdx), %ymm1 vmovupd %ymm10, -128(%rsp) vmovupd -96(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm1, %ymm8 ## ymm8 = (ymm1 ymm11) + ymm8 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 352(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm3, %ymm14 ## ymm14 = (ymm3 ymm11) + ymm14 vfmadd213pd %ymm4, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm4 vbroadcastsd 416(%rdx), %ymm4 vfmadd231pd %ymm11, %ymm4, %ymm10 ## ymm10 = (ymm4 ymm11) + ymm10 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 480(%rdx), %ymm5 vmovupd %ymm10, -96(%rsp) vmovupd -32(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm5, %ymm13 ## ymm13 = (ymm5 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm5, %ymm2 ## ymm2 = (ymm5 ymm12) + ymm2 vbroadcastsd 40(%rdx), %ymm5 vmovupd 320(%rsi), %ymm11 vmovupd 352(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm5, %ymm10 ## ymm10 = (ymm5 ymm11) + ymm10 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 104(%rdx), %ymm6 vmovupd %ymm10, -32(%rsp) vmovupd -128(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm6, %ymm15 ## ymm15 = (ymm6 ymm11) + ymm15 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 168(%rdx), %ymm9 vmovupd %ymm15, (%rsp) vmovupd -64(%rsp), %ymm15 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 232(%rdx), %ymm7 vfmadd231pd %ymm11, %ymm7, %ymm10 ## ymm10 = (ymm7 ymm11) + ymm10 vfmadd213pd %ymm0, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm0 vmovupd -96(%rsp), %ymm0 vmovupd %ymm10, -128(%rsp) vbroadcastsd 296(%rdx), %ymm10 vfmadd231pd %ymm11, %ymm10, %ymm8 ## ymm8 = (ymm10 ymm11) + ymm8 vfmadd213pd %ymm1, %ymm12, %ymm10 ## ymm10 = (ymm12 ymm10) + ymm1 vbroadcastsd 360(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm1, %ymm14 ## ymm14 = (ymm1 ymm11) 
+ ymm14 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 424(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm3, %ymm0 ## ymm0 = (ymm3 ymm11) + ymm0 vfmadd213pd %ymm4, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm4 vbroadcastsd 488(%rdx), %ymm4 vmovupd %ymm0, -96(%rsp) vmovupd -32(%rsp), %ymm0 vfmadd231pd %ymm11, %ymm4, %ymm13 ## ymm13 = (ymm4 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm4, %ymm2 ## ymm2 = (ymm4 ymm12) + ymm2 vbroadcastsd 48(%rdx), %ymm4 vmovupd 384(%rsi), %ymm11 vmovupd 416(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm4, %ymm0 ## ymm0 = (ymm4 ymm11) + ymm0 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 112(%rdx), %ymm5 vmovupd %ymm0, -32(%rsp) vmovupd (%rsp), %ymm0 vfmadd231pd %ymm11, %ymm5, %ymm0 ## ymm0 = (ymm5 ymm11) + ymm0 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 176(%rdx), %ymm6 vmovupd %ymm0, (%rsp) vmovapd %ymm15, %ymm0 vmovupd -128(%rsp), %ymm15 vfmadd231pd %ymm11, %ymm6, %ymm0 ## ymm0 = (ymm6 ymm11) + ymm0 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 240(%rdx), %ymm9 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 304(%rdx), %ymm7 vmovupd %ymm15, -128(%rsp) vmovupd -96(%rsp), %ymm15 vfmadd231pd %ymm11, %ymm7, %ymm8 ## ymm8 = (ymm7 ymm11) + ymm8 vfmadd213pd %ymm10, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm10 vbroadcastsd 368(%rdx), %ymm10 vfmadd231pd %ymm11, %ymm10, %ymm14 ## ymm14 = (ymm10 ymm11) + ymm14 vfmadd213pd %ymm1, %ymm12, %ymm10 ## ymm10 = (ymm12 ymm10) + ymm1 vbroadcastsd 432(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm1, %ymm15 ## ymm15 = (ymm1 ymm11) + ymm15 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 496(%rdx), %ymm3 vmovupd %ymm15, -96(%rsp) vmovupd (%rsp), %ymm15 vfmadd231pd %ymm11, %ymm3, %ymm13 ## ymm13 = (ymm3 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm3, %ymm2 ## ymm2 = (ymm3 ymm12) + ymm2 vbroadcastsd 56(%rdx), 
%ymm3 vmovupd 448(%rsi), %ymm11 vmovupd -32(%rsp), %ymm12 vfmadd231pd %ymm11, %ymm3, %ymm12 ## ymm12 = (ymm3 ymm11) + ymm12 vmovupd %ymm12, -32(%rsp) vmovupd 480(%rsi), %ymm12 vfmadd213pd %ymm4, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm4 vbroadcastsd 120(%rdx), %ymm4 vfmadd231pd %ymm11, %ymm4, %ymm15 ## ymm15 = (ymm4 ymm11) + ymm15 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 184(%rdx), %ymm5 vfmadd231pd %ymm11, %ymm5, %ymm0 ## ymm0 = (ymm5 ymm11) + ymm0 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 248(%rdx), %ymm6 vmovupd %ymm0, -64(%rsp) vmovupd -128(%rsp), %ymm0 vfmadd231pd %ymm11, %ymm6, %ymm0 ## ymm0 = (ymm6 ymm11) + ymm0 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 312(%rdx), %ymm9 vmovupd %ymm0, -128(%rsp) vbroadcastsd 440(%rdx), %ymm0 vfmadd231pd %ymm11, %ymm9, %ymm8 ## ymm8 = (ymm9 ymm11) + ymm8 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 376(%rdx), %ymm7 vfmadd231pd %ymm11, %ymm7, %ymm14 ## ymm14 = (ymm7 ymm11) + ymm14 vfmadd213pd %ymm10, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm10 vmovupd -96(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm0, %ymm10 ## ymm10 = (ymm0 ymm11) + ymm10 vfmadd213pd %ymm1, %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + ymm1 vbroadcastsd 504(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm1, %ymm13 ## ymm13 = (ymm1 ymm11) + ymm13 vmovups -32(%rsp), %ymm11 vfmadd231pd %ymm12, %ymm1, %ymm2 ## ymm2 = (ymm1 ymm12) + ymm2 vmovups %ymm11, (%rdi) vmovupd %ymm3, 32(%rdi) vmovupd %ymm15, 64(%rdi) vmovupd %ymm4, 96(%rdi) vmovups -64(%rsp), %ymm4 vmovups -128(%rsp), %ymm3 vmovups %ymm4, 128(%rdi) vmovupd %ymm5, 160(%rdi) vmovups %ymm3, 192(%rdi) vmovupd %ymm6, 224(%rdi) vmovupd %ymm8, 256(%rdi) vmovupd %ymm9, 288(%rdi) vmovupd %ymm14, 320(%rdi) vmovupd %ymm7, 352(%rdi) vmovupd %ymm10, 384(%rdi) vmovupd %ymm0, 416(%rdi) vmovupd %ymm13, 448(%rdi) vmovupd %ymm2, 480(%rdi) addq $152, %rsp vzeroupper retq nop
Okay, thanks.
I'm not sure what LLVM thinks your CPU is. It is using ymm registers and fma instructions, neither of which goldmont can use.
But it also isn't using zmm registers, which icelake-client should be.
How about, start julia normally (i.e. julia) and:
julia> function checked_sum(x)
           s = 0.0
           @simd for xᵢ ∈ x
               s += xᵢ == xᵢ ? xᵢ : 0.0
           end
           s
       end
checked_sum (generic function with 1 method)
julia> x = rand(128);
julia> @code_native debuginfo=:none checked_sum(x)
?
@code_native debuginfo=:none checked_sum(x) .section TEXT,text,regular,pure_instructions movq 8(%rdi), %rax testq %rax, %rax jle L29 movq (%rdi), %rcx cmpq $16, %rax jae L34 vxorpd %xmm0, %xmm0, %xmm0 xorl %edx, %edx jmp L200 L29: vxorps %xmm0, %xmm0, %xmm0 retq L34: movq %rax, %rdx leaq 96(%rcx), %rsi vxorpd %xmm0, %xmm0, %xmm0 vxorpd %xmm1, %xmm1, %xmm1 vxorpd %xmm2, %xmm2, %xmm2 vxorpd %xmm3, %xmm3, %xmm3 vxorpd %xmm4, %xmm4, %xmm4 andq $-16, %rdx movq %rdx, %rdi nopw %cs:(%rax,%rax) nop L80: vmovupd -96(%rsi), %ymm5 vmovupd -64(%rsi), %ymm7 vmovupd -32(%rsi), %ymm9 vmovupd (%rsi), %ymm11 subq $-128, %rsi addq $-16, %rdi vcmpordpd %ymm0, %ymm5, %ymm6 vcmpordpd %ymm0, %ymm7, %ymm8 vcmpordpd %ymm0, %ymm9, %ymm10 vcmpordpd %ymm0, %ymm11, %ymm12 vandpd %ymm5, %ymm6, %ymm5 vandpd %ymm7, %ymm8, %ymm6 vandpd %ymm9, %ymm10, %ymm7 vaddpd %ymm6, %ymm2, %ymm2 vandpd %ymm11, %ymm12, %ymm6 vaddpd %ymm5, %ymm1, %ymm1 vaddpd %ymm7, %ymm3, %ymm3 vaddpd %ymm6, %ymm4, %ymm4 jne L80 vaddpd %ymm1, %ymm2, %ymm0 cmpq %rdx, %rax vaddpd %ymm0, %ymm3, %ymm0 vaddpd %ymm0, %ymm4, %ymm0 vextractf128 $1, %ymm0, %xmm1 vaddpd %ymm1, %ymm0, %ymm0 vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0] vaddpd %xmm1, %xmm0, %xmm0 je L235 L200: subq %rdx, %rax leaq (%rcx,%rdx,8), %rcx nop L208: vmovsd (%rcx), %xmm1 ## xmm1 = mem[0],zero addq $8, %rcx addq $-1, %rax vcmpordsd %xmm1, %xmm1, %xmm2 vandpd %xmm1, %xmm2, %xmm1 vaddsd %xmm1, %xmm0, %xmm0 jne L208 L235: vzeroupper retq nop
Yeah, it definitely isn't recognizing that you have AVX512. That should look like:
# julia> @code_native debuginfo=:none checked_sum(x)
.text
movq 8(%rdi), %rax
testq %rax, %rax
jle L29
movq (%rdi), %rcx
cmpq $32, %rax
jae L34
vxorpd %xmm0, %xmm0, %xmm0
xorl %edx, %edx
jmp L240
L29:
vxorps %xmm0, %xmm0, %xmm0
retq
L34:
movq %rax, %rdx
andq $-32, %rdx
vxorpd %xmm0, %xmm0, %xmm0
xorl %esi, %esi
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
vxorpd %xmm4, %xmm4, %xmm4
nop
L64:
vmovupd (%rcx,%rsi,8), %zmm5
vmovupd 64(%rcx,%rsi,8), %zmm6
vmovupd 128(%rcx,%rsi,8), %zmm7
vmovupd 192(%rcx,%rsi,8), %zmm8
vcmpordpd %zmm0, %zmm5, %k1
vcmpordpd %zmm0, %zmm6, %k2
vcmpordpd %zmm0, %zmm7, %k3
vcmpordpd %zmm0, %zmm8, %k4
vmovapd %zmm5, %zmm5 {%k1} {z}
vaddpd %zmm5, %zmm1, %zmm1
vmovapd %zmm6, %zmm5 {%k2} {z}
vaddpd %zmm5, %zmm2, %zmm2
vmovapd %zmm7, %zmm5 {%k3} {z}
vaddpd %zmm5, %zmm3, %zmm3
vmovapd %zmm8, %zmm5 {%k4} {z}
vaddpd %zmm5, %zmm4, %zmm4
addq $32, %rsi
cmpq %rsi, %rdx
jne L64
vaddpd %zmm1, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vaddpd %zmm0, %zmm4, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %xmm1, %xmm0, %xmm0
cmpq %rdx, %rax
je L271
nop
L240:
vmovsd (%rcx,%rdx,8), %xmm1 # xmm1 = mem[0],zero
vcmpordsd %xmm1, %xmm1, %k1
vmovsd %xmm1, %xmm0, %xmm1 {%k1} {z}
vaddsd %xmm1, %xmm0, %xmm0
addq $1, %rdx
cmpq %rdx, %rax
jne L240
L271:
vzeroupper
retq
nopw %cs:(%rax,%rax)
nopl (%rax)
The key differences are the use of zmm (vector) registers and k (mask) registers.
I'm curious if the upcoming release version of Julia (1.5) shows the same problem?
This is what the beta version produced
@code_native debuginfo=:none checked_sum(x) .section TEXT,text,regular,pure_instructions movq 8(%rdi), %rax testq %rax, %rax jle L29 movq (%rdi), %rcx cmpq $16, %rax jae L34 vxorpd %xmm0, %xmm0, %xmm0 xorl %edx, %edx jmp L208 L29: vxorps %xmm0, %xmm0, %xmm0 retq L34: movl %eax, %esi andl $15, %esi movq %rax, %rdx subq %rsi, %rdx vxorpd %xmm0, %xmm0, %xmm0 xorl %esi, %esi vxorpd %xmm1, %xmm1, %xmm1 vxorpd %xmm2, %xmm2, %xmm2 vxorpd %xmm3, %xmm3, %xmm3 vxorpd %xmm4, %xmm4, %xmm4 nopw %cs:(%rax,%rax) nopl (%rax) L80: vmovupd (%rcx,%rsi,8), %ymm5 vmovupd 32(%rcx,%rsi,8), %ymm6 vmovupd 64(%rcx,%rsi,8), %ymm7 vmovupd 96(%rcx,%rsi,8), %ymm8 vcmpordpd %ymm0, %ymm5, %ymm9 vcmpordpd %ymm0, %ymm6, %ymm10 vcmpordpd %ymm0, %ymm7, %ymm11 vcmpordpd %ymm0, %ymm8, %ymm12 vandpd %ymm5, %ymm9, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vandpd %ymm6, %ymm10, %ymm5 vaddpd %ymm5, %ymm2, %ymm2 vandpd %ymm7, %ymm11, %ymm5 vaddpd %ymm5, %ymm3, %ymm3 vandpd %ymm8, %ymm12, %ymm5 vaddpd %ymm5, %ymm4, %ymm4 addq $16, %rsi cmpq %rsi, %rdx jne L80 vaddpd %ymm1, %ymm2, %ymm0 vaddpd %ymm0, %ymm3, %ymm0 vaddpd %ymm0, %ymm4, %ymm0 vextractf128 $1, %ymm0, %xmm1 vaddpd %xmm1, %xmm0, %xmm0 vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0] vaddsd %xmm1, %xmm0, %xmm0 cmpq %rdx, %rax je L234 nopw (%rax,%rax) L208: vmovsd (%rcx,%rdx,8), %xmm1 ## xmm1 = mem[0],zero vcmpordsd %xmm1, %xmm1, %xmm2 vandpd %xmm1, %xmm2, %xmm1 vaddsd %xmm1, %xmm0, %xmm0 incq %rdx cmpq %rdx, %rax jne L208 L234: vzeroupper retq nop
I'm guessing versioninfo() also shows goldmont?
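(A lighter-weight check than pasting all of versioninfo() is Sys.CPU_NAME, which Base fills in at startup with the CPU name LLVM detected; this snippet is a suggestion, not something from the original report:)

```julia
# Sys.CPU_NAME is part of Julia's Base.Sys; it holds the CPU name LLVM
# detected at startup (e.g. "goldmont" or "icelake-client").
println(Sys.CPU_NAME)
```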
Bizarre. Maybe LLVM 10 is needed to recognize your CPU? But obviously it is at least partially recognizing that it has AVX and FMA3.
Could you try
using VectorizationBase
VectorizationBase.FMA3
VectorizationBase.AVX512F
VectorizationBase.REGISTER_SIZE
Version info:
Julia Version 1.5.0-beta1.0
Commit 6443f6c95a (2020-05-28 17:42 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)
For the second part
julia> using VectorizationBase
[ Info: Precompiling VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f]

julia> VectorizationBase.FMA3
true

julia> VectorizationBase.AVX512F
true

julia> VectorizationBase.REGISTER_SIZE
64
LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)
Okay, so this version is correct about that, but still not producing avx512 code?!?
(Based on your reported checked_sum results). Bizarre.
Could you see if you still get this error with the beta Julia:
error: inline asm error: This value type register class is not natively supported!
? Given that it isn't generating code as though it is icelake-client, I'm guessing you will.
And does julia -Cicelake-client work with Julia 1.5-beta?
If it does, can you show me the @code_native debuginfo=:none checked_sum(x)?
This time it came with this error:
error: couldn't allocate output register for constraint 'v'
Does -Cicelake-client work with Julia 1.5?
Allocating the output register failed because it tried to allocate a 512-bit output register. Your CPU has 512-bit registers, so this should be possible, but LLVM just thinks it isn't.
I still have Julia 1.4 on my PATH; is it possible to check without replacing 1.4?
/path/to/julia/1.5/beta/julia -Cicelake-client
and you can also try
/path/to/julia/1.5/beta/julia -Cskylake-avx512
Substituting the actual path to the Julia 1.5-beta executable.
Tried with
/path/to/julia/1.5/beta/julia -Cskylake-avx512
Got this error
error: couldn't allocate output register for constraint 'v'
Could you show me the @code_native debuginfo=:none checked_sum(x) when starting Julia with /path/to/julia/1.5/beta/julia -Cskylake-avx512?
This bug bothers me, but if you really need to get things done, you can
using Pkg
Pkg.dev("VectorizationBase")
Pkg.build("VectorizationBase")# if it didn't already build
and then edit ~/.julia/dev/VectorizationBase/src/cpu_info.jl, replacing the line
VectorizationBase.REGISTER_SIZE = 64
with
VectorizationBase.REGISTER_SIZE = 32
This will make some code run slower than it should, but for some reason your LLVM doesn't think 64 is a legal value.
If you try this, let me know if you still get the error with DifferentialEquations.jl's tests.
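The intended edit is just a one-token change. As a sketch (not from the thread), expressed as a string replacement; the constant name matches the cpu_info.jl contents pasted elsewhere in this issue:

```julia
# Hypothetical sketch of the cpu_info.jl edit as a textual replacement;
# in practice you would simply change the line in an editor.
line    = "const REGISTER_SIZE = 64"  # as found in cpu_info.jl
patched = replace(line, " = 64" => " = 32")
# patched == "const REGISTER_SIZE = 32"
```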
This is what I got using -Cskylake-avx512
@code_native debuginfo=:none checked_sum(x) .section TEXT,text,regular,pure_instructions movq 8(%rdi), %rax testq %rax, %rax jle L29 movq (%rdi), %rcx cmpq $16, %rax jae L34 vxorpd %xmm0, %xmm0, %xmm0 xorl %edx, %edx jmp L208 L29: vxorps %xmm0, %xmm0, %xmm0 retq L34: movl %eax, %esi andl $15, %esi movq %rax, %rdx subq %rsi, %rdx vxorpd %xmm0, %xmm0, %xmm0 xorl %esi, %esi vxorpd %xmm1, %xmm1, %xmm1 vxorpd %xmm2, %xmm2, %xmm2 vxorpd %xmm3, %xmm3, %xmm3 vxorpd %xmm4, %xmm4, %xmm4 nopw %cs:(%rax,%rax) nopl (%rax) L80: vmovupd (%rcx,%rsi,8), %ymm5 vmovupd 32(%rcx,%rsi,8), %ymm6 vmovupd 64(%rcx,%rsi,8), %ymm7 vmovupd 96(%rcx,%rsi,8), %ymm8 vcmpordpd %ymm0, %ymm5, %ymm9 vcmpordpd %ymm0, %ymm6, %ymm10 vcmpordpd %ymm0, %ymm7, %ymm11 vcmpordpd %ymm0, %ymm8, %ymm12 vandpd %ymm5, %ymm9, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vandpd %ymm6, %ymm10, %ymm5 vaddpd %ymm5, %ymm2, %ymm2 vandpd %ymm7, %ymm11, %ymm5 vaddpd %ymm5, %ymm3, %ymm3 vandpd %ymm8, %ymm12, %ymm5 vaddpd %ymm5, %ymm4, %ymm4 addq $16, %rsi cmpq %rsi, %rdx jne L80 vaddpd %ymm1, %ymm2, %ymm0 vaddpd %ymm0, %ymm3, %ymm0 vaddpd %ymm0, %ymm4, %ymm0 vextractf128 $1, %ymm0, %xmm1 vaddpd %xmm1, %xmm0, %xmm0 vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0] vaddsd %xmm1, %xmm0, %xmm0 cmpq %rdx, %rax je L234 nopw (%rax,%rax) L208: vmovsd (%rcx,%rdx,8), %xmm1 ## xmm1 = mem[0],zero vcmpordsd %xmm1, %xmm1, %xmm2 vandpd %xmm1, %xmm2, %xmm1 vaddsd %xmm1, %xmm0, %xmm0 incq %rdx cmpq %rdx, %rax jne L208 L234: vzeroupper retq nop
I don't have Pkg.dev it seems
Sorry, Pkg.develop.
And the asm above is strange, in that I get what I showed above when I use -Cskylake-avx512.
It seems to be ignoring the option on your computer.
What you showed is what I get with -Chaswell, so it seems like LLVM is still using that for some reason instead of skylake-avx512 or icelake-client.
To reemphasize what I said before, I don't know where the bug is (other than that it isn't in any of the DifferentialEquations libraries).
While editing the file, you should also define REGISTER_COUNT = 16 (instead of 32), and set every variable with AVX512 in the name to false.
It is working after changing VectorizationBase.REGISTER_SIZE = 32, thank you.
You'd get better performance out of some functions if you change the REGISTER_COUNT as well, because LLVM also thinks the value is 16 instead of 32.
I could automate the workaround, but I don't know if it is a mac, icelake, or mac + icelake problem, or perhaps something else unique to your setup.
These are my current settings do they look right to you ?
const REGISTER_SIZE = 32
const REGISTER_COUNT = 32
const REGISTER_CAPACITY = 2048
const FP256 = false # Is AVX2 fast?
const CACHELINE_SIZE = 64
const CACHE_SIZE = (49152, 524288, 6291456)
const NUM_CORES = 4
const FMA3 = true
const AVX2 = true
const AVX512F = true
const AVX512ER = false
const AVX512PF = false
const AVX512VL = true
const AVX512BW = true
const AVX512DQ = true
const AVX512CD = true
const SIMD_NATIVE_INTEGERS = true
It looks like the asm error at least doesn't completely crash Julia. So maybe VectorizationBase can try to run all the inline asm code and catch it when it errors.
Or maybe VectorizationBase can query CPU information from LLVM directly. @maleadt can LLVM.jl do that?
Or maybe VectorizationBase can query CPU information from LLVM directly. @maleadt can LLVM.jl do that?
What kind of information are you looking for? I'm not sure how LLVM internally queries e.g. vector support, you could look at some optimization passes for that, but there is an easy way to query the CPU 'feature string' at least (not wrapped by LLVM.jl yet, so using the underlying wrappers directly):
julia> using LLVM
julia> unsafe_string(LLVM.API.LLVMGetHostCPUName())
"skylake"
julia> unsafe_string(LLVM.API.LLVMGetHostCPUFeatures())
"+sse2,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,+xsavec,+popcnt,+aes,-avx512bitalg,-movdiri,+xsaves,-avx512er,-avx512vnni,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,+rdseed,-waitpkg,-movdir64b,-sse4a,-avx512bw,+clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,+invpcid,-avx512cd,+avx,-vaes,+rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,+sgx,-shstk,+cmov,-avx512vbmi,+movbe,+xsaveopt,-avx512dq,+adx,-avx512pf,+sse3"
When I queried the CPU 'feature string' this is what it returned:
julia> using LLVM
[ Info: Precompiling LLVM [929cbde3-209d-540e-8aea-75f648917ca0]

julia> unsafe_string(LLVM.API.LLVMGetHostCPUName())
"goldmont"

julia> unsafe_string(LLVM.API.LLVMGetHostCPUFeatures())
"+sse2,+cx16,+sahf,-tbm,-avx512ifma,+sha,+gfni,-fma4,+vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,+xsavec,+popcnt,+aes,-avx512bitalg,-movdiri,+xsaves,-avx512er,-avx512vnni,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,+rdpid,-xop,+rdseed,-waitpkg,-movdir64b,-sse4a,-avx512bw,+clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,+invpcid,-avx512cd,+avx,+vaes,-rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,+sgx,-shstk,+cmov,-avx512vbmi,+movbe,+xsaveopt,-avx512dq,+adx,-avx512pf,+sse3"
I used the "./" operator in a block of code within the same scope as the ODE solver. Is it possible that this corrupted the memory on my computer?
@maleadt
What kind of information are you looking for?
So, the problem is that the host's CPU features AVX512 support, but that LLVM thinks it is Haswell or something.
Unfortunately, unsafe_string(LLVM.API.LLVMGetHostCPUFeatures()) returns -avx512f, and likewise reports the other AVX512 CPU feature flags as disabled.
He can start his Julia process (if using 1.5-beta) with -Cicelake-client or -Cskylake-avx512, yet it still doesn't support AVX512:
1) It never uses more than 16 floating point registers; AVX512 provides 32.
2) It does not use mask registers.
3) It does not use 512-bit registers.
Do you have any idea what could be going on?
I believe the particular error he was getting was the result of this:
const Vec{W,T} = NTuple{W,Core.VecElement{T}}
@inline function vfmadd231(a::Vec{8,Float64}, b::Vec{8,Float64}, c::Vec{8,Float64})
Base.llvmcall("""%res = call <8 x double> asm "vfmadd231pd \$3, \$2, \$1", "=v,0,v,v"(<8 x double> %2, <8 x double> %1, <8 x double> %0)
ret <8 x double> %res""",
Vec{8,Float64},
Tuple{Vec{8,Float64},Vec{8,Float64},Vec{8,Float64}}, a, b, c)
end
x = ntuple(Val(8)) do _
    Core.VecElement(rand())
end;
y = ntuple(Val(8)) do _
    Core.VecElement(rand())
end;
z = ntuple(Val(8)) do _
    Core.VecElement(rand())
end;
vfmadd231(x,y,z)
@YingboMa This will crash Julia if AVX512 isn't supported.
But tests run in a separate process, which is why it gets reported as an error.
But I could run a script with Base.julia_cmd() to test.
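For checking results without the inline-asm crash risk, a hypothetical asm-free reference with the same semantics as vfmadd231 above (elementwise a*b + c) might look like this; vfmadd231_ref is an illustrative name, not part of any package:

```julia
const Vec{W,T} = NTuple{W,Core.VecElement{T}}

# Plain-Julia reference for vfmadd231: returns a .* b .+ c elementwise,
# with no dependence on which instruction sets LLVM enables.
function vfmadd231_ref(a::Vec{W,Float64}, b::Vec{W,Float64}, c::Vec{W,Float64}) where {W}
    ntuple(i -> Core.VecElement(fma(a[i].value, b[i].value, c[i].value)), Val(W))
end

x = ntuple(_ -> Core.VecElement(2.0), Val(4));
y = ntuple(_ -> Core.VecElement(3.0), Val(4));
z = ntuple(_ -> Core.VecElement(1.0), Val(4));
vfmadd231_ref(x, y, z)  # each lane: 2.0 * 3.0 + 1.0 == 7.0
```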
I can work around the problem by lying and saying his CPU is Haswell-like -- so that it'd only use vectors of 4 doubles instead of 8 -- but it seems it'd be much better to fix the actual problem, to let it use full width vectors, masks, and all the registers.
@olaayeko In the meantime, you can use this Haswell-like specification:
const REGISTER_SIZE = 32
const REGISTER_COUNT = 16
const REGISTER_CAPACITY = 512
const FP256 = false # Is AVX2 fast?
const CACHELINE_SIZE = 64
const CACHE_SIZE = (49152, 524288, 6291456)
const NUM_CORES = 4
const FMA3 = true
const AVX2 = true
const AVX512F = false
const AVX512ER = false
const AVX512PF = false
const AVX512VL = false
const AVX512BW = false
const AVX512DQ = false
const AVX512CD = false
const SIMD_NATIVE_INTEGERS = true
Yes it does crash, I get this error
error: inline asm error: This value type register class is not natively supported!
I am failing to use DifferentialEquations.jl (v6.14.0) in Julia 1.4.1 on Mac: