Failed while USING DifferentialEquations.jl in Julia 1.4.1

olaayeko commented 4 years ago

I am unable to use DifferentialEquations.jl (v6.14.0) in Julia 1.4.1 on a Mac:

julia> Pkg.test("DifferentialEquations") Testing DifferentialEquations Status /private/var/folders/g9/yf_v1cnn01j5mplvzgxnlpnh0000gn/T/jl_w7zeiw/Manifest.toml [c3fe647b] AbstractAlgebra v0.9.2 [1520ce14] AbstractTrees v0.3.3 [79e6a3ab] Adapt v1.1.0 [ec485272] ArnoldiMethod v0.0.4 [7d9fca2a] Arpack v0.4.0 [68821587] Arpack_jll v3.5.0+3 [4fba245c] ArrayInterface v2.8.7 [4c555306] ArrayLayouts v0.3.4 [aae01518] BandedMatrices v0.15.11 [764a87c0] BoundaryValueDiffEq v2.5.0 [fa961155] CEnum v0.4.1 [a603d957] CanonicalTraits v0.2.1 [d360d2e6] ChainRulesCore v0.8.0 [861a8166] Combinatorics v1.0.2 [bbf7d656] CommonSubexpressions v0.2.0 [34da2185] Compat v3.12.0 [e66e0078] CompilerSupportLibraries_jll v0.3.3+0 [88cd18e8] ConsoleProgressMonitor v0.1.2 [187b0558] ConstructionBase v1.0.0 [adafc99b] CpuId v0.2.2 [9a962f9c] DataAPI v1.3.0 [864edb3b] DataStructures v0.17.18 [bcd4f6db] DelayDiffEq v5.24.1 [2b5f629d] DiffEqBase v6.38.3 [459566f4] DiffEqCallbacks v2.13.3 [5a0ffddc] DiffEqFinancial v2.3.0 [c894b116] DiffEqJump v6.9.1 [77a26b50] DiffEqNoiseProcess v4.2.0 [055956cb] DiffEqPhysics v3.5.0 [163ba53b] DiffResults v1.0.2 [b552c78f] DiffRules v1.0.1 [0c46a032] DifferentialEquations v6.14.0 [c619ae07] DimensionalPlotRecipes v1.2.0 [b4f34e82] Distances v0.9.0 [31c24e10] Distributions v0.23.4 [ffbed154] DocStringExtensions v0.8.2 [d4d017d3] ExponentialUtilities v1.6.0 [1a297f60] FillArrays v0.8.10 [6a86dc24] FiniteDiff v2.3.2 [59287772] Formatting v0.4.1 [f6369f11] ForwardDiff v0.10.10 [069b7b12] FunctionWrappers v1.1.1 [6b9d7cbe] GeneralizedGenerated v0.2.4 [01680d73] GenericSVD v0.3.0 [d25df0c9] Inflate v0.1.2 [42fd0dbc] IterativeSolvers v0.8.4 [82899510] IteratorInterfaceExtensions v1.0.0 [b14d175d] JuliaVariables v0.2.0 [b964fa9f] LaTeXStrings v1.1.0 [2ee39098] LabelledArrays v1.2.2 [23fbe1c1] Latexify v0.13.5 [1d6d02ad] LeftChildRightSiblingTrees v0.1.2 [093fc24a] LightGraphs v1.3.3 [d3d80556] LineSearches v7.0.1 [e6f89c97] LoggingExtras v0.4.1 [bdcacae8] LoopVectorization 
v0.8.5 [d00139f3] METIS_jll v5.1.0+4 [d8e11817] MLStyle v0.3.1 [1914dd2f] MacroTools v0.5.5 [e1d29d7a] Missings v0.4.3 [961ee093] ModelingToolkit v3.10.1 [46d2c3a1] MuladdMacro v0.2.2 [f9640e96] MultiScaleArrays v1.8.1 [d41bc354] NLSolversBase v7.6.1 [2774e3e8] NLsolve v4.4.0 [77ba4419] NaNMath v0.3.3 [71a1bf82] NameResolution v0.1.3 [6fe1bfb0] OffsetArrays v1.0.4 [4536629a] OpenBLAS_jll v0.3.9+4 [efe28fd5] OpenSpecFun_jll v0.5.3+3 [bac558e1] OrderedCollections v1.2.0 [1dea7af3] OrdinaryDiffEq v5.41.0 [90014a1f] PDMats v0.9.12 [65888b18] ParameterizedFunctions v5.3.0 [d96e819e] Parameters v0.12.1 [e409e4f3] PoissonRandom v0.4.0 [8162dcfd] PrettyPrint v0.1.0 [33c8b6b6] ProgressLogging v0.1.2 [92933f4c] ProgressMeter v1.3.1 [1fd47b50] QuadGK v2.3.1 [e6cf234a] RandomNumbers v1.4.0 [3cdcf5f2] RecipesBase v0.7.0 [731186ca] RecursiveArrayTools v2.4.4 [f2c3362d] RecursiveFactorization v0.1.2 [189a3867] Reexport v0.2.0 [ae029012] Requires v1.0.1 [ae5879a3] ResettableStacks v1.0.0 [79098fc4] Rmath v0.6.1 [f50d1b31] Rmath_jll v0.2.2+1 [f2b01f46] Roots v1.0.2 [21efa798] SIMDPirates v0.8.7 [476501e8] SLEEFPirates v0.5.1 [1bc83da4] SafeTestsets v0.0.1 [699a6c99] SimpleTraits v0.9.2 [a2af1166] SortingAlgorithms v0.3.1 [47a9eef4] SparseDiffTools v1.8.0 [276daf66] SpecialFunctions v0.10.3 [90137ffa] StaticArrays v0.12.3 [2913bbd2] StatsBase v0.32.2 [4c63d2b9] StatsFuns v0.9.5 [9672c7b4] SteadyStateDiffEq v1.5.1 [789caeaf] StochasticDiffEq v6.23.1 [bea87d4a] SuiteSparse_jll v5.4.0+8 [c3572dad] Sundials v4.2.3 [fb77eaff] Sundials_jll v5.2.0+0 [d1185830] SymbolicUtils v0.3.4 [3783bdb8] TableTraits v1.0.0 [5d786b92] TerminalLoggers v0.1.1 [a759f4b9] TimerOutputs v0.5.6 [a2a6695c] TreeViews v0.3.0 [3a884ed6] UnPack v1.0.1 [1986cc42] Unitful v1.2.1 [3d5dd08c] VectorizationBase v0.12.6 [19fa3120] VertexSafeGraphs v0.1.2 [700de1a5] ZygoteRules v0.2.0 [2a0f44e3] Base64 [ade2ca70] Dates [8bb1440f] DelimitedFiles [8ba89e20] Distributed [b77e0a4c] InteractiveUtils [76f85450] LibGit2 
[8f399da3] Libdl [37e2e46d] LinearAlgebra [56ddb016] Logging [d6f4376e] Markdown [a63ad114] Mmap [44cfe95a] Pkg [de0858da] Printf [3fa0cd96] REPL [9a3f8284] Random [ea8e919c] SHA [9e88b42a] Serialization [1a1011a3] SharedArrays [6462fe0b] Sockets [2f01184e] SparseArrays [10745b16] Statistics [4607b0f0] SuiteSparse [8dfed614] Test [cf7118a7] UUIDs [4ec0a83e] Unicode
Test Summary:              | Pass  Total
Default Discrete Algorithm |    1      1
5.772186 seconds (16.05 M allocations: 868.707 MiB, 5.30% gc time)
error: inline asm error: This value type register class is not natively supported!
ERROR: Package DifferentialEquations errored during testing
Stacktrace:
 [1] pkgerror(::String, ::Vararg{String,N} where N) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/Types.jl:53
 [2] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/Operations.jl:1503
 [3] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:316
 [4] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:303
 [5] #test#68 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:297 [inlined]
 [6] test at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:297 [inlined]
 [7] #test#67 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:296 [inlined]
 [8] test at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:296 [inlined]
 [9] test(::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:295
 [10] test(::String) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/API.jl:295

ChrisRackauckas commented 4 years ago

Inline ASM error? This one might be for @chriselrod or @YingboMa. Can you share versioninfo()?

olaayeko commented 4 years ago

Is this what you mean by version info?

Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)

chriselrod commented 4 years ago

There seems to be a mismatch:

CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz

This is a very new Ice Lake CPU, featuring FMA3 and AVX512F instruction sets among others, but:

LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)

LLVM treats this as Goldmont, a low-power Atom-class CPU without AVX, FMA, or AVX-512.

CpuId.jl is probably correctly reporting that the CPU is capable of executing certain assembly instructions, causing some code in LoopVectorization to emit them.

But because a Goldmont CPU is incapable of doing so, LLVM complains. I'm not sure why that line doesn't read:

LLVM: libLLVM-8.0.1 (ORCJIT, icelake)

LLVM 8 is new enough to support icelake. Having your CPU properly recognized as icelake should make a lot of code run much faster.
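The mismatch described above can also be checked from inside a session, without reading the full versioninfo() output. This is only a minimal sketch: `Sys.cpu_info()` and `Sys.CPU_NAME` are existing Base names, while the variable names `infos`, `brand`, and `llvm_name` are made up for illustration.

```julia
# Compare the brand string the OS reports with the host CPU name LLVM detected.
infos = Sys.cpu_info()
brand = isempty(infos) ? "unknown" : infos[1].model  # OS-reported brand string
llvm_name = Sys.CPU_NAME                             # LLVM's host CPU name ("goldmont" in this thread)

println("CPU brand string: ", brand)
println("LLVM host CPU:    ", llvm_name)
```

If the brand string names a recent Core-series chip while the LLVM name is an Atom-class one, that is the same disagreement shown in the two quoted lines.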

olaayeko commented 4 years ago

Interesting. Is there a workaround?

chriselrod commented 4 years ago

How did you install Julia?

olaayeko commented 4 years ago

I think I used Homebrew. The DifferentialEquations module was working fine until today.

chriselrod commented 4 years ago

Could you try installing an official binary instead? Package managers have been known to cause all sorts of problems with dependencies such as LLVM.

Something else you could try is starting Julia with

julia -Cicelake-client
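After starting Julia with a `-C` (long form `--cpu-target`) flag, it can help to confirm which target the session was actually asked to use. The sketch below reads it from `Base.JLOptions()`, which is an internal, undocumented struct, so treat this as a debugging aid that may change between Julia versions; the names `opts` and `target` are illustrative.

```julia
# Base.JLOptions() is internal; the cpu_target field is a C string pointer
# that may be NULL when no -C flag was given.
opts = Base.JLOptions()
target = opts.cpu_target == C_NULL ? "native (default)" : unsafe_string(opts.cpu_target)

println("Requested CPU target: ", target)
println("LLVM host CPU:        ", Sys.CPU_NAME)
```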

chriselrod commented 4 years ago

If you have both the homebrew and official binaries installed, may I suggest you try:

using BenchmarkTools, StaticArrays
A = @SMatrix rand(8,8);
B = @SMatrix rand(8,8);
@benchmark $A * $B

on both versions, and report their respective timings.

Assuming that the official binary works correctly, it should be several times faster than the homebrew version, because LLVM will do a much better job optimizing code when it actually knows what CPU that code is running on (inline ASM issues aside).

olaayeko commented 4 years ago

I uninstalled the Homebrew Julia and installed the official binary, but I am still getting the error. The benchmark produced this:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0

  minimum time:     32.553 ns (0.00% GC)
  median time:      32.592 ns (0.00% GC)
  mean time:        33.066 ns (0.00% GC)
  maximum time:     198.355 ns (0.00% GC)

  samples:          10000
  evals/sample:     993

chriselrod commented 4 years ago

Could you share your new version info, as well as the output of @code_native A * B? 33 ns doesn't sound bad.

olaayeko commented 4 years ago

Version Information:

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)

chriselrod commented 4 years ago

Still goldmont, that is bizarre.

Have you tried starting julia with julia -Cicelake-client?

olaayeko commented 4 years ago

This is the error I got:

olaayeko@Olas-MBP ~ % julia -Cicelake-client
ERROR: Your CPU does not support the CX16 instruction, which is required by this version of Julia! This is often due to running inside of a virtualized environment. Please read https://docs.julialang.org/en/stable/devdocs/sysimg/ for more.

chriselrod commented 4 years ago

Could you show me the @code_native output from the StaticArrays example, as well as the output of julia -Chelp? You could also try julia -Cskylake-avx512.

olaayeko commented 4 years ago

julia -Chelp

Available CPUs for this target:

amdfam10 - Select the amdfam10 processor. athlon - Select the athlon processor. athlon-4 - Select the athlon-4 processor. athlon-fx - Select the athlon-fx processor. athlon-mp - Select the athlon-mp processor. athlon-tbird - Select the athlon-tbird processor. athlon-xp - Select the athlon-xp processor. athlon64 - Select the athlon64 processor. athlon64-sse3 - Select the athlon64-sse3 processor. atom - Select the atom processor. barcelona - Select the barcelona processor. bdver1 - Select the bdver1 processor. bdver2 - Select the bdver2 processor. bdver3 - Select the bdver3 processor. bdver4 - Select the bdver4 processor. bonnell - Select the bonnell processor. broadwell - Select the broadwell processor. btver1 - Select the btver1 processor. btver2 - Select the btver2 processor. c3 - Select the c3 processor. c3-2 - Select the c3-2 processor. cannonlake - Select the cannonlake processor. cascadelake - Select the cascadelake processor. core-avx-i - Select the core-avx-i processor. core-avx2 - Select the core-avx2 processor. core2 - Select the core2 processor. corei7 - Select the corei7 processor. corei7-avx - Select the corei7-avx processor. generic - Select the generic processor. geode - Select the geode processor. goldmont - Select the goldmont processor. goldmont-plus - Select the goldmont-plus processor. haswell - Select the haswell processor. i386 - Select the i386 processor. i486 - Select the i486 processor. i586 - Select the i586 processor. i686 - Select the i686 processor. icelake-client - Select the icelake-client processor. icelake-server - Select the icelake-server processor. ivybridge - Select the ivybridge processor. k6 - Select the k6 processor. k6-2 - Select the k6-2 processor. k6-3 - Select the k6-3 processor. k8 - Select the k8 processor. k8-sse3 - Select the k8-sse3 processor. knl - Select the knl processor. knm - Select the knm processor. lakemont - Select the lakemont processor. nehalem - Select the nehalem processor. 
nocona - Select the nocona processor. opteron - Select the opteron processor. opteron-sse3 - Select the opteron-sse3 processor. penryn - Select the penryn processor. pentium - Select the pentium processor. pentium-m - Select the pentium-m processor. pentium-mmx - Select the pentium-mmx processor. pentium2 - Select the pentium2 processor. pentium3 - Select the pentium3 processor. pentium3m - Select the pentium3m processor. pentium4 - Select the pentium4 processor. pentium4m - Select the pentium4m processor. pentiumpro - Select the pentiumpro processor. prescott - Select the prescott processor. sandybridge - Select the sandybridge processor. silvermont - Select the silvermont processor. skx - Select the skx processor. skylake - Select the skylake processor. skylake-avx512 - Select the skylake-avx512 processor. slm - Select the slm processor. tremont - Select the tremont processor. westmere - Select the westmere processor. winchip-c6 - Select the winchip-c6 processor. winchip2 - Select the winchip2 processor. x86-64 - Select the x86-64 processor. yonah - Select the yonah processor. znver1 - Select the znver1 processor.

Available features for this target:

16bit-mode - 16-bit mode (i8086). 32bit-mode - 32-bit mode (80386). 3dnow - Enable 3DNow! instructions. 3dnowa - Enable 3DNow! Athlon instructions. 64bit - Support 64-bit instructions. 64bit-mode - 64-bit mode (x86_64). adx - Support ADX instructions. aes - Enable AES instructions. atom - Intel Atom processors. avx - Enable AVX instructions. avx2 - Enable AVX2 instructions. avx512bitalg - Enable AVX-512 Bit Algorithms. avx512bw - Enable AVX-512 Byte and Word Instructions. avx512cd - Enable AVX-512 Conflict Detection Instructions. avx512dq - Enable AVX-512 Doubleword and Quadword Instructions. avx512er - Enable AVX-512 Exponential and Reciprocal Instructions. avx512f - Enable AVX-512 instructions. avx512ifma - Enable AVX-512 Integer Fused Multiple-Add. avx512pf - Enable AVX-512 PreFetch Instructions. avx512vbmi - Enable AVX-512 Vector Byte Manipulation Instructions. avx512vbmi2 - Enable AVX-512 further Vector Byte Manipulation Instructions. avx512vl - Enable AVX-512 Vector Length eXtensions. avx512vnni - Enable AVX-512 Vector Neural Network Instructions. avx512vpopcntdq - Enable AVX-512 Population Count Instructions. bmi - Support BMI instructions. bmi2 - Support BMI2 instructions. cldemote - Enable Cache Demote. clflushopt - Flush A Cache Line Optimized. clwb - Cache Line Write Back. clzero - Enable Cache Line Zero. cmov - Enable conditional move instructions. cx16 - 64-bit with cmpxchg16b. ermsb - REP MOVS/STOS are fast. f16c - Support 16-bit floating point conversion instructions. false-deps-lzcnt-tzcnt - LZCNT/TZCNT have a false dependency on dest register. false-deps-popcnt - POPCNT has a false dependency on dest register. fast-11bytenop - Target can quickly decode up to 11 byte NOPs. fast-15bytenop - Target can quickly decode up to 15 byte NOPs. fast-bextr - Indicates that the BEXTR instruction is implemented as a single uop with good throughput.. fast-gather - Indicates if gather is reasonably fast.. 
fast-hops - Prefer horizontal vector math instructions (haddp, phsub, etc.) over normal vector instructions with shuffles. fast-lzcnt - LZCNT instructions are as fast as most simple integer ops. fast-partial-ymm-or-zmm-write - Partial writes to YMM/ZMM registers are fast. fast-scalar-fsqrt - Scalar SQRT is fast (disable Newton-Raphson). fast-shld-rotate - SHLD can be used as a faster rotate. fast-variable-shuffle - Shuffles with variable masks are fast. fast-vector-fsqrt - Vector SQRT is fast (disable Newton-Raphson). fma - Enable three-operand fused multiple-add. fma4 - Enable four-operand fused multiple-add. fsgsbase - Support FS/GS Base instructions. fxsr - Support fxsave/fxrestore instructions. gfni - Enable Galois Field Arithmetic Instructions. glm - Intel Goldmont processors. glp - Intel Goldmont Plus processors. idivl-to-divb - Use 8-bit divide for positive values less than 256. idivq-to-divl - Use 32-bit divide for positive values less than 2^32. invpcid - Invalidate Process-Context Identifier. lea-sp - Use LEA for adjusting the stack pointer. lea-uses-ag - LEA instruction needs inputs at AG stage. lwp - Enable LWP instructions. lzcnt - Support LZCNT instruction. macrofusion - Various instructions can be fused with conditional branches. merge-to-threeway-branch - Merge branches to a three-way conditional branch. mmx - Enable MMX instructions. movbe - Support MOVBE instruction. movdir64b - Support movdir64b instruction. movdiri - Support movdiri instruction. mpx - Support MPX instructions. mwaitx - Enable MONITORX/MWAITX timer functionality. nopl - Enable NOPL instruction. pad-short-functions - Pad short functions. pclmul - Enable packed carry-less multiplication instructions. pconfig - platform configuration instruction. pku - Enable protection keys. popcnt - Support POPCNT instruction. prefer-256-bit - Prefer 256-bit AVX instructions. prefetchwt1 - Prefetch with Intent to Write and T1 Hint. prfchw - Support PRFCHW instructions. 
ptwrite - Support ptwrite instruction. rdpid - Support RDPID instructions. rdrnd - Support RDRAND instruction. rdseed - Support RDSEED instruction. retpoline - Remove speculation of indirect branches from the generated code, either by avoiding them entirely or lowering them with a speculation blocking construct.. retpoline-external-thunk - When lowering an indirect call or branch using a retpoline, rely on the specified user provided thunk rather than emitting one ourselves. Only has effect when combined with some other retpoline feature.. retpoline-indirect-branches - Remove speculation of indirect branches from the generated code.. retpoline-indirect-calls - Remove speculation of indirect calls from the generated code.. rtm - Support RTM instructions. sahf - Support LAHF and SAHF instructions. sgx - Enable Software Guard Extensions. sha - Enable SHA instructions. shstk - Support CET Shadow-Stack instructions. slm - Intel Silvermont processors. slow-3ops-lea - LEA instruction with 3 ops or certain registers is slow. slow-incdec - INC and DEC instructions are slower than ADD and SUB. slow-lea - LEA instruction with certain arguments is slow. slow-pmaddwd - PMADDWD is slower than PMULLD. slow-pmulld - PMULLD instruction is slow. slow-shld - SHLD instruction is slow. slow-two-mem-ops - Two memory operand instructions are slow. slow-unaligned-mem-16 - Slow unaligned 16-byte memory access. slow-unaligned-mem-32 - Slow unaligned 32-byte memory access. soft-float - Use software floating point features.. sse - Enable SSE instructions. sse-unaligned-mem - Allow unaligned memory operands with SSE instructions. sse2 - Enable SSE2 instructions. sse3 - Enable SSE3 instructions. sse4.1 - Enable SSE 4.1 instructions. sse4.2 - Enable SSE 4.2 instructions. sse4a - Support SSE 4a instructions. ssse3 - Enable SSSE3 instructions. tbm - Enable TBM instructions. tremont - Intel Tremont processors. vaes - Promote selected AES instructions to AVX512/AVX registers. 
vpclmulqdq - Enable vpclmulqdq instructions. waitpkg - Wait and pause enhancements. wbnoinvd - Write Back No Invalidate. x87 - Enable X87 float instructions. xop - Enable XOP instructions. xsave - Support xsave instructions. xsavec - Support xsavec instructions. xsaveopt - Support xsaveopt instructions. xsaves - Support xsaves instructions.

olaayeko commented 4 years ago

Sorry, I am not really sure what you mean by @code_native.

olaayeko commented 4 years ago

With julia -Cskylake-avx512, I still got the same error.

chriselrod commented 4 years ago

Could you run:

using BenchmarkTools, StaticArrays
A = @SMatrix rand(8,8);
B = @SMatrix rand(8,8);
@code_native debuginfo=:none A * B
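For anyone unfamiliar with it, @code_native prints the machine code Julia generates for a call. Rather than reading a long listing by eye, the function form `code_native` from InteractiveUtils can write it to a buffer, which can then be searched for register names. This is a rough sketch for x86 machines only; `mul8` is a hypothetical stand-in for the StaticArrays product, and whether the compiler vectorizes it is not guaranteed.

```julia
using InteractiveUtils

# Hypothetical helper: element-wise product of two 8-tuples, small enough to inspect.
mul8(a::NTuple{8,Float64}, b::NTuple{8,Float64}) = a .* b

# Capture the generated assembly as a string instead of printing it.
io = IOBuffer()
code_native(io, mul8, (NTuple{8,Float64}, NTuple{8,Float64}); debuginfo=:none)
asm = String(take!(io))

# xmm = 128-bit (SSE), ymm = 256-bit (AVX), zmm = 512-bit (AVX-512) registers.
for reg in ("xmm", "ymm", "zmm")
    println(rpad(reg, 4), occursin(reg, asm) ? "used" : "not used")
end
```

Seeing ymm or zmm registers means LLVM believes the CPU supports AVX or AVX-512, regardless of the name printed by versioninfo().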

olaayeko commented 4 years ago

@code_native debuginfo=:none A * B

.section TEXT,text,regular,pure_instructions subq $856, %rsp ## imm = 0x358 movq %rdi, %rax vbroadcastsd (%rdx), %ymm10 vmovupd (%rsi), %ymm0 vmovupd %ymm0, -64(%rsp) vmovupd 64(%rsi), %ymm1 vmovupd %ymm1, 64(%rsp) vmulpd %ymm10, %ymm0, %ymm3 vbroadcastsd 8(%rdx), %ymm11 vmulpd %ymm11, %ymm1, %ymm4 vaddpd %ymm4, %ymm3, %ymm4 vmovupd 128(%rsi), %ymm0 vmovupd %ymm0, -128(%rsp) vbroadcastsd 16(%rdx), %ymm12 vmulpd %ymm12, %ymm0, %ymm5 vaddpd %ymm5, %ymm4, %ymm5 vmovupd 192(%rsi), %ymm0 vmovupd %ymm0, -32(%rsp) vbroadcastsd 24(%rdx), %ymm13 vmulpd %ymm13, %ymm0, %ymm6 vaddpd %ymm6, %ymm5, %ymm6 vmovupd 256(%rsi), %ymm0 vmovupd %ymm0, 192(%rsp) vbroadcastsd 32(%rdx), %ymm14 vmulpd %ymm14, %ymm0, %ymm7 vaddpd %ymm7, %ymm6, %ymm7 vmovupd 320(%rsi), %ymm0 vmovupd %ymm0, 32(%rsp) vbroadcastsd 40(%rdx), %ymm15 vmulpd %ymm15, %ymm0, %ymm8 vaddpd %ymm8, %ymm7, %ymm8 vmovupd 384(%rsi), %ymm0 vmovupd %ymm0, 256(%rsp) vbroadcastsd 48(%rdx), %ymm1 vmulpd %ymm1, %ymm0, %ymm9 vaddpd %ymm9, %ymm8, %ymm9 vmovupd 448(%rsi), %ymm2 vmovupd %ymm2, 160(%rsp) vbroadcastsd 56(%rdx), %ymm0 vmulpd %ymm0, %ymm2, %ymm2 vaddpd %ymm2, %ymm9, %ymm2 vmovupd %ymm2, 800(%rsp) vmovupd 32(%rsi), %ymm2 vmovupd %ymm2, 288(%rsp) vmulpd %ymm10, %ymm2, %ymm2 vmovupd 96(%rsi), %ymm3 vmovupd %ymm3, (%rsp) vmulpd %ymm11, %ymm3, %ymm11 vaddpd %ymm11, %ymm2, %ymm2 vmovupd 160(%rsi), %ymm3 vmovupd %ymm3, 224(%rsp) vmulpd %ymm12, %ymm3, %ymm12 vaddpd %ymm12, %ymm2, %ymm2 vmovupd 224(%rsi), %ymm3 vmovupd %ymm3, -96(%rsp) vmulpd %ymm13, %ymm3, %ymm13 vaddpd %ymm13, %ymm2, %ymm2 vmovupd 288(%rsi), %ymm13 vmulpd %ymm14, %ymm13, %ymm14 vaddpd %ymm14, %ymm2, %ymm2 vmovupd 352(%rsi), %ymm3 vmovupd %ymm3, 320(%rsp) vmulpd %ymm15, %ymm3, %ymm15 vaddpd %ymm15, %ymm2, %ymm2 vmovupd 416(%rsi), %ymm3 vmovupd %ymm3, 96(%rsp) vmulpd %ymm1, %ymm3, %ymm1 vaddpd %ymm1, %ymm2, %ymm2 vmovupd 480(%rsi), %ymm1 vmovupd %ymm1, 128(%rsp) vmulpd %ymm0, %ymm1, %ymm0 vaddpd %ymm0, %ymm2, %ymm0 vmovupd %ymm0, 768(%rsp) vbroadcastsd 64(%rdx), 
%ymm1 vmovupd -64(%rsp), %ymm15 vmulpd %ymm1, %ymm15, %ymm2 vbroadcastsd 72(%rdx), %ymm0 vmulpd 64(%rsp), %ymm0, %ymm4 vaddpd %ymm4, %ymm2, %ymm2 vbroadcastsd 80(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm2, %ymm2 vbroadcastsd 88(%rdx), %ymm5 vmovupd -32(%rsp), %ymm14 vmulpd %ymm5, %ymm14, %ymm6 vaddpd %ymm6, %ymm2, %ymm2 vbroadcastsd 96(%rdx), %ymm6 vmovupd 192(%rsp), %ymm11 vmulpd %ymm6, %ymm11, %ymm7 vaddpd %ymm7, %ymm2, %ymm2 vbroadcastsd 104(%rdx), %ymm7 vmulpd 32(%rsp), %ymm7, %ymm8 vaddpd %ymm8, %ymm2, %ymm2 vbroadcastsd 112(%rdx), %ymm8 vmulpd 256(%rsp), %ymm8, %ymm9 vaddpd %ymm9, %ymm2, %ymm2 vbroadcastsd 120(%rdx), %ymm9 vmovupd 160(%rsp), %ymm12 vmulpd %ymm9, %ymm12, %ymm10 vaddpd %ymm10, %ymm2, %ymm2 vmovupd %ymm2, 736(%rsp) vmovupd 288(%rsp), %ymm3 vmulpd %ymm1, %ymm3, %ymm1 vmulpd (%rsp), %ymm0, %ymm0 vaddpd %ymm0, %ymm1, %ymm0 vmulpd 224(%rsp), %ymm4, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm13, 352(%rsp) vmulpd %ymm6, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 320(%rsp), %ymm7, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 96(%rsp), %ymm8, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 128(%rsp), %ymm9, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 704(%rsp) vbroadcastsd 128(%rdx), %ymm0 vmulpd %ymm0, %ymm15, %ymm1 vbroadcastsd 136(%rdx), %ymm2 vmovupd 64(%rsp), %ymm15 vmulpd %ymm2, %ymm15, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 144(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 152(%rdx), %ymm5 vmulpd %ymm5, %ymm14, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 160(%rdx), %ymm6 vmulpd %ymm6, %ymm11, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 168(%rdx), %ymm7 vmovupd 32(%rsp), %ymm11 vmulpd %ymm7, %ymm11, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 176(%rdx), %ymm8 vmulpd 256(%rsp), %ymm8, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 184(%rdx), %ymm9 vmulpd %ymm9, %ymm12, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 
672(%rsp) vmulpd %ymm0, %ymm3, %ymm0 vmovupd (%rsp), %ymm12 vmulpd %ymm2, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 224(%rsp), %ymm4, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm6, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 320(%rsp), %ymm13 vmulpd %ymm7, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 96(%rsp), %ymm14 vmulpd %ymm8, %ymm14, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 128(%rsp), %ymm9, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 640(%rsp) vbroadcastsd 192(%rdx), %ymm0 vmulpd -64(%rsp), %ymm0, %ymm1 vbroadcastsd 200(%rdx), %ymm2 vmulpd %ymm2, %ymm15, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 208(%rdx), %ymm4 vmovupd -128(%rsp), %ymm3 vmulpd %ymm4, %ymm3, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 216(%rdx), %ymm5 vmulpd -32(%rsp), %ymm5, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 224(%rdx), %ymm6 vmulpd 192(%rsp), %ymm6, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 232(%rdx), %ymm7 vmulpd %ymm7, %ymm11, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 240(%rdx), %ymm8 vmovupd 256(%rsp), %ymm11 vmulpd %ymm8, %ymm11, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 248(%rdx), %ymm9 vmulpd 160(%rsp), %ymm9, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 608(%rsp) vmulpd 288(%rsp), %ymm0, %ymm0 vmulpd %ymm2, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 224(%rsp), %ymm12 vmulpd %ymm4, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 352(%rsp), %ymm6, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm7, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm8, %ymm14, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 128(%rsp), %ymm15 vmulpd %ymm9, %ymm15, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 576(%rsp) vbroadcastsd 256(%rdx), %ymm0 vmovupd -64(%rsp), %ymm13 vmulpd %ymm0, %ymm13, %ymm1 vbroadcastsd 264(%rdx), %ymm2 vmovupd 64(%rsp), %ymm14 vmulpd %ymm2, %ymm14, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 272(%rdx), %ymm4 
vmulpd %ymm4, %ymm3, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 280(%rdx), %ymm5 vmovupd -32(%rsp), %ymm3 vmulpd %ymm5, %ymm3, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 288(%rdx), %ymm6 vmulpd 192(%rsp), %ymm6, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 296(%rdx), %ymm7 vmulpd 32(%rsp), %ymm7, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 304(%rdx), %ymm8 vmulpd %ymm8, %ymm11, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 312(%rdx), %ymm9 vmulpd 160(%rsp), %ymm9, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 544(%rsp) vmovupd 288(%rsp), %ymm11 vmulpd %ymm0, %ymm11, %ymm0 vmulpd (%rsp), %ymm2, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm4, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd -96(%rsp), %ymm12 vmulpd %ymm5, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 352(%rsp), %ymm6, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 320(%rsp), %ymm7, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 96(%rsp), %ymm8, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm9, %ymm15, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 512(%rsp) vbroadcastsd 320(%rdx), %ymm0 vmulpd %ymm0, %ymm13, %ymm1 vbroadcastsd 328(%rdx), %ymm2 vmulpd %ymm2, %ymm14, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 336(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 344(%rdx), %ymm5 vmulpd %ymm5, %ymm3, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 352(%rdx), %ymm6 vmovupd 192(%rsp), %ymm15 vmulpd %ymm6, %ymm15, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 360(%rdx), %ymm7 vmovupd 32(%rsp), %ymm14 vmulpd %ymm7, %ymm14, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 368(%rdx), %ymm8 vmulpd 256(%rsp), %ymm8, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 376(%rdx), %ymm9 vmovupd 160(%rsp), %ymm13 vmulpd %ymm9, %ymm13, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 480(%rsp) vmulpd %ymm0, %ymm11, %ymm0 vmovupd (%rsp), %ymm3 vmulpd %ymm2, %ymm3, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 224(%rsp), %ymm4, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm5, %ymm12, %ymm1 vaddpd 
%ymm1, %ymm0, %ymm0 vmovupd 352(%rsp), %ymm11 vmulpd %ymm6, %ymm11, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 320(%rsp), %ymm12 vmulpd %ymm7, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 96(%rsp), %ymm8, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd 128(%rsp), %ymm9, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 448(%rsp) vbroadcastsd 384(%rdx), %ymm0 vmulpd -64(%rsp), %ymm0, %ymm1 vbroadcastsd 392(%rdx), %ymm2 vmulpd 64(%rsp), %ymm2, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 400(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 408(%rdx), %ymm5 vmulpd -32(%rsp), %ymm5, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 416(%rdx), %ymm6 vmulpd %ymm6, %ymm15, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 vbroadcastsd 424(%rdx), %ymm7 vmulpd %ymm7, %ymm14, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 432(%rdx), %ymm8 vmovupd 256(%rsp), %ymm14 vmulpd %ymm8, %ymm14, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 440(%rdx), %ymm9 vmulpd %ymm9, %ymm13, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmovupd %ymm1, 416(%rsp) vmovupd 288(%rsp), %ymm15 vmulpd %ymm0, %ymm15, %ymm0 vmulpd %ymm2, %ymm3, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 224(%rsp), %ymm13 vmulpd %ymm4, %ymm13, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm6, %ymm11, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmulpd %ymm7, %ymm12, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 96(%rsp), %ymm3 vmulpd %ymm8, %ymm3, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd 128(%rsp), %ymm11 vmulpd %ymm9, %ymm11, %ymm1 vaddpd %ymm1, %ymm0, %ymm0 vmovupd %ymm0, 384(%rsp) vbroadcastsd 448(%rdx), %ymm0 vmulpd -64(%rsp), %ymm0, %ymm1 vbroadcastsd 456(%rdx), %ymm2 vmulpd 64(%rsp), %ymm2, %ymm4 vaddpd %ymm4, %ymm1, %ymm1 vbroadcastsd 464(%rdx), %ymm4 vmulpd -128(%rsp), %ymm4, %ymm5 vaddpd %ymm5, %ymm1, %ymm1 vbroadcastsd 472(%rdx), %ymm5 vmulpd -32(%rsp), %ymm5, %ymm6 vaddpd %ymm6, %ymm1, %ymm1 vbroadcastsd 480(%rdx), %ymm6 vmulpd 192(%rsp), %ymm6, %ymm7 vaddpd %ymm7, %ymm1, %ymm1 
vbroadcastsd 488(%rdx), %ymm7 vmulpd 32(%rsp), %ymm7, %ymm8 vaddpd %ymm8, %ymm1, %ymm1 vbroadcastsd 496(%rdx), %ymm8 vmulpd %ymm8, %ymm14, %ymm9 vaddpd %ymm9, %ymm1, %ymm1 vbroadcastsd 504(%rdx), %ymm9 vmulpd 160(%rsp), %ymm9, %ymm10 vaddpd %ymm10, %ymm1, %ymm1 vmulpd %ymm0, %ymm15, %ymm0 vmulpd (%rsp), %ymm2, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm4, %ymm13, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd -96(%rsp), %ymm5, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd 352(%rsp), %ymm6, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm7, %ymm12, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm8, %ymm3, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmulpd %ymm9, %ymm11, %ymm2 vaddpd %ymm2, %ymm0, %ymm0 vmovups 800(%rsp), %ymm2 vmovups %ymm2, (%rdi) vmovups 768(%rsp), %ymm2 vmovups %ymm2, 32(%rdi) vmovups 736(%rsp), %ymm2 vmovups %ymm2, 64(%rdi) vmovups 704(%rsp), %ymm2 vmovups %ymm2, 96(%rdi) vmovups 672(%rsp), %ymm2 vmovups %ymm2, 128(%rdi) vmovups 640(%rsp), %ymm2 vmovups %ymm2, 160(%rdi) vmovups 608(%rsp), %ymm2 vmovups %ymm2, 192(%rdi) vmovups 576(%rsp), %ymm2 vmovups %ymm2, 224(%rdi) vmovups 544(%rsp), %ymm2 vmovups %ymm2, 256(%rdi) vmovups 512(%rsp), %ymm2 vmovups %ymm2, 288(%rdi) vmovups 480(%rsp), %ymm2 vmovups %ymm2, 320(%rdi) vmovups 448(%rsp), %ymm2 vmovups %ymm2, 352(%rdi) vmovups 416(%rsp), %ymm2 vmovups %ymm2, 384(%rdi) vmovups 384(%rsp), %ymm2 vmovups %ymm2, 416(%rdi) vmovupd %ymm1, 448(%rdi) vmovupd %ymm0, 480(%rdi) addq $856, %rsp ## imm = 0x358 vzeroupper retq nopw %cs:(%rax,%rax) nop

chriselrod commented 4 years ago

Interesting. A goldmont CPU shouldn't be able to use ymm registers.

Could you try running that again, but this time after starting Julia with julia --math-mode=fast?
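The point of --math-mode=fast here is that it lets LLVM fuse a multiply and an add into a single FMA instruction where the CPU target allows it. A narrower way to probe the same thing, without a global flag, is `muladd`, which permits (but does not force) fusion for one expression. The sketch below assumes an x86 machine; `axpy` is a hypothetical helper name.

```julia
using InteractiveUtils

# Hypothetical helper: computes a*x + y, allowing the compiler to fuse the two operations.
axpy(a::Float64, x::Float64, y::Float64) = muladd(a, x, y)

# Capture the generated assembly and look for an x86 fused multiply-add mnemonic.
io = IOBuffer()
code_native(io, axpy, (Float64, Float64, Float64); debuginfo=:none)
asm = String(take!(io))

println("result: ", axpy(2.0, 3.0, 1.0))
println("FMA instruction emitted: ", occursin("vfmadd", asm))
```

If no vfmadd instruction shows up on a CPU that supports FMA, that would be another sign that LLVM is targeting the wrong processor.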

olaayeko commented 4 years ago

@code_native debuginfo=:none A * B

.section TEXT,text,regular,pure_instructions subq $152, %rsp vbroadcastsd (%rdx), %ymm0 vmovupd (%rsi), %ymm3 vmovupd 32(%rsi), %ymm5 vbroadcastsd 8(%rdx), %ymm6 vmovupd 64(%rsi), %ymm11 vmovupd 96(%rsi), %ymm12 vbroadcastsd 128(%rdx), %ymm1 vbroadcastsd 192(%rdx), %ymm9 vbroadcastsd 256(%rdx), %ymm10 vbroadcastsd 72(%rdx), %ymm13 vbroadcastsd 16(%rdx), %ymm7 movq %rdi, %rax vmulpd %ymm3, %ymm0, %ymm2 vmulpd %ymm5, %ymm0, %ymm4 vbroadcastsd 64(%rdx), %ymm0 vmulpd %ymm3, %ymm9, %ymm8 vmulpd %ymm5, %ymm9, %ymm14 vbroadcastsd 392(%rdx), %ymm9 vmulpd %ymm5, %ymm0, %ymm15 vfmadd231pd %ymm11, %ymm6, %ymm2 ## ymm2 = (ymm6 ymm11) + ymm2 vfmadd213pd %ymm4, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm4 vmulpd %ymm3, %ymm0, %ymm4 vbroadcastsd 200(%rdx), %ymm0 vmovupd %ymm2, -32(%rsp) vmovupd %ymm6, -96(%rsp) vmulpd %ymm3, %ymm1, %ymm6 vmulpd %ymm5, %ymm1, %ymm2 vbroadcastsd 136(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm0, %ymm8 ## ymm8 = (ymm0 ymm11) + ymm8 vfmadd231pd %ymm11, %ymm13, %ymm4 ## ymm4 = (ymm13 ymm11) + ymm4 vfmadd213pd %ymm15, %ymm12, %ymm13 ## ymm13 = (ymm12 ymm13) + ymm15 vfmadd231pd %ymm11, %ymm1, %ymm6 ## ymm6 = (ymm1 ymm11) + ymm6 vfmadd213pd %ymm2, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm2 vmulpd %ymm3, %ymm10, %ymm2 vmovupd %ymm13, -128(%rsp) vmovupd %ymm1, (%rsp) vmovapd %ymm0, %ymm1 vmulpd %ymm5, %ymm10, %ymm0 vmovupd %ymm6, -64(%rsp) vbroadcastsd 264(%rdx), %ymm6 vmovupd -32(%rsp), %ymm10 vfmadd213pd %ymm14, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm14 vmovupd %ymm1, 96(%rsp) vbroadcastsd 320(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm6, %ymm2 ## ymm2 = (ymm6 ymm11) + ymm2 vfmadd213pd %ymm0, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm0 vmovupd %ymm2, 32(%rsp) vmulpd %ymm3, %ymm1, %ymm14 vmulpd %ymm5, %ymm1, %ymm0 vmovupd %ymm6, 64(%rsp) vbroadcastsd 328(%rdx), %ymm6 vbroadcastsd 144(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm6, %ymm14 ## ymm14 = (ymm6 ymm11) + ymm14 vfmadd213pd %ymm0, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm0 vbroadcastsd 384(%rdx), %ymm0 vmulpd 
%ymm3, %ymm0, %ymm15 vmulpd %ymm5, %ymm0, %ymm0 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm0, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm0 vbroadcastsd 448(%rdx), %ymm0 vmulpd %ymm3, %ymm0, %ymm13 vmulpd %ymm5, %ymm0, %ymm2 vbroadcastsd 456(%rdx), %ymm0 vmovupd -64(%rsp), %ymm3 vmovupd 32(%rsp), %ymm5 vfmadd231pd %ymm11, %ymm0, %ymm13 ## ymm13 = (ymm0 ymm11) + ymm13 vmovupd 128(%rsi), %ymm11 vfmadd231pd %ymm12, %ymm0, %ymm2 ## ymm2 = (ymm0 ymm12) + ymm2 vbroadcastsd 80(%rdx), %ymm0 vmovupd 160(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm1, %ymm3 ## ymm3 = (ymm1 ymm11) + ymm3 vfmadd231pd %ymm11, %ymm7, %ymm10 ## ymm10 = (ymm7 ymm11) + ymm10 vfmadd213pd -96(%rsp), %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + mem vfmadd213pd (%rsp), %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + mem vmovupd %ymm3, -64(%rsp) vbroadcastsd 208(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm0, %ymm4 ## ymm4 = (ymm0 ymm11) + ymm4 vfmadd213pd -128(%rsp), %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + mem vfmadd231pd %ymm11, %ymm3, %ymm8 ## ymm8 = (ymm3 ymm11) + ymm8 vfmadd213pd 96(%rsp), %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + mem vmovupd %ymm8, -128(%rsp) vbroadcastsd 272(%rdx), %ymm8 vfmadd231pd %ymm11, %ymm8, %ymm5 ## ymm5 = (ymm8 ymm11) + ymm5 vfmadd213pd 64(%rsp), %ymm12, %ymm8 ## ymm8 = (ymm12 ymm8) + mem vmovupd %ymm5, 32(%rsp) vbroadcastsd 336(%rdx), %ymm5 vfmadd231pd %ymm11, %ymm5, %ymm14 ## ymm14 = (ymm5 ymm11) + ymm14 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 400(%rdx), %ymm6 vfmadd231pd %ymm11, %ymm6, %ymm15 ## ymm15 = (ymm6 ymm11) + ymm15 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 464(%rdx), %ymm9 vmovupd %ymm15, -96(%rsp) vfmadd231pd %ymm11, %ymm9, %ymm13 ## ymm13 = (ymm9 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm9, %ymm2 ## ymm2 = (ymm9 ymm12) + ymm2 vbroadcastsd 24(%rdx), %ymm9 vmovupd 192(%rsi), %ymm11 vmovupd 224(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm9, %ymm10 ## ymm10 = (ymm9 ymm11) + ymm10 vfmadd213pd %ymm7, 
%ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 88(%rdx), %ymm7 vmovapd %ymm10, %ymm15 vmovupd -128(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm7, %ymm4 ## ymm4 = (ymm7 ymm11) + ymm4 vfmadd213pd %ymm0, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm0 vbroadcastsd 152(%rdx), %ymm0 vmovupd %ymm4, (%rsp) vmovupd -64(%rsp), %ymm4 vfmadd231pd %ymm11, %ymm0, %ymm4 ## ymm4 = (ymm0 ymm11) + ymm4 vfmadd213pd %ymm1, %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + ymm1 vbroadcastsd 216(%rdx), %ymm1 vmovupd %ymm4, -64(%rsp) vmovupd 32(%rsp), %ymm4 vfmadd231pd %ymm11, %ymm1, %ymm10 ## ymm10 = (ymm1 ymm11) + ymm10 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 280(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm3, %ymm4 ## ymm4 = (ymm3 ymm11) + ymm4 vfmadd213pd %ymm8, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm8 vmovupd -96(%rsp), %ymm8 vmovupd %ymm4, 32(%rsp) vbroadcastsd 344(%rdx), %ymm4 vfmadd231pd %ymm11, %ymm4, %ymm14 ## ymm14 = (ymm4 ymm11) + ymm14 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 408(%rdx), %ymm5 vfmadd231pd %ymm11, %ymm5, %ymm8 ## ymm8 = (ymm5 ymm11) + ymm8 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 472(%rdx), %ymm6 vmovupd %ymm8, -96(%rsp) vmovupd -64(%rsp), %ymm8 vfmadd231pd %ymm11, %ymm6, %ymm13 ## ymm13 = (ymm6 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm6, %ymm2 ## ymm2 = (ymm6 ymm12) + ymm2 vbroadcastsd 32(%rdx), %ymm6 vmovupd 256(%rsi), %ymm11 vmovupd 288(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm6, %ymm15 ## ymm15 = (ymm6 ymm11) + ymm15 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 96(%rdx), %ymm9 vmovupd %ymm15, -32(%rsp) vmovupd (%rsp), %ymm15 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 160(%rdx), %ymm7 vfmadd231pd %ymm11, %ymm7, %ymm8 ## ymm8 = (ymm7 ymm11) + ymm8 vfmadd213pd %ymm0, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm0 vbroadcastsd 224(%rdx), %ymm0 vmovupd 
%ymm8, -64(%rsp) vmovupd 32(%rsp), %ymm8 vfmadd231pd %ymm11, %ymm0, %ymm10 ## ymm10 = (ymm0 ymm11) + ymm10 vfmadd213pd %ymm1, %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + ymm1 vbroadcastsd 288(%rdx), %ymm1 vmovupd %ymm10, -128(%rsp) vmovupd -96(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm1, %ymm8 ## ymm8 = (ymm1 ymm11) + ymm8 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 352(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm3, %ymm14 ## ymm14 = (ymm3 ymm11) + ymm14 vfmadd213pd %ymm4, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm4 vbroadcastsd 416(%rdx), %ymm4 vfmadd231pd %ymm11, %ymm4, %ymm10 ## ymm10 = (ymm4 ymm11) + ymm10 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 480(%rdx), %ymm5 vmovupd %ymm10, -96(%rsp) vmovupd -32(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm5, %ymm13 ## ymm13 = (ymm5 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm5, %ymm2 ## ymm2 = (ymm5 ymm12) + ymm2 vbroadcastsd 40(%rdx), %ymm5 vmovupd 320(%rsi), %ymm11 vmovupd 352(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm5, %ymm10 ## ymm10 = (ymm5 ymm11) + ymm10 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 104(%rdx), %ymm6 vmovupd %ymm10, -32(%rsp) vmovupd -128(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm6, %ymm15 ## ymm15 = (ymm6 ymm11) + ymm15 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 168(%rdx), %ymm9 vmovupd %ymm15, (%rsp) vmovupd -64(%rsp), %ymm15 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 232(%rdx), %ymm7 vfmadd231pd %ymm11, %ymm7, %ymm10 ## ymm10 = (ymm7 ymm11) + ymm10 vfmadd213pd %ymm0, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm0 vmovupd -96(%rsp), %ymm0 vmovupd %ymm10, -128(%rsp) vbroadcastsd 296(%rdx), %ymm10 vfmadd231pd %ymm11, %ymm10, %ymm8 ## ymm8 = (ymm10 ymm11) + ymm8 vfmadd213pd %ymm1, %ymm12, %ymm10 ## ymm10 = (ymm12 ymm10) + ymm1 vbroadcastsd 360(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm1, %ymm14 ## ymm14 = (ymm1 ymm11) 
+ ymm14 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 424(%rdx), %ymm3 vfmadd231pd %ymm11, %ymm3, %ymm0 ## ymm0 = (ymm3 ymm11) + ymm0 vfmadd213pd %ymm4, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm4 vbroadcastsd 488(%rdx), %ymm4 vmovupd %ymm0, -96(%rsp) vmovupd -32(%rsp), %ymm0 vfmadd231pd %ymm11, %ymm4, %ymm13 ## ymm13 = (ymm4 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm4, %ymm2 ## ymm2 = (ymm4 ymm12) + ymm2 vbroadcastsd 48(%rdx), %ymm4 vmovupd 384(%rsi), %ymm11 vmovupd 416(%rsi), %ymm12 vfmadd231pd %ymm11, %ymm4, %ymm0 ## ymm0 = (ymm4 ymm11) + ymm0 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 112(%rdx), %ymm5 vmovupd %ymm0, -32(%rsp) vmovupd (%rsp), %ymm0 vfmadd231pd %ymm11, %ymm5, %ymm0 ## ymm0 = (ymm5 ymm11) + ymm0 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 176(%rdx), %ymm6 vmovupd %ymm0, (%rsp) vmovapd %ymm15, %ymm0 vmovupd -128(%rsp), %ymm15 vfmadd231pd %ymm11, %ymm6, %ymm0 ## ymm0 = (ymm6 ymm11) + ymm0 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 240(%rdx), %ymm9 vfmadd231pd %ymm11, %ymm9, %ymm15 ## ymm15 = (ymm9 ymm11) + ymm15 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 304(%rdx), %ymm7 vmovupd %ymm15, -128(%rsp) vmovupd -96(%rsp), %ymm15 vfmadd231pd %ymm11, %ymm7, %ymm8 ## ymm8 = (ymm7 ymm11) + ymm8 vfmadd213pd %ymm10, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm10 vbroadcastsd 368(%rdx), %ymm10 vfmadd231pd %ymm11, %ymm10, %ymm14 ## ymm14 = (ymm10 ymm11) + ymm14 vfmadd213pd %ymm1, %ymm12, %ymm10 ## ymm10 = (ymm12 ymm10) + ymm1 vbroadcastsd 432(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm1, %ymm15 ## ymm15 = (ymm1 ymm11) + ymm15 vfmadd213pd %ymm3, %ymm12, %ymm1 ## ymm1 = (ymm12 ymm1) + ymm3 vbroadcastsd 496(%rdx), %ymm3 vmovupd %ymm15, -96(%rsp) vmovupd (%rsp), %ymm15 vfmadd231pd %ymm11, %ymm3, %ymm13 ## ymm13 = (ymm3 ymm11) + ymm13 vfmadd231pd %ymm12, %ymm3, %ymm2 ## ymm2 = (ymm3 ymm12) + ymm2 vbroadcastsd 56(%rdx), 
%ymm3 vmovupd 448(%rsi), %ymm11 vmovupd -32(%rsp), %ymm12 vfmadd231pd %ymm11, %ymm3, %ymm12 ## ymm12 = (ymm3 ymm11) + ymm12 vmovupd %ymm12, -32(%rsp) vmovupd 480(%rsi), %ymm12 vfmadd213pd %ymm4, %ymm12, %ymm3 ## ymm3 = (ymm12 ymm3) + ymm4 vbroadcastsd 120(%rdx), %ymm4 vfmadd231pd %ymm11, %ymm4, %ymm15 ## ymm15 = (ymm4 ymm11) + ymm15 vfmadd213pd %ymm5, %ymm12, %ymm4 ## ymm4 = (ymm12 ymm4) + ymm5 vbroadcastsd 184(%rdx), %ymm5 vfmadd231pd %ymm11, %ymm5, %ymm0 ## ymm0 = (ymm5 ymm11) + ymm0 vfmadd213pd %ymm6, %ymm12, %ymm5 ## ymm5 = (ymm12 ymm5) + ymm6 vbroadcastsd 248(%rdx), %ymm6 vmovupd %ymm0, -64(%rsp) vmovupd -128(%rsp), %ymm0 vfmadd231pd %ymm11, %ymm6, %ymm0 ## ymm0 = (ymm6 ymm11) + ymm0 vfmadd213pd %ymm9, %ymm12, %ymm6 ## ymm6 = (ymm12 ymm6) + ymm9 vbroadcastsd 312(%rdx), %ymm9 vmovupd %ymm0, -128(%rsp) vbroadcastsd 440(%rdx), %ymm0 vfmadd231pd %ymm11, %ymm9, %ymm8 ## ymm8 = (ymm9 ymm11) + ymm8 vfmadd213pd %ymm7, %ymm12, %ymm9 ## ymm9 = (ymm12 ymm9) + ymm7 vbroadcastsd 376(%rdx), %ymm7 vfmadd231pd %ymm11, %ymm7, %ymm14 ## ymm14 = (ymm7 ymm11) + ymm14 vfmadd213pd %ymm10, %ymm12, %ymm7 ## ymm7 = (ymm12 ymm7) + ymm10 vmovupd -96(%rsp), %ymm10 vfmadd231pd %ymm11, %ymm0, %ymm10 ## ymm10 = (ymm0 ymm11) + ymm10 vfmadd213pd %ymm1, %ymm12, %ymm0 ## ymm0 = (ymm12 ymm0) + ymm1 vbroadcastsd 504(%rdx), %ymm1 vfmadd231pd %ymm11, %ymm1, %ymm13 ## ymm13 = (ymm1 ymm11) + ymm13 vmovups -32(%rsp), %ymm11 vfmadd231pd %ymm12, %ymm1, %ymm2 ## ymm2 = (ymm1 ymm12) + ymm2 vmovups %ymm11, (%rdi) vmovupd %ymm3, 32(%rdi) vmovupd %ymm15, 64(%rdi) vmovupd %ymm4, 96(%rdi) vmovups -64(%rsp), %ymm4 vmovups -128(%rsp), %ymm3 vmovups %ymm4, 128(%rdi) vmovupd %ymm5, 160(%rdi) vmovups %ymm3, 192(%rdi) vmovupd %ymm6, 224(%rdi) vmovupd %ymm8, 256(%rdi) vmovupd %ymm9, 288(%rdi) vmovupd %ymm14, 320(%rdi) vmovupd %ymm7, 352(%rdi) vmovupd %ymm10, 384(%rdi) vmovupd %ymm0, 416(%rdi) vmovupd %ymm13, 448(%rdi) vmovupd %ymm2, 480(%rdi) addq $152, %rsp vzeroupper retq nop

chriselrod commented 4 years ago

Okay, thanks.

I'm not sure what LLVM thinks your CPU is. It is using ymm registers and fma instructions, neither of which goldmont can use.

But it also isn't using zmm registers, which icelake-client should be.

How about starting Julia normally (i.e. julia) and running:

julia> function checked_sum(x)
           s = 0.0
           @simd for xᵢ ∈ x
               s += xᵢ == xᵢ ? xᵢ : 0.0
           end
           s
       end
checked_sum (generic function with 1 method)

julia> x = rand(128);

julia> @code_native debuginfo=:none checked_sum(x)

?

olaayeko commented 4 years ago

@code_native debuginfo=:none checked_sum(x)
        .section        __TEXT,__text,regular,pure_instructions
        movq    8(%rdi), %rax
        testq   %rax, %rax
        jle     L29
        movq    (%rdi), %rcx
        cmpq    $16, %rax
        jae     L34
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx
        jmp     L200
L29:
        vxorps  %xmm0, %xmm0, %xmm0
        retq
L34:
        movq    %rax, %rdx
        leaq    96(%rcx), %rsi
        vxorpd  %xmm0, %xmm0, %xmm0
        vxorpd  %xmm1, %xmm1, %xmm1
        vxorpd  %xmm2, %xmm2, %xmm2
        vxorpd  %xmm3, %xmm3, %xmm3
        vxorpd  %xmm4, %xmm4, %xmm4
        andq    $-16, %rdx
        movq    %rdx, %rdi
        nopw    %cs:(%rax,%rax)
        nop
L80:
        vmovupd -96(%rsi), %ymm5
        vmovupd -64(%rsi), %ymm7
        vmovupd -32(%rsi), %ymm9
        vmovupd (%rsi), %ymm11
        subq    $-128, %rsi
        addq    $-16, %rdi
        vcmpordpd       %ymm0, %ymm5, %ymm6
        vcmpordpd       %ymm0, %ymm7, %ymm8
        vcmpordpd       %ymm0, %ymm9, %ymm10
        vcmpordpd       %ymm0, %ymm11, %ymm12
        vandpd  %ymm5, %ymm6, %ymm5
        vandpd  %ymm7, %ymm8, %ymm6
        vandpd  %ymm9, %ymm10, %ymm7
        vaddpd  %ymm6, %ymm2, %ymm2
        vandpd  %ymm11, %ymm12, %ymm6
        vaddpd  %ymm5, %ymm1, %ymm1
        vaddpd  %ymm7, %ymm3, %ymm3
        vaddpd  %ymm6, %ymm4, %ymm4
        jne     L80
        vaddpd  %ymm1, %ymm2, %ymm0
        cmpq    %rdx, %rax
        vaddpd  %ymm0, %ymm3, %ymm0
        vaddpd  %ymm0, %ymm4, %ymm0
        vextractf128    $1, %ymm0, %xmm1
        vaddpd  %ymm1, %ymm0, %ymm0
        vpermilpd       $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
        vaddpd  %xmm1, %xmm0, %xmm0
        je      L235
L200:
        subq    %rdx, %rax
        leaq    (%rcx,%rdx,8), %rcx
        nop
L208:
        vmovsd  (%rcx), %xmm1 ## xmm1 = mem[0],zero
        addq    $8, %rcx
        addq    $-1, %rax
        vcmpordsd       %xmm1, %xmm1, %xmm2
        vandpd  %xmm1, %xmm2, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        jne     L208
L235:
        vzeroupper
        retq
        nop

chriselrod commented 4 years ago

Yeah, it definitely isn't recognizing that you have AVX512. That should look like:

# julia> @code_native debuginfo=:none checked_sum(x)
        .text
        movq    8(%rdi), %rax
        testq   %rax, %rax
        jle     L29
        movq    (%rdi), %rcx
        cmpq    $32, %rax
        jae     L34
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx
        jmp     L240
L29:
        vxorps  %xmm0, %xmm0, %xmm0
        retq
L34:
        movq    %rax, %rdx
        andq    $-32, %rdx
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %esi, %esi
        vxorpd  %xmm1, %xmm1, %xmm1
        vxorpd  %xmm2, %xmm2, %xmm2
        vxorpd  %xmm3, %xmm3, %xmm3
        vxorpd  %xmm4, %xmm4, %xmm4
        nop
L64:
        vmovupd (%rcx,%rsi,8), %zmm5
        vmovupd 64(%rcx,%rsi,8), %zmm6
        vmovupd 128(%rcx,%rsi,8), %zmm7
        vmovupd 192(%rcx,%rsi,8), %zmm8
        vcmpordpd       %zmm0, %zmm5, %k1
        vcmpordpd       %zmm0, %zmm6, %k2
        vcmpordpd       %zmm0, %zmm7, %k3
        vcmpordpd       %zmm0, %zmm8, %k4
        vmovapd %zmm5, %zmm5 {%k1} {z}
        vaddpd  %zmm5, %zmm1, %zmm1
        vmovapd %zmm6, %zmm5 {%k2} {z}
        vaddpd  %zmm5, %zmm2, %zmm2
        vmovapd %zmm7, %zmm5 {%k3} {z}
        vaddpd  %zmm5, %zmm3, %zmm3
        vmovapd %zmm8, %zmm5 {%k4} {z}
        vaddpd  %zmm5, %zmm4, %zmm4
        addq    $32, %rsi
        cmpq    %rsi, %rdx
        jne     L64
        vaddpd  %zmm1, %zmm2, %zmm0
        vaddpd  %zmm0, %zmm3, %zmm0
        vaddpd  %zmm0, %zmm4, %zmm0
        vextractf64x4   $1, %zmm0, %ymm1
        vaddpd  %zmm1, %zmm0, %zmm0
        vextractf128    $1, %ymm0, %xmm1
        vaddpd  %zmm1, %zmm0, %zmm0
        vpermilpd       $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
        vaddpd  %xmm1, %xmm0, %xmm0
        cmpq    %rdx, %rax
        je      L271
        nop
L240:
        vmovsd  (%rcx,%rdx,8), %xmm1    # xmm1 = mem[0],zero
        vcmpordsd       %xmm1, %xmm1, %k1
        vmovsd  %xmm1, %xmm0, %xmm1 {%k1} {z}
        vaddsd  %xmm1, %xmm0, %xmm0
        addq    $1, %rdx
        cmpq    %rdx, %rax
        jne     L240
L271:
        vzeroupper
        retq
        nopw    %cs:(%rax,%rax)
        nopl    (%rax)

The key differences are

Uses zmm (vector) registers.
Uses k (mask) registers.
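Rather than reading the assembly by eye, the two differences can be checked programmatically. The following is a minimal sketch, not part of any package: native_asm is a hypothetical helper name, and the check simply searches the AT&T-syntax output for zmm and k register names. syntax=:att is passed explicitly because the default syntax may differ between Julia versions.

```julia
using InteractiveUtils  # provides code_native outside the REPL

function checked_sum(x)
    s = 0.0
    @simd for xᵢ ∈ x
        s += xᵢ == xᵢ ? xᵢ : 0.0
    end
    s
end

# Hypothetical helper: capture the native code as a String instead of printing it.
function native_asm(f, argtypes)
    sprint(io -> code_native(io, f, argtypes; syntax=:att, debuginfo=:none))
end

asm = native_asm(checked_sum, (Vector{Float64},))

# In AT&T syntax every register is written with a leading '%'.
uses_zmm  = occursin("%zmm", asm)       # 512-bit vector registers
uses_mask = occursin(r"%k[0-7]", asm)   # AVX512 mask registers

println("uses zmm registers:  ", uses_zmm)
println("uses mask registers: ", uses_mask)
```

If both lines print false on a CPU that has AVX512, LLVM is generating code for a narrower target, as in the output posted above.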

chriselrod commented 4 years ago

I'm curious if the upcoming release version of Julia (1.5) shows the same problem?

olaayeko commented 4 years ago

This is what the beta version produced

@code_native debuginfo=:none checked_sum(x)
        .section        __TEXT,__text,regular,pure_instructions
        movq    8(%rdi), %rax
        testq   %rax, %rax
        jle     L29
        movq    (%rdi), %rcx
        cmpq    $16, %rax
        jae     L34
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx
        jmp     L208
L29:
        vxorps  %xmm0, %xmm0, %xmm0
        retq
L34:
        movl    %eax, %esi
        andl    $15, %esi
        movq    %rax, %rdx
        subq    %rsi, %rdx
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %esi, %esi
        vxorpd  %xmm1, %xmm1, %xmm1
        vxorpd  %xmm2, %xmm2, %xmm2
        vxorpd  %xmm3, %xmm3, %xmm3
        vxorpd  %xmm4, %xmm4, %xmm4
        nopw    %cs:(%rax,%rax)
        nopl    (%rax)
L80:
        vmovupd (%rcx,%rsi,8), %ymm5
        vmovupd 32(%rcx,%rsi,8), %ymm6
        vmovupd 64(%rcx,%rsi,8), %ymm7
        vmovupd 96(%rcx,%rsi,8), %ymm8
        vcmpordpd       %ymm0, %ymm5, %ymm9
        vcmpordpd       %ymm0, %ymm6, %ymm10
        vcmpordpd       %ymm0, %ymm7, %ymm11
        vcmpordpd       %ymm0, %ymm8, %ymm12
        vandpd  %ymm5, %ymm9, %ymm5
        vaddpd  %ymm5, %ymm1, %ymm1
        vandpd  %ymm6, %ymm10, %ymm5
        vaddpd  %ymm5, %ymm2, %ymm2
        vandpd  %ymm7, %ymm11, %ymm5
        vaddpd  %ymm5, %ymm3, %ymm3
        vandpd  %ymm8, %ymm12, %ymm5
        vaddpd  %ymm5, %ymm4, %ymm4
        addq    $16, %rsi
        cmpq    %rsi, %rdx
        jne     L80
        vaddpd  %ymm1, %ymm2, %ymm0
        vaddpd  %ymm0, %ymm3, %ymm0
        vaddpd  %ymm0, %ymm4, %ymm0
        vextractf128    $1, %ymm0, %xmm1
        vaddpd  %xmm1, %xmm0, %xmm0
        vpermilpd       $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
        vaddsd  %xmm1, %xmm0, %xmm0
        cmpq    %rdx, %rax
        je      L234
        nopw    (%rax,%rax)
L208:
        vmovsd  (%rcx,%rdx,8), %xmm1 ## xmm1 = mem[0],zero
        vcmpordsd       %xmm1, %xmm1, %xmm2
        vandpd  %xmm1, %xmm2, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        incq    %rdx
        cmpq    %rdx, %rax
        jne     L208
L234:
        vzeroupper
        retq
        nop

chriselrod commented 4 years ago

I'm guessing versioninfo() also shows goldmont?

Bizarre. Maybe LLVM 10 is needed to recognize your CPU? But obviously it is at least partially recognizing that it has AVX and FMA3.

Could you try

using VectorizationBase
VectorizationBase.FMA3
VectorizationBase.AVX512F
VectorizationBase.REGISTER_SIZE

olaayeko commented 4 years ago

Version info:

Julia Version 1.5.0-beta1.0
Commit 6443f6c95a (2020-05-28 17:42 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)

For the second part

julia> using VectorizationBase
[ Info: Precompiling VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f]

julia> VectorizationBase.FMA3
true

julia> VectorizationBase.AVX512F
true

julia> VectorizationBase.REGISTER_SIZE
64

chriselrod commented 4 years ago

LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)

Okay, so this version is correct about that, but still not producing avx512 code?!? (Based on your reported checked_sum results). Bizarre.
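The CPU name shown at the end of the versioninfo() LLVM line can also be read on its own, which makes it easier to compare between Julia versions. A minimal sketch: Sys.CPU_NAME is the value versioninfo() prints there, while the tuple of AVX512-capable names is just the two names mentioned in this thread, not an exhaustive list.

```julia
# Sys.CPU_NAME holds the host CPU name that Julia obtained from LLVM.
cpu = Sys.CPU_NAME
println("LLVM reports the host CPU as: ", cpu)

# Only the two AVX512-capable target names mentioned in this thread (illustrative, not exhaustive).
avx512_names = ("icelake-client", "skylake-avx512")
println("Is one of the AVX512 names from this thread: ", cpu in avx512_names)
```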

Could you see if you still get this error with the beta Julia:

error: inline asm error: This value type register class is not natively supported!

? Given that it isn't generating code as though it is icelake-client, I'm guessing you will.

chriselrod commented 4 years ago

And does julia -Cicelake-client work with Julia 1.5-beta?

If it does, can you show me the @code_native debuginfo=:none checked_sum(x)?

olaayeko commented 4 years ago

This time it came with this error:

error: couldn't allocate output register for constraint 'v'

chriselrod commented 4 years ago

Does -Cicelake-client work with Julia 1.5?

Allocating the output register failed because it tried to allocate a 512-bit output register. Your CPU has 512-bit registers, so this should be possible, but LLVM just thinks it isn't.

olaayeko commented 4 years ago

I still have Julia 1.4 on my PATH. Is it possible to check without replacing 1.4?

chriselrod commented 4 years ago

/path/to/julia/1.5/beta/julia -Cicelake-client

and you can also try

/path/to/julia/1.5/beta/julia -Cskylake-avx512

Substituting the actual path to the Julia 1.5-beta executable.

olaayeko commented 4 years ago

Tried with

/path/to/julia/1.5/beta/julia -Cskylake-avx512

Got this error

error: couldn't allocate output register for constraint 'v'

chriselrod commented 4 years ago

Could you show me the @code_native debuginfo=:none checked_sum(x) when starting Julia with /path/to/julia/1.5/beta/julia -Cskylake-avx512?

chriselrod commented 4 years ago

This bug bothers me, but if you really need to get things done, you can

using Pkg
Pkg.dev("VectorizationBase")
Pkg.build("VectorizationBase")# if it didn't already build

and then edit ~/.julia/dev/VectorizationBase/src/cpu_info.jl, replacing the line

const REGISTER_SIZE = 64

with

const REGISTER_SIZE = 32

This will make some code run slower than it should, but for some reason your LLVM doesn't think 64 is a legal value. If you try this, let me know if you still get the error with DifferentialEquations.jl's tests.

olaayeko commented 4 years ago

This is what I got using -Cskylake-avx512

@code_native debuginfo=:none checked_sum(x)
        .section        __TEXT,__text,regular,pure_instructions
        movq    8(%rdi), %rax
        testq   %rax, %rax
        jle     L29
        movq    (%rdi), %rcx
        cmpq    $16, %rax
        jae     L34
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx
        jmp     L208
L29:
        vxorps  %xmm0, %xmm0, %xmm0
        retq
L34:
        movl    %eax, %esi
        andl    $15, %esi
        movq    %rax, %rdx
        subq    %rsi, %rdx
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %esi, %esi
        vxorpd  %xmm1, %xmm1, %xmm1
        vxorpd  %xmm2, %xmm2, %xmm2
        vxorpd  %xmm3, %xmm3, %xmm3
        vxorpd  %xmm4, %xmm4, %xmm4
        nopw    %cs:(%rax,%rax)
        nopl    (%rax)
L80:
        vmovupd (%rcx,%rsi,8), %ymm5
        vmovupd 32(%rcx,%rsi,8), %ymm6
        vmovupd 64(%rcx,%rsi,8), %ymm7
        vmovupd 96(%rcx,%rsi,8), %ymm8
        vcmpordpd       %ymm0, %ymm5, %ymm9
        vcmpordpd       %ymm0, %ymm6, %ymm10
        vcmpordpd       %ymm0, %ymm7, %ymm11
        vcmpordpd       %ymm0, %ymm8, %ymm12
        vandpd  %ymm5, %ymm9, %ymm5
        vaddpd  %ymm5, %ymm1, %ymm1
        vandpd  %ymm6, %ymm10, %ymm5
        vaddpd  %ymm5, %ymm2, %ymm2
        vandpd  %ymm7, %ymm11, %ymm5
        vaddpd  %ymm5, %ymm3, %ymm3
        vandpd  %ymm8, %ymm12, %ymm5
        vaddpd  %ymm5, %ymm4, %ymm4
        addq    $16, %rsi
        cmpq    %rsi, %rdx
        jne     L80
        vaddpd  %ymm1, %ymm2, %ymm0
        vaddpd  %ymm0, %ymm3, %ymm0
        vaddpd  %ymm0, %ymm4, %ymm0
        vextractf128    $1, %ymm0, %xmm1
        vaddpd  %xmm1, %xmm0, %xmm0
        vpermilpd       $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
        vaddsd  %xmm1, %xmm0, %xmm0
        cmpq    %rdx, %rax
        je      L234
        nopw    (%rax,%rax)
L208:
        vmovsd  (%rcx,%rdx,8), %xmm1 ## xmm1 = mem[0],zero
        vcmpordsd       %xmm1, %xmm1, %xmm2
        vandpd  %xmm1, %xmm2, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        incq    %rdx
        cmpq    %rdx, %rax
        jne     L208
L234:
        vzeroupper
        retq
        nop

olaayeko commented 4 years ago

I don't have Pkg.dev it seems

chriselrod commented 4 years ago

Sorry, Pkg.develop.

And the asm above is strange, in that I get what I showed above when I use -Cskylake-avx512. It seems to be ignoring the option on your computer. What you showed is what I get with -Chaswell, so it seems like LLVM is still using that for some reason instead of skylake-avx512 or icelake-client.

To reemphasize what I said before, I don't know where the bug is (other than that it isn't in any of the DifferentialEquations libraries)

chriselrod commented 4 years ago

While editing the file, you should also define REGISTER_COUNT = 16 (instead of 32), and set every variable with AVX512 in the name to false.

olaayeko commented 4 years ago

It is working after changing VectorizationBase.REGISTER_SIZE to 32, thank you.

chriselrod commented 4 years ago

You'd get better performance out of some functions if you change the REGISTER_COUNT as well, because LLVM also thinks the value is 16 instead of 32.

I could automate the workaround, but I don't know if it is a mac, icelake, or mac + icelake problem, or perhaps something else unique to your setup.

olaayeko commented 4 years ago

These are my current settings. Do they look right to you?

const REGISTER_SIZE = 32
const REGISTER_COUNT = 32
const REGISTER_CAPACITY = 2048
const FP256 = false # Is AVX2 fast?
const CACHELINE_SIZE = 64
const CACHE_SIZE = (49152, 524288, 6291456)
const NUM_CORES = 4
const FMA3 = true
const AVX2 = true
const AVX512F = true
const AVX512ER = false
const AVX512PF = false
const AVX512VL = true
const AVX512BW = true
const AVX512DQ = true
const AVX512CD = true
const SIMD_NATIVE_INTEGERS = true

YingboMa commented 4 years ago

It looks like the asm error at least doesn't completely crash Julia. So maybe VectorizationBase can try to run all the inline asm code and catch it when it errors.

Or maybe VectorizationBase can query CPU information from LLVM directly. @maleadt can LLVM.jl do that?

maleadt commented 4 years ago

Or maybe VectorizationBase can query CPU information from LLVM directly. @maleadt can LLVM.jl do that?

What kind of information are you looking for? I'm not sure how LLVM internally queries e.g. vector support, you could look at some optimization passes for that, but there is an easy way to query the CPU 'feature string' at least (not wrapped by LLVM.jl yet, so using the underlying wrappers directly):

julia> using LLVM

julia> unsafe_string(LLVM.API.LLVMGetHostCPUName())
"skylake"

julia> unsafe_string(LLVM.API.LLVMGetHostCPUFeatures())
"+sse2,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,+xsavec,+popcnt,+aes,-avx512bitalg,-movdiri,+xsaves,-avx512er,-avx512vnni,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,+rdseed,-waitpkg,-movdir64b,-sse4a,-avx512bw,+clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,+invpcid,-avx512cd,+avx,-vaes,+rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,+sgx,-shstk,+cmov,-avx512vbmi,+movbe,+xsaveopt,-avx512dq,+adx,-avx512pf,+sse3"

olaayeko commented 4 years ago

When I queried the CPU 'feature string' this is what it returned:

julia> using LLVM
[ Info: Precompiling LLVM [929cbde3-209d-540e-8aea-75f648917ca0]

julia> unsafe_string(LLVM.API.LLVMGetHostCPUName())
"goldmont"

julia> unsafe_string(LLVM.API.LLVMGetHostCPUFeatures())
"+sse2,+cx16,+sahf,-tbm,-avx512ifma,+sha,+gfni,-fma4,+vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,+xsavec,+popcnt,+aes,-avx512bitalg,-movdiri,+xsaves,-avx512er,-avx512vnni,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,+rdpid,-xop,+rdseed,-waitpkg,-movdir64b,-sse4a,-avx512bw,+clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,+invpcid,-avx512cd,+avx,+vaes,-rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,+sgx,-shstk,+cmov,-avx512vbmi,+movbe,+xsaveopt,-avx512dq,+adx,-avx512pf,+sse3"
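The feature string is hard to scan by eye, so here is a small sketch that splits it into enabled and disabled features and picks out the AVX512 ones. It uses only a shortened sample of the string reported above (every flag in the sample appears in that output); the variable names are illustrative.

```julia
# Shortened sample of the feature string reported above, not the full list.
features = "+sse2,+avx,+avx2,+fma,-avx512f,-avx512vl,-avx512bw,-avx512dq,-avx512cd,+vaes,+gfni"

flags = split(features, ',')

# "+name" means LLVM considers the feature available, "-name" that it does not.
enabled  = [f[2:end] for f in flags if startswith(f, "+")]
disabled = [f[2:end] for f in flags if startswith(f, "-")]

avx512_enabled  = filter(f -> startswith(f, "avx512"), enabled)
avx512_disabled = filter(f -> startswith(f, "avx512"), disabled)

println("AVX512 features enabled:  ", avx512_enabled)
println("AVX512 features disabled: ", avx512_disabled)
```

For this sample, every AVX512 feature ends up in the disabled list while AVX2 and FMA are enabled, which matches the ymm-only code shown earlier in the thread.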

olaayeko commented 4 years ago

I used the "./" operator in a block of code within the same scope as the ODESolver. Is it possible that this corrupted the memory on my computer ?

chriselrod commented 4 years ago

@maleadt

What kind of information are you looking for?

So, the problem is that the host CPU supports AVX512, but LLVM thinks it is Haswell or something similar. Unfortunately, unsafe_string(LLVM.API.LLVMGetHostCPUFeatures()) returns -avx512f, with all the other AVX512 feature flags disabled as well.

He can start his Julia process (if using 1.5-beta) with -Cicelake-client or -Cskylake-avx512, yet it still doesn't support AVX512:

1) Never uses more than 16 floating point registers; AVX512 provides 32.
2) It does not use mask registers.
3) It does not use 512-bit registers.

Do you have any idea what could be going on?

I believe the particular error he was getting was the result of this:

const Vec{W,T} = NTuple{W,Core.VecElement{T}}
@inline function vfmadd231(a::Vec{8,Float64}, b::Vec{8,Float64}, c::Vec{8,Float64})
    Base.llvmcall("""%res = call <8 x double> asm "vfmadd231pd \$3, \$2, \$1", "=v,0,v,v"(<8 x double> %2, <8 x double> %1, <8 x double> %0)
    ret <8 x double> %res""",
    Vec{8,Float64},
    Tuple{Vec{8,Float64},Vec{8,Float64},Vec{8,Float64}}, a, b, c)
end

x = ntuple(Val(8)) do _
    Core.VecElement(rand())
end;
y = ntuple(Val(8)) do _
    Core.VecElement(rand())
end;
z = ntuple(Val(8)) do _
    Core.VecElement(rand())
end;
vfmadd231(x,y,z)

@YingboMa This will crash Julia if AVX512 isn't supported. But tests run in a separate process, which is why it gets reported as an error. But I could run a script with Base.julia_cmd() to test.
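A rough sketch of what such a Base.julia_cmd() check could look like. runs_cleanly is a hypothetical helper, not something VectorizationBase provides; it runs a snippet in a fresh Julia process and only reports whether that process exited normally, so a crash in the child cannot take down the parent.

```julia
# Hypothetical helper: run `code` in a separate Julia process and report
# whether that process exited with status 0. Output of the child is discarded.
function runs_cleanly(code::AbstractString)
    cmd = `$(Base.julia_cmd()) --startup-file=no -e $code`
    return success(pipeline(cmd; stdout=devnull, stderr=devnull))
end

# Stand-ins for a real probe: one child that exits normally, one that fails.
println(runs_cleanly("exit(0)"))
println(runs_cleanly("exit(1)"))
```

In practice the probe string would contain the inline asm call above, and a false result would mean that 512-bit registers should not be assumed.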

I can work around the problem by lying and saying his CPU is Haswell-like -- so that it'd only use vectors of up to 4 Float64s instead of 8 -- but it seems it'd be much better to fix the actual problem, to let it use full width vectors, masks, and all the registers.

@olaayeko This is a Haswell-like specification; use it:

const REGISTER_SIZE = 32
const REGISTER_COUNT = 16
const REGISTER_CAPACITY = 512
const FP256 = false # Is AVX2 fast?
const CACHELINE_SIZE = 64
const CACHE_SIZE = (49152, 524288, 6291456)
const NUM_CORES = 4
const FMA3 = true
const AVX2 = true
const AVX512F = false
const AVX512ER = false
const AVX512PF = false
const AVX512VL = false
const AVX512BW = false
const AVX512DQ = false
const AVX512CD = false
const SIMD_NATIVE_INTEGERS = true
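To make the numbers concrete, here is a short sketch of how these settings relate to vector width. Both helper names are hypothetical, and it assumes REGISTER_SIZE is in bytes and that REGISTER_CAPACITY is simply REGISTER_SIZE * REGISTER_COUNT, which is consistent with the values listed in this thread (32 * 16 = 512 here, 64 * 32 = 2048 for AVX512).

```julia
# Hypothetical helpers; the names are illustrative, not VectorizationBase API.

# Number of elements of type T that fit in one register of the given size in bytes.
vector_width(register_size, ::Type{T}) where {T} = register_size ÷ sizeof(T)

# Assumed relation: total register file size in bytes.
register_capacity(register_size, register_count) = register_size * register_count

println(vector_width(64, Float64))   # Float64s per vector with 64-byte (AVX512) registers
println(vector_width(32, Float64))   # Float64s per vector with 32-byte (AVX2) registers
println(register_capacity(32, 16))   # Haswell-like settings
println(register_capacity(64, 32))   # AVX512 settings
```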

olaayeko commented 4 years ago

Yes it does crash, I get this error

error: inline asm error: This value type register class is not natively supported!


Failed while USING DifferentialEquations.jl in Julia 1.4.1 #623

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     32.553 ns (0.00% GC)
  median time:      32.592 ns (0.00% GC)
  mean time:        33.066 ns (0.00% GC)
  maximum time:     198.355 ns (0.00% GC)