Archspec support for `JULIA_CPU_TARGET` and `MARCH`

tgamblin commented 2 years ago

Related to #29592. Ping @vchuravy.

This request came out of some Spack users attempting to map Spack's targets to JULIA_CPU_TARGET and MARCH. See julia/package.py. They were trying to set JULIA_CPU_TARGET to x86_64_v3, but Julia didn't know the name.

Microarchitecture names are getting more complicated, e.g., the glibc developers have introduced x86_64 levels for gcc and clang. Mapping these to all the compilers you might build Julia with is tough. Knowing where optimized binaries can run is also complicated.

Spack builds binaries for specific microarchitectures, and it knows where those binaries can be used. To do this, it uses the archspec library to detect and reason about CPU targets and microarchitecture flags. Archspec provides some useful features:

- a standard set of target names (e.g., cascadelake, skylake_avx512, haswell, x86_64, x86_64_v3, etc.)
- logic for detecting these names from /proc/cpuinfo and other tools
- mappings from uarch target names to compiler names/versions and the appropriate ISA flags for those compilers
- a DAG describing the compatibility of different targets (e.g., it knows that cascadelake and cannonlake can run skylake_avx512 binaries but not each other's)
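
To make the compatibility DAG concrete, here is a minimal Python sketch. It does not use archspec itself: the `PARENTS` table is a hand-written toy subset that only mirrors the example in the last bullet, and the helper names `ancestors` and `can_run` are hypothetical, not archspec's API.

```python
# Toy compatibility DAG: each target lists the targets it directly extends.
# Hand-written for illustration; NOT copied from microarchitectures.json.
PARENTS = {
    "x86_64": [],
    "x86_64_v3": ["x86_64"],
    "haswell": ["x86_64_v3"],
    "skylake_avx512": ["haswell"],
    "cascadelake": ["skylake_avx512"],
    "cannonlake": ["skylake_avx512"],
}


def ancestors(target):
    """Return every target that `target` extends, directly or indirectly."""
    seen = set()
    stack = list(PARENTS[target])
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(PARENTS[name])
    return seen


def can_run(host, binary_target):
    """True if a binary built for `binary_target` can run on `host`."""
    return binary_target == host or binary_target in ancestors(host)


print(can_run("cascadelake", "skylake_avx512"))  # an ancestor of the host
print(can_run("cascadelake", "cannonlake"))      # a sibling, not an ancestor
```

Because siblings share an ancestor without being ancestors of each other, a skylake_avx512 binary is usable on both hosts while a cannonlake binary is not usable on cascadelake. That is the kind of question a binary distributor needs answered.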

Most of archspec's knowledge is in microarchitectures.json - look at it to get a feel for the type of data we're talking about here. It would be pretty simple to build Julia bindings on top of microarchitectures.json -- we've separated it out so that other languages can have bindings.

So, TL;DR: it would be great if Julia used archspec names. The immediate benefit for Julia would be support for things like the glibc x86_64 levels, and the ability to use a common set of targets across compilers. The longer-term benefit would be that things like Yggdrasil could use the compatibility logic to distribute optimized binaries for native (non-Julia) packages and understand their compatibility across machines.

tgamblin commented 2 years ago

@vchuravy feel free to edit this if you think the description could be improved.

vchuravy commented 2 years ago

Right now we encode and detect CPU names; cf. https://github.com/JuliaLang/julia/blob/master/src/processor_x86.cpp and https://github.com/JuliaLang/julia/blob/master/src/features_x86.h

One important note is that in the end we have to translate to whatever LLVM/LLC understands: https://github.com/JuliaLang/julia/blob/dd8d3c7ac205104ba4d160abdf59f4b9ae98d0ab/src/processor.cpp#L41-L42

On LLVM 12 (Julia 1.7+) we recognize:

x86-64         - Select the x86-64 processor.
x86-64-v2      - Select the x86-64-v2 processor.
x86-64-v3      - Select the x86-64-v3 processor.
x86-64-v4      - Select the x86-64-v4 processor.
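
As a rough illustration of the translation step, the sketch below maps an archspec-style name such as x86_64_v3 to the hyphenated spelling LLVM lists. Everything here is an assumption made for illustration: `to_llvm_cpu_name` is a hypothetical helper, `LLVM_CPU_NAMES` is a small subset of the names from the `julia -C help` output, and the zen2 to znver2 pair is an assumed example of a name that needs an explicit table entry.

```python
# A few CPU names taken from the `julia -C help` listing in this thread.
LLVM_CPU_NAMES = {
    "x86-64", "x86-64-v2", "x86-64-v3", "x86-64-v4",
    "haswell", "skylake-avx512", "cascadelake", "cannonlake", "znver2",
}

# Names where swapping underscores for hyphens is not enough.
# The zen2 -> znver2 pair is an assumption used for illustration.
SPECIAL_CASES = {"zen2": "znver2"}


def to_llvm_cpu_name(archspec_name):
    """Translate an archspec-style target name to an LLVM CPU name."""
    candidate = SPECIAL_CASES.get(
        archspec_name, archspec_name.replace("_", "-")
    )
    if candidate not in LLVM_CPU_NAMES:
        raise ValueError(f"no known LLVM CPU name for {archspec_name!r}")
    return candidate


print(to_llvm_cpu_name("x86_64_v3"))
print(to_llvm_cpu_name("skylake_avx512"))
```

A real mapping would need to be driven by data rather than string rewriting, since the set of names LLVM accepts changes between LLVM versions.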

➜  ~ julia -C help
Available CPUs for this target:

  alderlake      - Select the alderlake processor.
  amdfam10       - Select the amdfam10 processor.
  athlon         - Select the athlon processor.
  athlon-4       - Select the athlon-4 processor.
  athlon-fx      - Select the athlon-fx processor.
  athlon-mp      - Select the athlon-mp processor.
  athlon-tbird   - Select the athlon-tbird processor.
  athlon-xp      - Select the athlon-xp processor.
  athlon64       - Select the athlon64 processor.
  athlon64-sse3  - Select the athlon64-sse3 processor.
  atom           - Select the atom processor.
  barcelona      - Select the barcelona processor.
  bdver1         - Select the bdver1 processor.
  bdver2         - Select the bdver2 processor.
  bdver3         - Select the bdver3 processor.
  bdver4         - Select the bdver4 processor.
  bonnell        - Select the bonnell processor.
  broadwell      - Select the broadwell processor.
  btver1         - Select the btver1 processor.
  btver2         - Select the btver2 processor.
  c3             - Select the c3 processor.
  c3-2           - Select the c3-2 processor.
  cannonlake     - Select the cannonlake processor.
  cascadelake    - Select the cascadelake processor.
  cooperlake     - Select the cooperlake processor.
  core-avx-i     - Select the core-avx-i processor.
  core-avx2      - Select the core-avx2 processor.
  core2          - Select the core2 processor.
  corei7         - Select the corei7 processor.
  corei7-avx     - Select the corei7-avx processor.
  generic        - Select the generic processor.
  geode          - Select the geode processor.
  goldmont       - Select the goldmont processor.
  goldmont-plus  - Select the goldmont-plus processor.
  haswell        - Select the haswell processor.
  i386           - Select the i386 processor.
  i486           - Select the i486 processor.
  i586           - Select the i586 processor.
  i686           - Select the i686 processor.
  icelake-client - Select the icelake-client processor.
  icelake-server - Select the icelake-server processor.
  ivybridge      - Select the ivybridge processor.
  k6             - Select the k6 processor.
  k6-2           - Select the k6-2 processor.
  k6-3           - Select the k6-3 processor.
  k8             - Select the k8 processor.
  k8-sse3        - Select the k8-sse3 processor.
  knl            - Select the knl processor.
  knm            - Select the knm processor.
  lakemont       - Select the lakemont processor.
  nehalem        - Select the nehalem processor.
  nocona         - Select the nocona processor.
  opteron        - Select the opteron processor.
  opteron-sse3   - Select the opteron-sse3 processor.
  penryn         - Select the penryn processor.
  pentium        - Select the pentium processor.
  pentium-m      - Select the pentium-m processor.
  pentium-mmx    - Select the pentium-mmx processor.
  pentium2       - Select the pentium2 processor.
  pentium3       - Select the pentium3 processor.
  pentium3m      - Select the pentium3m processor.
  pentium4       - Select the pentium4 processor.
  pentium4m      - Select the pentium4m processor.
  pentiumpro     - Select the pentiumpro processor.
  prescott       - Select the prescott processor.
  sandybridge    - Select the sandybridge processor.
  sapphirerapids - Select the sapphirerapids processor.
  silvermont     - Select the silvermont processor.
  skx            - Select the skx processor.
  skylake        - Select the skylake processor.
  skylake-avx512 - Select the skylake-avx512 processor.
  slm            - Select the slm processor.
  tigerlake      - Select the tigerlake processor.
  tremont        - Select the tremont processor.
  westmere       - Select the westmere processor.
  winchip-c6     - Select the winchip-c6 processor.
  winchip2       - Select the winchip2 processor.
  x86-64         - Select the x86-64 processor.
  x86-64-v2      - Select the x86-64-v2 processor.
  x86-64-v3      - Select the x86-64-v3 processor.
  x86-64-v4      - Select the x86-64-v4 processor.
  yonah          - Select the yonah processor.
  znver1         - Select the znver1 processor.
  znver2         - Select the znver2 processor.
  znver3         - Select the znver3 processor.

Available features for this target:

  16bit-mode                  - 16-bit mode (i8086).
  32bit-mode                  - 32-bit mode (80386).
  3dnow                       - Enable 3DNow! instructions.
  3dnowa                      - Enable 3DNow! Athlon instructions.
  64bit                       - Support 64-bit instructions.
  64bit-mode                  - 64-bit mode (x86_64).
  adx                         - Support ADX instructions.
  aes                         - Enable AES instructions.
  amx-bf16                    - Support AMX-BF16 instructions.
  amx-int8                    - Support AMX-INT8 instructions.
  amx-tile                    - Support AMX-TILE instructions.
  avx                         - Enable AVX instructions.
  avx2                        - Enable AVX2 instructions.
  avx512bf16                  - Support bfloat16 floating point.
  avx512bitalg                - Enable AVX-512 Bit Algorithms.
  avx512bw                    - Enable AVX-512 Byte and Word Instructions.
  avx512cd                    - Enable AVX-512 Conflict Detection Instructions.
  avx512dq                    - Enable AVX-512 Doubleword and Quadword Instructions.
  avx512er                    - Enable AVX-512 Exponential and Reciprocal Instructions.
  avx512f                     - Enable AVX-512 instructions.
  avx512ifma                  - Enable AVX-512 Integer Fused Multiple-Add.
  avx512pf                    - Enable AVX-512 PreFetch Instructions.
  avx512vbmi                  - Enable AVX-512 Vector Byte Manipulation Instructions.
  avx512vbmi2                 - Enable AVX-512 further Vector Byte Manipulation Instructions.
  avx512vl                    - Enable AVX-512 Vector Length eXtensions.
  avx512vnni                  - Enable AVX-512 Vector Neural Network Instructions.
  avx512vp2intersect          - Enable AVX-512 vp2intersect.
  avx512vpopcntdq             - Enable AVX-512 Population Count Instructions.
  avxvnni                     - Support AVX_VNNI encoding.
  bmi                         - Support BMI instructions.
  bmi2                        - Support BMI2 instructions.
  branchfusion                - CMP/TEST can be fused with conditional branches.
  cldemote                    - Enable Cache Demote.
  clflushopt                  - Flush A Cache Line Optimized.
  clwb                        - Cache Line Write Back.
  clzero                      - Enable Cache Line Zero.
  cmov                        - Enable conditional move instructions.
  cx16                        - 64-bit with cmpxchg16b.
  cx8                         - Support CMPXCHG8B instructions.
  enqcmd                      - Has ENQCMD instructions.
  ermsb                       - REP MOVS/STOS are fast.
  f16c                        - Support 16-bit floating point conversion instructions.
  false-deps-lzcnt-tzcnt      - LZCNT/TZCNT have a false dependency on dest register.
  false-deps-popcnt           - POPCNT has a false dependency on dest register.
  fast-11bytenop              - Target can quickly decode up to 11 byte NOPs.
  fast-15bytenop              - Target can quickly decode up to 15 byte NOPs.
  fast-7bytenop               - Target can quickly decode up to 7 byte NOPs.
  fast-bextr                  - Indicates that the BEXTR instruction is implemented as a single uop with good throughput.
  fast-gather                 - Indicates if gather is reasonably fast.
  fast-hops                   - Prefer horizontal vector math instructions (haddp, phsub, etc.) over normal vector instructions with shuffles.
  fast-lzcnt                  - LZCNT instructions are as fast as most simple integer ops.
  fast-scalar-fsqrt           - Scalar SQRT is fast (disable Newton-Raphson).
  fast-scalar-shift-masks     - Prefer a left/right scalar logical shift pair over a shift+and pair.
  fast-shld-rotate            - SHLD can be used as a faster rotate.
  fast-variable-shuffle       - Shuffles with variable masks are fast.
  fast-vector-fsqrt           - Vector SQRT is fast (disable Newton-Raphson).
  fast-vector-shift-masks     - Prefer a left/right vector logical shift pair over a shift+and pair.
  fma                         - Enable three-operand fused multiple-add.
  fma4                        - Enable four-operand fused multiple-add.
  fsgsbase                    - Support FS/GS Base instructions.
  fsrm                        - REP MOVSB of short lengths is faster.
  fxsr                        - Support fxsave/fxrestore instructions.
  gfni                        - Enable Galois Field Arithmetic Instructions.
  hreset                      - Has hreset instruction.
  idivl-to-divb               - Use 8-bit divide for positive values less than 256.
  idivq-to-divl               - Use 32-bit divide for positive values less than 2^32.
  invpcid                     - Invalidate Process-Context Identifier.
  kl                          - Support Key Locker kl Instructions.
  lea-sp                      - Use LEA for adjusting the stack pointer.
  lea-uses-ag                 - LEA instruction needs inputs at AG stage.
  lvi-cfi                     - Prevent indirect calls/branches from using a memory operand, and precede all indirect calls/branches from a register with an LFENCE instruction to serialize control flow. Also decompose RET instructions into a POP+LFENCE+JMP sequence..
  lvi-load-hardening          - Insert LFENCE instructions to prevent data speculatively injected into loads from being used maliciously..
  lwp                         - Enable LWP instructions.
  lzcnt                       - Support LZCNT instruction.
  macrofusion                 - Various instructions can be fused with conditional branches.
  mmx                         - Enable MMX instructions.
  movbe                       - Support MOVBE instruction.
  movdir64b                   - Support movdir64b instruction.
  movdiri                     - Support movdiri instruction.
  mwaitx                      - Enable MONITORX/MWAITX timer functionality.
  nopl                        - Enable NOPL instruction.
  pad-short-functions         - Pad short functions.
  pclmul                      - Enable packed carry-less multiplication instructions.
  pconfig                     - platform configuration instruction.
  pku                         - Enable protection keys.
  popcnt                      - Support POPCNT instruction.
  prefer-128-bit              - Prefer 128-bit AVX instructions.
  prefer-256-bit              - Prefer 256-bit AVX instructions.
  prefer-mask-registers       - Prefer AVX512 mask registers over PTEST/MOVMSK.
  prefetchwt1                 - Prefetch with Intent to Write and T1 Hint.
  prfchw                      - Support PRFCHW instructions.
  ptwrite                     - Support ptwrite instruction.
  rdpid                       - Support RDPID instructions.
  rdrnd                       - Support RDRAND instruction.
  rdseed                      - Support RDSEED instruction.
  retpoline                   - Remove speculation of indirect branches from the generated code, either by avoiding them entirely or lowering them with a speculation blocking construct.
  retpoline-external-thunk    - When lowering an indirect call or branch using a `retpoline`, rely on the specified user provided thunk rather than emitting one ourselves. Only has effect when combined with some other retpoline feature.
  retpoline-indirect-branches - Remove speculation of indirect branches from the generated code.
  retpoline-indirect-calls    - Remove speculation of indirect calls from the generated code.
  rtm                         - Support RTM instructions.
  sahf                        - Support LAHF and SAHF instructions in 64-bit mode.
  serialize                   - Has serialize instruction.
  seses                       - Prevent speculative execution side channel timing attacks by inserting a speculation barrier before memory reads, memory writes, and conditional branches. Implies LVI Control Flow integrity..
  sgx                         - Enable Software Guard Extensions.
  sha                         - Enable SHA instructions.
  shstk                       - Support CET Shadow-Stack instructions.
  slow-3ops-lea               - LEA instruction with 3 ops or certain registers is slow.
  slow-incdec                 - INC and DEC instructions are slower than ADD and SUB.
  slow-lea                    - LEA instruction with certain arguments is slow.
  slow-pmaddwd                - PMADDWD is slower than PMULLD.
  slow-pmulld                 - PMULLD instruction is slow.
  slow-shld                   - SHLD instruction is slow.
  slow-two-mem-ops            - Two memory operand instructions are slow.
  slow-unaligned-mem-16       - Slow unaligned 16-byte memory access.
  slow-unaligned-mem-32       - Slow unaligned 32-byte memory access.
  soft-float                  - Use software floating point features.
  sse                         - Enable SSE instructions.
  sse-unaligned-mem           - Allow unaligned memory operands with SSE instructions.
  sse2                        - Enable SSE2 instructions.
  sse3                        - Enable SSE3 instructions.
  sse4.1                      - Enable SSE 4.1 instructions.
  sse4.2                      - Enable SSE 4.2 instructions.
  sse4a                       - Support SSE 4a instructions.
  ssse3                       - Enable SSSE3 instructions.
  tbm                         - Enable TBM instructions.
  tsxldtrk                    - Support TSXLDTRK instructions.
  uintr                       - Has UINTR Instructions.
  use-aa                      - Use alias analysis during codegen.
  use-glm-div-sqrt-costs      - Use Goldmont specific floating point div/sqrt costs.
  vaes                        - Promote selected AES instructions to AVX512/AVX registers.
  vpclmulqdq                  - Enable vpclmulqdq instructions.
  vzeroupper                  - Should insert vzeroupper instructions.
  waitpkg                     - Wait and pause enhancements.
  wbnoinvd                    - Write Back No Invalidate.
  widekl                      - Support Key Locker wide Instructions.
  x87                         - Enable X87 float instructions.
  xop                         - Enable XOP instructions.
  xsave                       - Support xsave instructions.
  xsavec                      - Support xsavec instructions.
  xsaveopt                    - Support xsaveopt instructions.
  xsaves                      - Support xsaves instructions.

JuliaLang / julia

Archspec support for `JULIA_CPU_TARGET` and `MARCH` #42073