Open LekKit opened 8 months ago
Implemented Zihintpause in 133e45f.
Implemented Zba (interpreter only for now) in 2a57cff
Implemented Zbs (interpreter only for now) in dcb7021
Implemented Zbb (interpreter only for now) in a6d4593
Why is implementing the vector extension considered "extremely complex"?
I am not even sure I understand it entirely after reading the spec a few times. It also seems to duplicate nearly every usual scalar instruction, but vectorized? Some FPU instructions are already complex to emulate, and now we need to copy-paste them and wrap them in a scalar loop.
Secondly, the hardest part for me is the JIT. But maybe I'll invent some better ways, or maybe the interpreter will be fast enough that guests with interpreted vectors won't be slower than ones using a JITed scalar loop.
Be aware that "extremely complex" != "I won't implement". It means it likely will take a lot of time and that it might be imperfect in regard to perf/quality for even longer.
Anyhow Bitmanip, Zicond, Zcb seem like very good candidates for something that is much easier to implement both in interpreter & JIT, and they are already supported in GCC very well.
And Vector is something like "we'll get there eventually" target rn.
If you want to work on it then no problem. I am simply focusing on other things.
It would help a lot if there existed some test suite for V instructions similar to how riscv-tests work.
Implemented Zbc (interpreter only, probably no JIT planned) in 08094c5
Now the entire Bitmanip family is supported in the interpreter
Implemented Zcb with partial JIT support in 1f41839
TODO: Test this properly
Implemented Zicond (interpreter only for now) in fc406a9
TODO: Test this properly
Overview on possible Zawrs implementation:
It is highly similar to x86 monitor/mwait instructions, however those are usually only usable in ring 0.
Some AMD chips (starting from Bulldozer?) have monitorx/mwaitx that are supposed to be accessible from userspace.
Quoting LLVM commit: "The presence of the MONITORX and MWAITX instructions is indicated by CPUID 8000_0001, ECX, bit 29." I am able to use those instructions on a Zen 1 machine.
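The CPUID check quoted above could be probed at runtime roughly like this. This is only a sketch: the helper names ecx_has_monitorx and host_has_monitorx are made up for illustration, and it assumes a GCC/Clang toolchain that ships <cpuid.h> with __get_cpuid(). On non-x86 hosts the probe simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header, assumed to be available */
#endif

/* CPUID leaf 0x80000001, ECX bit 29 signals MONITORX/MWAITX support. */
#define CPUID_EXT_ECX_MONITORX (1u << 29)

/* Hypothetical helper: decode the feature bit from a raw ECX value. */
static bool ecx_has_monitorx(uint32_t ecx)
{
    return (ecx & CPUID_EXT_ECX_MONITORX) != 0;
}

/* Hypothetical runtime probe; always false on non-x86 hosts. */
static bool host_has_monitorx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid() returns 0 when the requested leaf is not supported. */
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        return ecx_has_monitorx(ecx);
    }
#endif
    return false;
}
```

A Zawrs implementation would presumably fall back to something else (plain spinning, or the dirty-tracking approach) whenever the probe reports false.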
Current consumer Intel chips don't have this (I receive SIGILL on i5 6100U). I don't see any potential replacement except umwait, which is only available starting from somewhere around 12th gen CPUs.
It seems ARM64 has WFE, which is very similar to the WRS.NTO instruction that Zawrs provides on RISC-V. I don't know if it's usable in userland. The problem is that it only works for a tiny exclusive reservation sequence.
All things considered a better way to implement it would be to improve dirty memory tracking together with LR/SC handling. Or maybe not implement it at all if the implementation won't be efficient.
There are scalar crypto extensions that are extending atop Bitmanip. It might make sense to implement them; altho my initial evaluation of JITability is fairly low unless we just start inlining generic ALU lowering everywhere.
Implemented Zkr (entropy source CSR).
Overview on how new extensions could be JITed: godbolt link
TLDR:
- lea rd, [rs2 + rs1 * 2] variations on x86 (2 insns are needed for uw variants), compiles 1:1 on arm64
- andn/orn/xnor have no replacement on x86, but 1:1 replacement on arm64
- clz/ctz are workable but have some nuances
- max/min/maxu/minu have no replacement and compile to conditional moves (some codegen could be shared with Zicond)
- sext.b, zext.h, sext.h compile fairly well. We already JIT zext.h r0, r1 -> andi r0, r1, 0xFFFF in IR, since IR imm is i32, and a special case peephole optimization could be added toorbit or neg etc)
- bext has no replacement but easily lowers into srli r0, r1, r2; andi r0, r0, 1, same could be done with bexti
- bts/btr/btc), but only imm variants are 1:1 replaceable on arm64
- cmov
All of those instructions will have a generic IR lowering for less advanced backends, then x86_64 & arm64 backends will incorporate 1:1 variants to actually speed up code which uses those RISC-V extensions.
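To make the generic lowerings more concrete, here is a rough C sketch of the semantics a backend has to reproduce for a few of the instructions discussed above. The rv_* function names are made up for illustration and are not RVVM code. __builtin_clzll/__builtin_ctzll are GCC/Clang builtins, and the sext.b cast relies on the usual two's-complement narrowing those compilers perform.

```c
#include <stdint.h>

/* Zba: sh1add rd, rs1, rs2 = rs2 + (rs1 << 1); a single lea on x86_64. */
static inline uint64_t rv_sh1add(uint64_t rs1, uint64_t rs2) { return rs2 + (rs1 << 1); }

/* Zba: the .uw variant zero-extends the low 32 bits of rs1 first,
 * which is where the extra instruction on x86 comes from. */
static inline uint64_t rv_sh1add_uw(uint64_t rs1, uint64_t rs2)
{
    return rs2 + ((uint64_t)(uint32_t)rs1 << 1);
}

/* Zbb: logic ops with an inverted operand. */
static inline uint64_t rv_andn(uint64_t rs1, uint64_t rs2) { return rs1 & ~rs2; }
static inline uint64_t rv_orn(uint64_t rs1, uint64_t rs2)  { return rs1 | ~rs2; }
static inline uint64_t rv_xnor(uint64_t rs1, uint64_t rs2) { return ~(rs1 ^ rs2); }

/* Zbb: one of the clz/ctz nuances is that RISC-V defines the result for
 * zero input as 64, while the compiler builtins are undefined for zero. */
static inline uint64_t rv_clz(uint64_t x) { return x ? (uint64_t)__builtin_clzll(x) : 64; }
static inline uint64_t rv_ctz(uint64_t x) { return x ? (uint64_t)__builtin_ctzll(x) : 64; }

/* Zbb: min/max typically compile to compare + conditional move. */
static inline int64_t  rv_max(int64_t a, int64_t b)    { return a > b ? a : b; }
static inline int64_t  rv_min(int64_t a, int64_t b)    { return a < b ? a : b; }
static inline uint64_t rv_maxu(uint64_t a, uint64_t b) { return a > b ? a : b; }
static inline uint64_t rv_minu(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Zbb: sign/zero extension; zext.h is the andi with 0xFFFF mentioned above. */
static inline uint64_t rv_sext_b(uint64_t x) { return (uint64_t)(int64_t)(int8_t)x; }
static inline uint64_t rv_zext_h(uint64_t x) { return x & 0xFFFF; }

/* Zbs: bext = shift right by (rs2 & 63), then mask the lowest bit. */
static inline uint64_t rv_bext(uint64_t rs1, uint64_t rs2) { return (rs1 >> (rs2 & 63)) & 1; }
```

Each of these is small enough that a generic IR lowering could emit it as two or three existing ALU operations, with the x86_64 and arm64 backends substituting a native instruction where one exists.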
TODO: Consider scalar crypto instructions
Optimized orc.b instruction implementation (used in interpreter) in f760ee2. This could also be inlined in the JIT.
This instruction is heavily used to accelerate string operations, so having a fast implementation for it is important.
This patch already improves the Zbb-optimized Dhrystone score, even though it's interpreter-only so far.
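For readers unfamiliar with orc.b: it turns every nonzero byte of a register into 0xFF and leaves zero bytes as 0x00. Below is a portable bit-trick sketch of that, plus how a string routine might use it to find a NUL terminator. This is not the code from f760ee2; orc_b_swar and first_zero_byte are hypothetical names, __builtin_ctzll is a GCC/Clang builtin, and the byte index assumes the word was loaded on a little-endian host.

```c
#include <stdint.h>

/* Portable orc.b: each nonzero byte becomes 0xFF, zero bytes stay 0x00. */
static inline uint64_t orc_b_swar(uint64_t val)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    /* Adding 0x7F to the low 7 bits carries into bit 7 of a byte iff any of
     * them is set; OR-ing val also covers bytes that only have bit 7 set. */
    uint64_t nonzero = (((val & low7) + low7) | val) & ~low7;
    /* Spread each 0x80 marker into a full 0xFF byte. */
    return (nonzero >> 7) * 0xFF;
}

/* Index of the first zero byte in an 8-byte word, or 8 if there is none. */
static inline unsigned first_zero_byte(uint64_t word)
{
    uint64_t zero_mask = ~orc_b_swar(word);
    return zero_mask ? (unsigned)__builtin_ctzll(zero_mask) / 8 : 8;
}
```

For example, the bytes "Hello" followed by three NULs load as 0x0000006F6C6C6548 on a little-endian host, and first_zero_byte() reports index 5 for that word. A strlen built on Zbb does roughly this, one 8-byte word per iteration.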
Probably the best possible orc.b implementation for x86_64: 6a37001
A similar implementation is probably possible on ARM64 using the vceqq_u8 intrinsic
UPD: ARM64 neon implementation 3563cbf
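#include <arm_neon.h>
#include <stdint.h>

// orc.b: every nonzero byte of val becomes 0xFF, zero bytes stay 0x00
static inline uint64_t bit_orc_b(uint64_t val)
{
    uint8x8_t in = vreinterpret_u8_u64(vcreate_u64(val));
    // vtst_u8 sets all bits of a lane when (in & in) != 0
    uint8x8_t orc = vtst_u8(in, in);
    return vget_lane_u64(vreinterpret_u64_u8(orc), 0);
}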
bit_orc_b_neon:
fmov d0, x0
cmtst v0.8b, v0.8b, v0.8b
fmov x0, d0
ret
This issue is a place for discussions about newer unprivileged ISA extensions.
Each extension should be evaluated for the following qualities:
Extensible list of ratified ISA extensions beyond rv64imafdc
Bitmanip family
Zba - Bitmanip address generation
Zbb - Bitmanip basic bit-manipulation
Zbc - Bitmanip carry-less multiplication
Zbs - Bitmanip single-bit instructions
Floating-point family
Q - 128-bit IEEE754 floating point
Zfh - 16-bit IEEE754 floating point, Zfhmin - minimal subset of Zfh (half-precision loads, stores and conversions)
Zfa - Additional floating-point instructions
Vector family
V - Vector Operations
Zvk* - Vector Cryptography
Memory/atomics related extensions
Zawrs - Wait on memory reservation (Almost like a hint)
Zacas - Compare and Swap
Zicbom, Zicboz - Cache management (flush, invalidate, prefetch, zero cache block)
Hints
Zihintpause - Pause hint
Extra
Zicond - Integer Conditional operations
Zcb - Code size reduction extension