Open bjorn3 opened 1 year ago
I have a few notes if somebody else takes this on. (I suspect bjorn3 already knows all this, having written some of it.)
In Cranelift, the ABIMachineSpec trait implemented by backends has a gen_memcpy method. It's used during codegen for function calls by Caller::emit_copy_regs_to_buffer, so I suspect it's well-covered by tests. On x86 and aarch64 it always calls the library memcpy function, but since it's backend-specific it could do something else. (I didn't check the other backends, but they're probably the same.) Making this code available to frontends like cg-clif sounds good to me.
The cranelift_frontend::FunctionBuilder type has methods call_mem{cmp,cpy,move,set} for unconditionally emitting a library call. It also offers emit_small_{memory_compare,memory_copy,memset} to generate an unrolled CLIF loop for buffers smaller than a threshold, falling back to using call_* for larger buffers. I don't see any significant uses of any of these functions in Cranelift or Wasmtime, at any time in the git history. So they probably aren't well tuned and might not even work in general.
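To make the "unroll below a threshold, otherwise call the library" strategy concrete, here is a minimal standalone sketch in Rust. It is a simplified model, not Cranelift's actual code: CopyPlan, MAX_ACCESSES and plan_small_copy are hypothetical names, the limit of 4 accesses is an assumed value, and align is assumed to be a power of two.

```rust
/// What a frontend might emit for a copy whose size is known at compile time.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
enum CopyPlan {
    /// Unrolled load/store pairs, as (byte offset, access width in bytes).
    Unrolled(Vec<(u64, u64)>),
    /// Too many accesses would be needed: call the library routine instead.
    LibCall,
}

/// Assumed limit on how many load/store pairs are worth unrolling.
const MAX_ACCESSES: u64 = 4;

/// Plan a copy of `size` bytes where both pointers are `align`-aligned.
/// `align` is assumed to be a power of two.
fn plan_small_copy(size: u64, align: u64) -> CopyPlan {
    if size == 0 {
        return CopyPlan::Unrolled(Vec::new());
    }
    // Pick the widest power-of-two access (capped at 8 bytes) that both
    // divides the size and is permitted by the alignment.
    let mut width = 8;
    while width > 1 && (size % width != 0 || align % width != 0) {
        width /= 2;
    }
    let count = size / width;
    if count > MAX_ACCESSES {
        return CopyPlan::LibCall;
    }
    CopyPlan::Unrolled((0..count).map(|i| (i * width, width)).collect())
}
```

Under these assumptions a 16-byte, 8-aligned copy becomes two 8-byte accesses, while a 64-byte copy exceeds the limit and falls back to the library call. The cap at 8-byte accesses is also why a scheme like this never uses wider vector registers.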
I think you're right that the best way to expose these is with CLIF instructions, rather than cranelift_frontend methods. But I'm curious if anybody (like @cfallin?) has other suggestions.
I use emit_small_* once in cg_clif. It is definitely not well tuned though. It doesn't support copying 128-bit chunks using xmm registers, for example.
Yeah, I think it's reasonable to create CLIF opcodes for these. memcpy/memset are among the canonical primitives you usually get in a compiler's intrinsics; we don't have a separate notion of intrinsic calls, so new opcodes are the way forward. This would then give us one central implementation we could use where needed (e.g. for struct args on the stack, as noted above) and that we could optimize well.
In #5564, @Kixiron suggested that these cranelift-frontend functions ought to take separate MemFlags for each address operand. I think that's a good suggestion, but that we should do it with these new proposed instructions instead of putting any more development into the cranelift-frontend versions.
I agree, cranelift-native instructions for memcpy/memset (probably memcmp too, though that's not mentioned here?) are definitely a better approach overall, since they'd allow everything the current approach does and then some, like automatically lowering calls with known lengths to unrolled versions (essentially giving us emit_small_* for free, but also applicable to const-eval'd and dataflow scenarios).
Feature
Introduce instructions that behave like memcpy and memset. These should lower to rep movsb and rep stosb for memcpy and memset respectively on x86_64 with the ermsb feature. According to https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-a-profile-architecture-developments-2021 there is also an AArch64 extension for this, but I couldn't find more details.
Benefit
Using a native instruction reduces instruction cache bloat and may be faster in some cases. It may also help future optimizations recognize these operations as such, so that they can be optimized away in some cases. This is very important for the runtime performance of Rust code, as rustc generates a lot of unnecessary copies of locals.
Implementation
The instructions should take an immediate as the size argument and be lowered to native instructions if available, or to libcalls to an external memcpy or memset function if not.
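As a rough illustration of that lowering rule, the following Rust sketch models the choice a backend would make. None of these names exist in Cranelift: MemOp, Lowering and lower_mem_op are hypothetical, has_fast_string_ops stands in for a target feature check such as ermsb on x86_64, and treating a zero size as a no-op is my own assumption rather than part of the proposal.

```rust
/// The two proposed operations. (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum MemOp {
    Copy,
    Set,
}

/// What the backend would emit for one such instruction.
#[derive(Debug, PartialEq)]
enum Lowering {
    /// Nothing to emit (assumed handling of a zero immediate size).
    Nop,
    /// A native string instruction, identified by its mnemonic.
    Native(&'static str),
    /// A call to an external library function, identified by its symbol.
    LibCall(&'static str),
}

/// Choose a lowering from the immediate size and a target feature flag.
/// `has_fast_string_ops` models "x86_64 with the ermsb feature".
fn lower_mem_op(op: MemOp, size_imm: u64, has_fast_string_ops: bool) -> Lowering {
    if size_imm == 0 {
        return Lowering::Nop;
    }
    match (op, has_fast_string_ops) {
        (MemOp::Copy, true) => Lowering::Native("rep movsb"),
        (MemOp::Set, true) => Lowering::Native("rep stosb"),
        (MemOp::Copy, false) => Lowering::LibCall("memcpy"),
        (MemOp::Set, false) => Lowering::LibCall("memset"),
    }
}
```

Because the size is an immediate, a real implementation could also branch on it here, for example unrolling very small copies instead of using either path.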