Cranelift: Introduce memcpy and memset instructions

bjorn3 commented 1 year ago

Feature

Introduce instructions that behave like memcpy and memset. On x86_64 with the ERMSB feature, these should lower to rep movsb for memcpy and rep stosb for memset. According to https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-a-profile-architecture-developments-2021 there is also an AArch64 extension for this, but I couldn't find more details.

Benefit

Using a native instruction reduces instruction cache bloat and may be faster in some cases. Exposing these operations as instructions may also help future optimizations recognize them, so that some copies can be optimized away entirely. This is very important for the runtime performance of Rust code, as rustc generates a lot of unnecessary copies of locals.

Implementation

The instructions should take the size as an immediate argument and be lowered to native instructions where available, or to libcalls to an external memcpy or memset function otherwise.
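As a rough illustration of the lowering choice described above, here is a minimal sketch in plain Rust. Everything in it (`MemOp`, `Lowering`, `TargetFeatures`, `choose_lowering`) is a hypothetical name made up for this example and is not part of Cranelift's API; it only models the decision "native string instruction if the target has it, libcall otherwise".

```rust
/// Hypothetical: the two operations proposed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemOp {
    Copy,
    Set,
}

/// Hypothetical: what a backend could emit for one of these operations.
#[derive(Debug, PartialEq, Eq)]
enum Lowering {
    /// A native instruction sequence, named by its mnemonic.
    Native(&'static str),
    /// A call to an external library function, named by its symbol.
    Libcall(&'static str),
}

/// Hypothetical: the only target feature this sketch cares about.
struct TargetFeatures {
    has_ermsb: bool,
}

/// Pick the native x86_64 string instruction when ERMSB is available,
/// and fall back to the external memcpy/memset otherwise.
fn choose_lowering(op: MemOp, features: &TargetFeatures) -> Lowering {
    match (op, features.has_ermsb) {
        (MemOp::Copy, true) => Lowering::Native("rep movsb"),
        (MemOp::Set, true) => Lowering::Native("rep stosb"),
        (MemOp::Copy, false) => Lowering::Libcall("memcpy"),
        (MemOp::Set, false) => Lowering::Libcall("memset"),
    }
}
```

A real backend would presumably also look at the immediate size before committing to either option; that is left out here to keep the sketch small.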

jameysharp commented 1 year ago

I have a few notes if somebody else takes this on. (I suspect bjorn3 already knows all this, having written some of it.)

In Cranelift, the ABIMachineSpec trait implemented by backends has a gen_memcpy method. It's used during codegen for function calls by Caller::emit_copy_regs_to_buffer, so I suspect it's well-covered by tests. On x86 and aarch64 it always calls the library memcpy function, but since it's backend-specific it could do something else. (I didn't check the other backends, but they're probably the same.) Making this code available to frontends like cg-clif sounds good to me.

The cranelift_frontend::FunctionBuilder type has methods call_mem{cmp,cpy,move,set} for unconditionally emitting a library call. It also offers emit_small_{memory_compare,memory_copy,memset} to generate an unrolled CLIF loop for buffers smaller than a threshold, falling back to using call_* for larger buffers. I don't see any significant uses of any of these functions in Cranelift or Wasmtime, at any time in the git history. So they probably aren't well tuned and might not even work in general.
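To make the "unroll below a threshold, otherwise call the library" idea concrete, here is a small sketch in plain Rust. It does not use cranelift_frontend and is not a copy of the emit_small_* implementation; `CopyPlan`, `plan_copy`, the 8-byte cap, and the `max_accesses` parameter are all assumptions chosen for illustration.

```rust
/// Hypothetical: the outcome of planning a fixed-size copy.
#[derive(Debug, PartialEq, Eq)]
enum CopyPlan {
    /// Unrolled load/store pairs as (byte offset, access width in bytes).
    Unrolled(Vec<(u64, u64)>),
    /// Too many accesses would be needed; call the library memcpy instead.
    Libcall,
}

/// Plan a copy of `size` bytes between buffers aligned to `align` bytes.
/// Assumes `align` is a power of two and that accesses wider than 8 bytes
/// are not used (so no vector registers in this sketch).
fn plan_copy(size: u64, align: u64, max_accesses: u64) -> CopyPlan {
    assert!(align.is_power_of_two(), "alignment must be a power of two");
    if size == 0 {
        return CopyPlan::Unrolled(Vec::new());
    }
    // Largest power of two dividing `size`, capped by 8 bytes and by `align`.
    let width = (1u64 << size.trailing_zeros()).min(8).min(align);
    let count = size / width;
    if count > max_accesses {
        return CopyPlan::Libcall;
    }
    CopyPlan::Unrolled((0..count).map(|i| (i * width, width)).collect())
}
```

With a limit of 4 accesses, a 16-byte copy at 8-byte alignment becomes two 8-byte load/store pairs, while a 64-byte copy would need eight and so falls back to the libcall.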

I think you're right that the best way to expose these is with CLIF instructions, rather than cranelift_frontend methods. But I'm curious if anybody (like @cfallin?) has other suggestions.

bjorn3 commented 1 year ago

I use emit_small_* once in cg_clif. It is definitely not well tuned though. It doesn't support copying 128-bit chunks using xmm registers, for example.

cfallin commented 1 year ago

Yeah, I think it's reasonable to create CLIF opcodes for these. memcpy/memset are among the canonical primitives you usually get in a compiler's intrinsics; we don't have a separate notion of intrinsic calls, so new opcodes are the way forward. This would then give us one central implementation we could use where needed (e.g. for struct args on the stack, as noted above) and that we could optimize well.

jameysharp commented 1 year ago

In #5564, @Kixiron suggested that these cranelift-frontend functions ought to take separate MemFlags for each address operand. I think that's a good suggestion, but that we should do it with these new proposed instructions instead of putting any more development into the cranelift-frontend versions.
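One way to picture the "separate flags per address operand" suggestion is the sketch below, again in plain Rust. `Flags` is a hypothetical stand-in for Cranelift's MemFlags with just two illustrative bits, and `MemCopy`, `Access`, and `expand_bytewise` are invented for this example; the point is only that loads inherit the source flags and stores inherit the destination flags.

```rust
/// Hypothetical stand-in for MemFlags with two illustrative bits.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Flags {
    aligned: bool,
    readonly: bool,
}

/// Hypothetical: one memory access produced when expanding a copy.
#[derive(Debug, PartialEq, Eq)]
enum Access {
    Load { offset: u64, flags: Flags },
    Store { offset: u64, flags: Flags },
}

/// Hypothetical operands of a memcpy instruction that carries
/// independent flags for the destination and the source address.
struct MemCopy {
    size: u64,
    dst_flags: Flags,
    src_flags: Flags,
}

/// Expand the copy into byte-wise load/store pairs. Each load uses the
/// source flags and each store uses the destination flags.
fn expand_bytewise(op: &MemCopy) -> Vec<Access> {
    let mut out = Vec::new();
    for offset in 0..op.size {
        out.push(Access::Load { offset, flags: op.src_flags });
        out.push(Access::Store { offset, flags: op.dst_flags });
    }
    out
}
```

With a single shared flags value, a read-only source could not be marked as such without also mislabeling the destination, which is why the two fields are kept apart here.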

Kixiron commented 1 year ago

I agree, cranelift-native instructions for memcpy/memset (probably memcmp too, though that's not mentioned here?) are definitely an overall better approach, since they'd allow everything that the current approach does and then some, like automatically lowering calls with known lengths to unrolled versions (essentially giving us emit_small_* for free, but also applicable to const-eval'd and dataflow scenarios).
