bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0

Fine-grained recompilation for numerical constants #5448

Open JMS55 opened 1 year ago

JMS55 commented 1 year ago

Feature

Fine-grained recompilation for numerical constants.

Benefit

When developing, especially in creative contexts like games or visual demos, you often want to tweak a constant integer value and then rerun the application.

While live-reloading the value from within the application is superior, this is often hard to set up and application-specific. Furthermore, even if your application supports live-reloading, it may not support it for arbitrary numbers buried within an internal algorithm somewhere.

Having super quick compile times when only tweaking an integer would be a huge improvement to the development cycle.

Implementation

I'm not familiar with how cranelift works, but in the context of the recent stencil changes (https://www.reddit.com/r/rust/comments/zmqj0g/cranelift_progress_in_2022/), I imagine the compiler could detect that only an integer changed and, since fixed-size numeric constants take up the same space regardless of their value, patch the new value directly into the binary, or at least reduce the compile work.

bjorn3 commented 1 year ago

I believe you could create a global and then use the global_value instruction to get the integer value. This should create a relocation which you can fill with the concrete integer. While this is generally used for addresses to data objects, there is nothing that says you have to store an address in a global (I believe it does currently require it to be the same size as a pointer, but that shouldn't be too big of a deal). I don't think using stencils would work well as optimizations frequently fold and remove constants. You have to make all optimizations blind to the concrete value, which using the global_value instruction does.
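The relocation approach described above can be illustrated with a toy sketch (the names `Reloc`, `compile_with_slot`, and `patch` are hypothetical, not Cranelift API; this only models the idea of a pointer-sized slot filled in after compilation):

```python
import struct

# Toy model of the suggestion: treat a tweakable integer like a
# data-object address. Compilation emits a pointer-sized placeholder
# plus a relocation record; re-"linking" just overwrites those bytes.

class Reloc:
    def __init__(self, offset, size):
        self.offset = offset  # byte offset of the slot in the code buffer
        self.size = size      # pointer-sized slot, e.g. 8 bytes on 64-bit

def compile_with_slot():
    """Pretend-compiler: emits opaque code around an 8-byte constant slot."""
    code = bytearray(b"\x90" * 4)          # some instructions before the slot
    reloc = Reloc(offset=len(code), size=8)
    code += b"\x00" * 8                    # placeholder, filled in later
    code += b"\x90" * 4                    # some instructions after
    return code, reloc

def patch(code, reloc, value):
    """Fill the relocation with a concrete integer -- no recompilation."""
    code[reloc.offset:reloc.offset + reloc.size] = struct.pack("<q", value)

code, reloc = compile_with_slot()
patch(code, reloc, 42)      # first run
patch(code, reloc, 1337)    # tweak the constant: only 8 bytes change
```

The key property is that the surrounding code bytes never change, so optimizations that ran before the slot was filled stay valid for any value.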

cfallin commented 1 year ago

We could likely do this at least for vector constants, and in most cases floating point constants, without too much trouble, as the fixups can patch over the constant pool. Integers can be a little trickier as on some architectures we can encode integers only of a certain range given a certain instruction; in other words, isel choices may depend on specific integer values, and not the presence of some opaque constant.

This and the need for some optimizations to see constant values makes me wonder: maybe there's room for a "post-facto parameterization" kind of abstraction? What I mean by this is: start the compilation with the constant value fully visible; but produce machine code with some fixup record that, essentially, states "constant N from input instruction I was placed at this offset / these instruction bits and output is completely invariant to it [within range X..Y]". In other words, the compiler produces a more generic output than is actually required, and the caching then somehow takes this into account when generating the key.

Such an output could in principle encode, e.g., aarch64's 12-bit or 16-bit immediates: return 42 could be compiled to movz x0, #42; ret with an auxiliary record stating "invariant to constant value, as long as in range [0, 65536]; patch value into bits 0..16 of u32 at offset 0".
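That fixup record can be sketched concretely. In the real AArch64 encoding of movz, the imm16 field sits in bits 5..20 of the 32-bit instruction word; the sketch below (hypothetical helper names, not Cranelift code) encodes movz x0, #imm and applies the "patch within range" record:

```python
# Sketch of the "post-facto parameterization" fixup for AArch64:
# encode `movz x0, #imm` and patch a new immediate into the imm16
# field (bits 5..20), valid only while the value fits in 16 bits.

MOVZ_X0_BASE = 0xD2800000  # movz x0, #0 (64-bit, hw=0, Rd=x0)

def encode_movz_x0(imm16):
    assert 0 <= imm16 < (1 << 16), "outside patchable range: must recompile"
    return MOVZ_X0_BASE | (imm16 << 5)

def patch_movz(insn, new_imm16):
    """Apply the fixup record: clear the imm16 field, insert the new value."""
    assert 0 <= new_imm16 < (1 << 16), "outside patchable range: must recompile"
    return (insn & ~(0xFFFF << 5)) | (new_imm16 << 5)

insn = encode_movz_x0(42)        # compiled once with constant 42
insn = patch_movz(insn, 1000)    # tweak the constant without re-running isel
```

The range assertion is exactly the "[within range X..Y]" part of the record: a value of 70000 would fail it and force a real recompile, since isel might then pick a different instruction sequence.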

This needs more thought especially around how the cache keying works, but it doesn't seem completely impossible. Curious what @bnjbvr and others think!

jameysharp commented 1 year ago

I love @cfallin's idea of plumbing range information through to allow different architectures' encodings of constants to be updated in-place, but I think @bjorn3 is on to something here. I expect the compiler infrastructure for this is simpler if we can just tell the generated code to get constants from someplace that's easier to patch than instruction immediates.

A compile mode that puts all constants into the constant pool would generate slower code, but could let us treat constants as part of the stencil parameters in a backend-independent way. If you're already doing unoptimized builds because you care about rapid compile-test cycles, I think that would be a worthwhile tradeoff.
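A backend-independent cache key along these lines might look like the following toy sketch (textual pseudo-IR and the `stencil_key` helper are illustrative, not Cranelift's actual caching scheme): mask the constants out of the function body, hash what remains, and carry the constants as stencil parameters.

```python
import hashlib
import re

# Toy sketch of constants-as-stencil-parameters caching: strip integer
# literals out of a simplified textual function body, hash the remaining
# "stencil", and keep the constants as a separate parameter list. Two
# bodies differing only in constants then share one cached stencil.

INT = re.compile(r"\b\d+\b")

def stencil_key(body):
    consts = INT.findall(body)            # the tweakable parameters
    masked = INT.sub("<const>", body)     # constants made opaque
    key = hashlib.sha256(masked.encode()).hexdigest()
    return key, [int(c) for c in consts]

k1, p1 = stencil_key("x = iconst 42 ; y = iadd x, x")
k2, p2 = stencil_key("x = iconst 1000 ; y = iadd x, x")
# k1 == k2: same stencil, cache hit; only the parameter lists differ
```

On a key match, the cached machine code is reused and only the parameter list is fed into the constant pool, which is the cheap part of the rebuild.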

Neat idea, @JMS55!

PoignardAzur commented 1 year ago

> A compile mode that puts all constants into the constant pool would generate slower code, but could let us treat constants as part of the stencil parameters in a backend-independent way. If you're already doing unoptimized builds because you care about rapid compile-test cycles, I think that would be a worthwhile tradeoff.

I might be stating the obvious, but if you want to have slightly optimized code while keeping previous work, you could run a pass of peephole optimizations after saving/retrieving the stencil.

So presumably you would be saving, say, the VCode buffer after register allocation, and things like "load this constant in a register or an immediate" could be handled by the ISLE lowerings. It wouldn't quite produce optimized code (register pressure would be pessimistic, because the register allocator would have to assume that you need a register per constant), but it might be a sweet spot. I don't know what the profiles for cranelift look like, but I assume that if you skip egraph transformations and regalloc you're only using a fraction of the time per function.
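Such a peephole pass might look like this toy sketch (the pseudo-instructions ldc/add/addi and the 12-bit immediate limit are illustrative, not actual VCode): after substituting concrete values into the cached stencil, a cheap rewrite folds small constants back into immediates.

```python
# Toy peephole pass over a cached "stencil" whose constants were all
# loaded into registers. After plugging in the concrete values, small
# constants fold back into immediate operands; large ones keep a load.

def peephole(stencil, consts):
    out, pending = [], {}            # pending: reg -> constant value
    for op, *args in stencil:
        if op == "ldc":              # ("ldc", dst_reg, const_index)
            pending[args[0]] = consts[args[1]]
        elif op == "add" and args[2] in pending:
            imm = pending.pop(args[2])
            if 0 <= imm < (1 << 12):          # fits an immediate field
                out.append(("addi", args[0], args[1], imm))
            else:                             # too big: keep the register
                out.append(("ldc_imm", args[2], imm))
                out.append(("add", *args))
        else:
            out.append((op, *args))
    return out

stencil = [("ldc", "r2", 0), ("add", "r0", "r1", "r2")]
small = peephole(stencil, [42])       # folds to a single addi
large = peephole(stencil, [1 << 20])  # keeps the constant in a register
```

This runs in a single linear scan over the instruction list, which is why it could plausibly be much cheaper than re-running egraph transformations and regalloc.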

It might still be too costly if what you want is, say, hot reloading in a video game, but it might be interesting if you use the stencil to compile several monomorphizations of the same generic function.

bnjbvr commented 1 year ago

I like the idea of the range of instructions to patch to get a constant, but indeed, as @bjorn3 points out, it might be overkill for the use case. Thinking about the very specific case of Rust compiled to WASM, I think marking a global static as pub and no_mangle will expose it as a WASM global under the same name, so if there were a wasmtime API function that allowed changing the value of the global at runtime, then we could even use that while the application is running (through a host function), and not have to hot-reload at all. Of course, that implies an execution model where the application is aware that this global value may change at any time, so it might need to poll it, but it would be quite powerful!

And there's another argument about magnitudes: in our large application, the time it takes to hot-reload is almost 99% in the user-side instantiation (think: calling an init() entrypoint function in the wasm module) rather than recompilation + wasm instantiation, so this specific embedding wouldn't benefit from such an optimization.