dylibso / chicory

Native JVM WebAssembly runtime
Apache License 2.0
315 stars 27 forks source link

AOT Engine error: `Method too large` #405

Open bhelx opened 1 week ago

bhelx commented 1 week ago

This is a known error, but wanted to create this for tracking and discussion:

Method too large: com/dylibso/chicory/$gen/CompiledModule.func_153

I believe the strategy here is to split the method up after it we finish it (assuming it breaks the limit). There is some prior art for this that works with ASM, but is fairly old. I assume there are some modern open source optimizers or tools that work with ASM we can use here.

electrum commented 1 week ago

I'm working on a solution for this, with the goal to get Python running in AOT. The Python interpreter has a single huge method for its interpreter loop. Do you have other examples of WASM programs with large functions?

I looked at msplit, but couldn't get it to work, and I'm not sure it's the right approach anyway, as it's trying to split arbitrary bytecode that would never be generated by our AOT.

My current idea is to have a different code generation mode for large functions:

electrum commented 1 week ago

WASM locals are actually simpler than JVM locals, as they are declared up front for the function, whereas JVM locals can be reused with different types. Similarly, WASM blocks are more restrictive than JVM bytecode, as you can only jump to the start or end of a block, but in JVM bytecode there are fewer restrictions.

So while the idea of splitting bytecode seems nice from separation of concerns perspective, it seems more difficult to do, as you need to analyze it and rediscover the constraints from the original WASM. I'm exploring both approaches.

evacchi commented 1 week ago

I wonder if these old projects had a specific way to deal with this problem at codegen time

electrum commented 1 week ago

@evacchi Thanks for the pointers. Cibyl doesn't seem to handle large methods, which likely didn't matter as it was a developer environment targeting J2ME phones. You could adjust the original source code if the method size was a problem.

I remember reading the NestedVM paper many years ago. Reading it again, it's very interesting to see how similar it is to WebAssembly. They represent CPU registers as class fields and structure the code execution using the original binary addresses with an explicit program counter. The method size limit is worked around by using a trampoline transformation.

NestedVM is implementing an efficient interpreter for MIPS binaries, retaining their original semantics, whereas WASM is defined in a more abstract way and is intended to be efficiently transformed to the target environment. We do an idiomatic transformation to JVM bytecode rather than trying to implement an interpreter for the WASM abstract machine. This should allow the JVM JIT to generate code that is as efficient as possible.

WASM is a stack machine with a very similar structure to JVM bytecode, which allows us to translate in simple manner, instruction at a time, without needing to build an IR and full compiler. This is why we use static methods and pass Instance and Memory explicitly, rather than using a stateful class. Calling an instance method requires the receiver object to be pushed first, before the other parameters. But because we translate directly, by the time we get to the CALL instruction, the other parameters are already on the stack. So instead of an instance method, we use static methods with the special parameters at the end (and thus pushed last). This is why I'm leaning towards using an explicit state object rather than a stateful class.

bhelx commented 1 week ago

I'm working on a solution for this, with the goal to get Python running in AOT. The Python interpreter has a single huge method for its interpreter loop. Do you have other examples of WASM programs with large functions?

I tried this with my SNES emulator project. It maybe has a large function for the same reason as python. a large emulator loop. I'll upload the repo tomorrow if that's helpful.

My current idea is to have a different code generation mode for large functions:

That's an interesting idea. I hadn't really considered it because my assumption was you need to know the structure of the whole function before you can split it. It makes sense that msplit maybe doesn't work anymore. It's been a while since it's had an update.

electrum commented 1 week ago

I'll upload the repo tomorrow if that's helpful.

Please do!

andreaTP commented 1 week ago

Now that I read the evolution of this @enebo might be able to share his experience with Ruby here.

bhelx commented 1 week ago

@electrum Here is the SNES repo and an AOT branch https://github.com/dylibso/chicory-snes/tree/aot

you'll need to LEGALLY acquire an SNES rom to run it!

enebo commented 1 week ago

JRuby has had this issue but it has been rare enough that we never had enough motivation to address it. It only happened in a single case which was a project which used the state machine ragel to generate a Ruby parser. Our fallback when we cannot compile is to just interpret the code. So that library never ran very fast in JRuby. Prism came to the rescue so now that case has left the building.

We have also spent time examining bytecode to reduce how much we emit (and invokedynamic can help reduce overall bytecode). This is more of a mitigation than an actual solution.

JRuby's IR is a register-based IR and those registers end up being locals. For us to implement something like method splitting we have to deal with this by likely backing those into a primitive array and generating synthetic methods for the splits. Our IR design had not considered the need for it so I think we would do something different if we started again. Splitting is also desirable from the standpoint that uncommon paths can be split out so that the JVM can see a smaller common path method.

andreaTP commented 3 days ago

Thanks for sharing your experience @enebo !

Our fallback when we cannot compile is to just interpret the code.

Let see how this goes, but this is a safe net we might want to implement for multiple reasons.