asmjit / asmjit

Low-latency machine code generation
https://asmjit.com
zlib License
3.95k stars 500 forks source link

Tail call optimization #447

Open xmaton opened 4 weeks ago

xmaton commented 4 weeks ago

Issue Description

I have started a rough proof-of-concept of tail calls.

https://github.com/asmjit/asmjit/compare/master...xmaton:asmjit:tailcall

I have few design questions as briefly discussed on Gitter.

Tail call optimization

Musttail vs tailcall; i.e. if a tail call cannot be performed for whatever reasons, then does the compilation fail or is a normal call sequence produced? Personally, in my projects I need the musttail semantics.

Tail call eligibility If the callee has only register arguments (based on a calling convention; i.e. 4 on x86win64) then if does not matter what is a caller’s arity (as argument registers are scratch so we can freely overwrite them). If the callee has also stack arguments then its arity must be less or equal to caller’s arity. This way we can reuse caller’s argument stack slots. In theory both caller and callee should have the same return type. That is true from a high-level perspective but it does not affect low level generated code. Not sure if it needs to be checked. There is something about legacy x87 registers RACFGBuilder::onBeforeInvoke that seems to prevent tail calls. What’s that about? Obviously, immediately following InvokeNode there must be a ret node. How and when to test for it? Where exactly should the eligibility test be performed? When the tail call flag is set? Note that none of the eligibility tests have been coded yet.

The design choice is to keep InvokeNode and extend it with a flag. I have added a simple Boolean flag as InstOptions is full. Also extending NodeType does not seem right to me. The flag is provided after InvokeNode is constructed. Should it be moved earlier? The InvokeNode translates to a call instruction. In the tail call case it translates to a jump instruction. I simply call setId to set the correct instruction id. There must be a better way that works for all supported architectures.

EmitEpilog Today only a single epilog sequence is generated at the end of a function. The tail calls need their own epilog. I have changed the epilog interface and added a boolean flag that signals whether a ret instruction is generated. BaseRAPass::insertPrologEpilog searches for all tail calls in order to insert epilogs. Is the search acceptable or should the set of all tail calls be cached/precomputed? Is there an alternative design?

Testing To be added

ARM To be added

Example usage:

InvokeNode* i2;
cc.invoke(&i2, reinterpret_cast<uint64_t>(myExternalFunc), FuncSignature::build<int, int, int>());
i2->setTailCall();
i2->setId(x86::Inst::kIdJmp);
kobalicek commented 2 days ago

That's something I would consider in AsmJit, but it needs the following:

Are you willing to work on these?